Abstract
Objective
Weak supervision holds significant promise to improve clinical natural language processing by leveraging domain resources and expertise instead of large manually annotated datasets alone. Here, our objective is to evaluate a weak supervision approach to extract spatial information from radiology reports.
Materials and Methods
Our weak supervision approach is based on data programming that uses rules (or labeling functions) relying on domain-specific dictionaries and radiology language characteristics to generate weak labels. The labels correspond to different spatial relations that are critical to understanding radiology reports. These weak labels are then used to fine-tune a pretrained Bidirectional Encoder Representations from Transformers (BERT) model.
Results
Our weakly supervised BERT model provided satisfactory results in extracting spatial relations without manual annotations for training (spatial trigger F1: 72.89, relation F1: 52.47). When this model is further fine-tuned on manual annotations (relation F1: 68.76), performance surpasses the fully supervised state-of-the-art.
Discussion
To our knowledge, this is the first work to automatically create detailed weak labels corresponding to radiological information of clinical significance. Our data programming approach is (1) adaptable as the labeling functions can be updated with relatively little manual effort to incorporate more variations in radiology language reporting formats and (2) generalizable as these functions can be applied across multiple radiology subdomains in most cases.
Conclusions
We demonstrate that a weakly supervised model performs sufficiently well in identifying a variety of relations from radiology text without manual annotations for training, while exceeding state-of-the-art results when annotated data are available.
Keywords: information extraction, relation extraction, weak supervision, data programming, natural language processing, radiology report
INTRODUCTION
Fine-grained clinical information extracted from radiology reports can be used in various downstream applications including large-scale medical image annotation,1,2 cohort retrieval,3 and automated tracking of findings.4 However, creating a large enough labeled dataset is pivotal for efficiently utilizing advanced deep neural models. In this work, we adapt a weak supervision approach—data programming—to automatically create a large labeled dataset of radiology reports for spatial relation extraction. We further validate the advantage of applying a pretrained language model on the weakly labeled reports, achieving satisfactory performance without hand-labeled training data.
Most information extraction studies in the clinical domain rely exclusively on supervised learning. Such approaches depend on human-annotated reports; producing these annotations is not only tedious, time-consuming, and expensive but also requires extensive domain knowledge. Thus, it is difficult to achieve the scale of manual annotation needed for complex and fine-grained information. Moreover, manual annotations often do not generalize across institutions because of limited coverage of language variations and/or reporting styles. Meanwhile, deep learning-based supervised methods often demand large amounts of annotated training data to achieve substantial performance improvement over alternatives like rule-based methods. Recent research5–8 has proposed weak supervision to address these issues by programmatically creating very large training corpora with imperfect labels, which have the potential to outperform fully supervised approaches. Such approaches have been applied in clinical natural language processing (NLP) tasks such as medical entity classification,9 concept normalization,10 relation classification,11,12 and sentence-level classification13 for different use cases including patient-centered outcome assessment13 and medical device surveillance.12
One recently explored weak supervision method is data programming,5 which uses multiple supervision sources, including patterns, domain rules, and domain-specific ontologies, to automatically generate training data. Rules, or labeling functions (LFs), defined using domain knowledge from these sources are applied to unlabeled data, and their outputs are used to train a generative model that produces probabilistic training labels, obviating the laborious process of constructing human-annotated labels. Moreover, the LFs can be easily updated when applied to a different institution's data, to incorporate changes in the downstream use case, or to stay consistent with the latest domain knowledge based on feedback from subject matter experts. This reduces the manual effort of relabeling data under revised annotation guidelines.
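As a minimal, generic illustration of this paradigm (the label constants, attribute names, and cue lists below are our own, not the LFs used in this work), each LF votes on an unlabeled example or abstains, and the generative model later combines their noisy votes:

```python
from snorkel.labeling import labeling_function

FINDING, ABSTAIN = 1, -1

@labeling_function()
def lf_cue_word(x):
    # Vote FINDING when a common finding term appears in the sentence.
    return FINDING if "opacity" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_dictionary(x):
    # Vote using a small, illustrative domain dictionary.
    terms = {"consolidation", "effusion", "atelectasis"}
    return FINDING if any(t in x.text.lower() for t in terms) else ABSTAIN
```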
Inspired by this, we use data programming to automatically construct a weakly labeled corpus of radiology reports following our previously proposed Rad-SpatialNet representation schema.14 This schema covers spatial information related to both radiological findings and medical devices described in reports. In this schema, a spatial trigger (eg, a spatial preposition or verb) evokes a spatial relation between a finding/device and an anatomical location and forms the lexical unit of a spatial frame. Besides finding, device, and location, other clinically important contextual information associated with a spatial relation (eg, the position status of a device, a potential diagnosis) acts as participants of the frame and is referred to as spatial frame elements (FEs). A brief description of the spatial trigger and the FEs is presented in Table 1. Available ontologies along with domain language patterns can be easily harnessed through data programming to capture these spatial FEs. Thus, spatial relations in radiology reports are an excellent knowledge-heavy setting in which to evaluate the data programming paradigm.
Table 1.
Description of spatial trigger and spatial frame elements defined in the Rad-SpatialNet schema with example
| Item | Description | Example |
|---|---|---|
| Spatial trigger | The target words (lexical units) of the frames that indicate a spatial relationship exists between a radiological finding and an anatomical location or between a medical device and an anatomical location. These are usually prepositions, verbs, and prepositional verbs. | There is hazy opacity **of** the lung consistent with hyaline membrane disease. |
| Spatial frame elements | | |
| Figure | The object whose location is described through the spatial trigger (usually refers to findings and devices, and sometimes anatomical locations) | There is **hazy opacity** of the lung consistent with hyaline membrane disease. |
| Ground | The location of the figure described (usually an anatomical structure) | There is hazy opacity of **the lung** consistent with hyaline membrane disease. |
| Hedge | Uncertainty expressions used by radiologists | There is hazy opacity of the lung **consistent with** hyaline membrane disease. |
| Diagnosis | Clinical condition or disease associated with a finding, suggested as a differential diagnosis; usually appears after hedge-related terms | There is hazy opacity of the lung consistent with **hyaline membrane disease**. |
| Position status | Any position-related information, usually in the context of a medical device | A right PIC catheter **terminates** in the mid SVC. |
| Relative position | Terms used for describing the orientation of a radiological entity with respect to an anatomical location | The UV line tip is **high** in the right atrium. |
| Distance | The actual distance of the finding or device from the anatomical location | ETT tube is **1 cm** above the carina. |
| Reason | Clinical condition or disease that acts as the source of the finding | A subtle area of increased signal adjacent to the left lateral ventricle at the level of corona radiata could be due to a **small lacune**. |
| Associated process | Any process or activity associated with a spatial relation | During the **movement of the right foot**, there is a small area of cortical BOLD activation adjacent to the area of edema. |
Note: The bolded phrase in the example represents the text that acts as the trigger or the frame element.
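To make the schema concrete, a spatial frame can be viewed as a record keyed by its trigger; below is a minimal sketch under that reading (field names are ours, and some FEs are omitted for brevity):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SpatialFrame:
    trigger: str                      # e.g., "of" in "hazy opacity of the lung"
    figure: Optional[str] = None      # e.g., "hazy opacity"
    ground: List[str] = field(default_factory=list)  # e.g., ["lung"]
    hedge: Optional[str] = None       # e.g., "consistent with"
    diagnosis: Optional[str] = None   # e.g., "hyaline membrane disease"
    relative_position: Optional[str] = None
    distance: Optional[str] = None
```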
We use the Snorkel framework5 to automatically create the weak relation labels. Our LFs are based on the radiology-specific lexicon RadLex,15 regular expressions, language characteristics of report text, and other task-specific heuristics. The generated weak labels are used to train a transformer-based language model, Bidirectional Encoder Representations from Transformers (BERT).16 The overall weak supervision pipeline is shown in Figure 1. To assess our approach, we evaluate BERT fine-tuned using only weakly labeled reports. We also evaluate sequential fine-tuning performance (fine-tuning on weak labels followed by gold labels) and compare it with a fully supervised variant. The evaluations are performed on 400 radiology reports (comprising chest X-ray, brain MRI, and babygram reports) annotated in prior work.14 We demonstrate that a data programming-based weak supervision method produces promising results in extracting spatial information without using any hand-labeled reports for training, relying only on a small manually annotated dataset of 60 reports for developing LFs and tuning the generative and BERT models. The main implications of this work are:
Figure 1.
Overview of our weak supervision approach for radiology spatial information extraction. LF: labeling function; BIO: beginning, inside, outside; Rad-SpDev: development set; Rad-Sp400: held-out test set. x represents the number of unlabeled reports used for training the Label Model (varies from 500 to 50k).
Enabling detailed spatial information extraction from radiology reports without manually labeled training data.
Facilitating the generation of more comprehensive weak information labels at large scale to contribute to ongoing research in automatically labeling image datasets and other downstream use cases in radiology.
Holding potential to extend to different radiology subareas (eg, head CT, knee MRI) for extracting spatial information.
RELATED WORK
A large body of work has focused on open-domain NLP tasks using weak supervision. Many studies6–8,17–20 have proposed weak supervision methods for named entity recognition, and a few for other tasks such as natural language generation and understanding21 and discourse structure.22 Recently, there has been increasing work on automatically creating training data and adopting weakly supervised machine learning (ML) methods for NLP tasks in the clinical domain. Wang et al23 developed a rule-based NLP method to create labels for training ML models to classify clinical text. Cusick et al24 proposed a rule-based approach based on NegEx to generate training labels for identifying current suicidal ideation. Dong et al25 adapted a weak supervision approach with rules and contextualized representations to identify rare diseases. Shen et al26 adopted a similar weak supervision approach with BERT, using a rule-based NLP method to automatically generate training labels for classifying lifestyle factors for Alzheimer's disease. Banerjee et al13 proposed a weak supervision method in which domain-specific dictionaries are used to heuristically generate training labels to classify evidence of urinary incontinence and bowel dysfunction. Callahan et al12 employed data programming and trained LSTM networks for identifying pain-anatomy and implant-complication relations from clinical notes. Peterson et al11 trained a BERT model using weakly labeled data generated through data programming to classify relations (eg, severity, stage) that can be mapped to FHIR representations. Fries et al9 utilized data programming with BioBERT to classify medical entities and demonstrated comparable results to fully supervised models on multiple benchmark datasets. Very recently, Humbert-Droz et al27 developed a data programming-based weak supervision pipeline using Snorkel to generate weak labels for identifying the presence or absence of symptoms. Moreover, in the biomedical domain, multiple studies have used the Snorkel framework for extracting chemical reaction relationships from biomedical abstracts,28 biomedical relation extraction,29 and filtering biomedical research articles as relevant or nonrelevant for drug repurposing in cancer.30
We see that most studies in the clinical domain use a rule-based approach to create weak labels for binary classification tasks. In contrast, our work harnesses the advantages of data programming to generate weak labels for more complex information extraction involving multiple relations. Although Peterson et al11 identified different relations associated with a problem description, their approach assumes a single clinical problem per description, whereas our work extracts FEs related to multiple spatial triggers in a report sentence. Moreover, only a few studies23,31,32 so far have applied weak supervision to radiology report text, 2 of them for binary classification problems (classifying a report as normal vs abnormal31 and identifying hip fracture from reports23) and another32 to generate weak anatomical region labels that are subsequently used for training imaging models. Unlike these, this work identifies more detailed information from the reports and thereby generates richer weak labels.
MATERIALS AND METHODS
Data
We use 400 MIMIC-III33 radiology reports (chest X-ray: 136, brain MRI: 127, and babygram: 137; 358 containing spatial relations) manually labeled as per the Rad-SpatialNet schema14 to evaluate our weak supervision pipeline. We refer to this dataset as Rad-Sp400. We randomly select a total of 50k unlabeled MIMIC reports (with an almost equal proportion of the 3 report types) to train the generative model and subsequently the weakly supervised BERTLARGE-MIMIC model. We manually annotate an additional 60 randomly selected MIMIC reports (20 in each of the 3 categories) for building dictionaries, developing LFs, and hyperparameter tuning (referred to as Rad-SpDev). Given the time and expertise that manual labeling requires, and since these reports are used for development purposes, these additional annotations are performed by a physician with sound medical knowledge.
Automatically creating training data
We perform the following sequential steps to programmatically create the weak training labels. For this we employ data programming using the Snorkel framework.5
Generating candidates for spatially related entities
We identify all the candidate {spatial trigger, radiological entity} pairs where the radiological entity acts as a potential spatial FE with respect to the spatial trigger in a sentence. This involves the following steps:
Dictionary construction—We curate 2 dictionaries using preexisting knowledge sources. (1) Rad-Entitydict: This contains different types of radiological entities, such as radiological findings and anatomical entities, drawn from RadLex.15 For this, all RadLex terms under the parent RadLex classes Imaging observation (RID5), Clinical finding (RID34785), Anatomical entity (RID3), Medical device (RID29033), and Process (RID39128) are used to construct a comprehensive vocabulary of important radiological entities. Additionally, we add the terms in the Foundational Model of Anatomy (FMA) ontology34 to include more anatomical entities, and we add radiology-specific acronyms and their corresponding expansions from Radiopaedia.35 This results in a total of 153 944 terms. (2) Spatial-Triggerdict: This contains potential phrases denoting spatial relations between finding/device and location. We develop this by combining the spatial triggers annotated in Rad-SpDev with a set of hand-built trigger terms.
Expanding Rad-Entitydict—We manually add more finding- and anatomy-related terms to Rad-Entitydict that are encountered in Rad-SpDev but are not present in RadLex or FMA (eg, effacement, volume loss, caudate nucleus head). Based on patterns identified using Rad-SpDev, we further prepend or append phrases to a set of terms in Rad-Entitydict. Specifically, we prepend in 2 ways: (1) prepending phrases such as "area(s) of", "region(s) of", and "focus/foci of" to finding-specific terms (eg, hypodensity) and (2) prepending descriptors to certain finding and anatomical entities (eg, prepending "petechial" and "intraparenchymal" to the finding term hemorrhage and prepending combinations of 2 brain lobes such as "frontoparietal" to terms like lobe and cortex). Finally, we add the plural forms of all terms to the dictionary. To include terms related to the "RelativePosition", "PositionStatus", and "Hedge" FEs in Rad-Entitydict, we construct a list of terms using both Rad-SpDev and manually curated terms. Additionally, for "RelativePosition" and "Hedge", we add all RadLex terms under the RadLex classes Location descriptor (RID5817) and Certainty descriptor (RID29), respectively. This increases the total number of terms to 1 492 109.
Entity tagging—We apply an entity tagger that extracts all possible text spans in a sentence representing any spatial FE by exact matching against the terms in Rad-Entitydict. Among overlapping identified spans, the longest span is selected except for a few special cases. Such exceptions include anatomical entities (eg, inferior cerebellar peduncle) that contain location descriptor-related terms (inferior); in these cases we select both inferior and the main anatomical entity cerebellar peduncle as 2 candidate entity spans instead of selecting the longest span. Similarly, candidate spatial triggers are identified by dictionary matching against terms in Spatial-Triggerdict. For entities representing the "Distance" FE (eg, "2 mm"), regular expressions (inspired by Bozkurt et al36) are applied for matching. Besides "Distance", we also develop regular expressions for identifying certain anatomical entities representing segments (eg, "C5-C7", "T12").
Finally, all possible {trigger, entity} pairs are generated by combining each identified trigger with all identified radiological entities in a sentence. All these pairs form the candidate spatially related entities.
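A minimal sketch of this candidate generation step, assuming `rad_entity_dict` and `spatial_trigger_dict` hold the terms of Rad-Entitydict and Spatial-Triggerdict (the greedy longest-match tagging is simplified here and omits the overlap exceptions described above):

```python
import re
from itertools import product

# Simplified distance pattern; the actual regexes are more elaborate.
DISTANCE_RE = re.compile(r"\b\d+(\.\d+)?\s*(mm|cm)\b", re.I)

def tag_spans(tokens, term_dict, max_len=6):
    """Greedy longest-match dictionary tagging over a token list."""
    spans, i = [], 0
    while i < len(tokens):
        for j in range(min(len(tokens), i + max_len), i, -1):
            if " ".join(tokens[i:j]).lower() in term_dict:
                spans.append((i, j))
                i = j - 1  # jump past the matched span (i += 1 below)
                break
        i += 1
    return spans

def generate_candidates(tokens, rad_entity_dict, spatial_trigger_dict):
    entities = tag_spans(tokens, rad_entity_dict)
    triggers = tag_spans(tokens, spatial_trigger_dict)
    # Pair every identified trigger with every identified entity.
    return list(product(triggers, entities))
```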
Developing labeling functions
This step involves writing rules or LFs, considering both radiology report-specific language characteristics and domain lexicons, to vote on a {trigger, entity} pair's potential FE label (from a set of 9 labels corresponding to the 9 spatial FEs). Given a {trigger, entity} pair as input, each LF either assigns an FE label to the pair or abstains (ie, assigns no label). Most LFs combine dictionary matching with task-specific heuristics. Dictionaries used in LFs are constructed in a similar manner as for the entity tagging step (described in the above section) for broad categories such as finding, device, and anatomy. Matching against terms in specific dictionaries constrains the semantic type of an entity, whereas the task-specific heuristics capture prominent cues, based on the linguistic features of a sentence, to identify the potential spatial role of an entity with respect to a spatial trigger. Examples of heuristics used in LFs to vote a candidate {trigger, entity} pair as the "Ground" and "Diagnosis" FEs are illustrated in Table 2.
Table 2.
Heuristics used in 2 sample LFs to label a {trigger (SpTrg), entity (RadEnt)} pair with Ground and Diagnosis frame element relations
| FE (features used in LF) | Example sentence | Heuristics |
|---|---|---|
| Ground (relative position of RadEnt with respect to SpTrg; semantic type of RadEnt) | The *lungs* **demonstrate** hazy bilateral opacity of hyaline membrane disease. | … |
| | There are scattered T2 high signal intensity foci **in** the *periventricular white matter* and *centrum semi-ovale* consistent with microvascular angiopathy. | RadEnt lies to the right of SpTrg AND … |
| Diagnosis | A patchy area of consolidation is seen **within** the right lower lobe concerning for *pneumonia*. | RadEnt is finding AND text span to the right of RadEnt is "." AND … |
| | There is stable opacity **in** the right lower lobe as well as a retrocardiac opacity, these are likely related to *atelactases* versus *pneumonia*. | left window = [represent, suggest, indicat, consistent with]; right window = [ruled out, excluded, vs, versus] |

Note: SpTrgs are bolded and FEs are italicized.
RadEnt: radiological entity; SpTrg: spatial trigger; FE: frame element; LF: labeling function.
Besides relying on domain ontologies (RadLex, FMA) through dictionary matching, the task-specific heuristics are necessary for a complicated task like this, where identifying the spatial role (or FE) of a radiological entity is context-dependent (eg, a finding term could be either "Figure" or "Diagnosis" depending on what role it plays in a sentence). This becomes more challenging when there are multiple spatial triggers in a sentence and the same entity is associated with different triggers in different spatial roles (eg, an anatomical entity could be both a "Figure" and a "Ground"). Our LFs handle this complexity by considering the position of an entity with respect to a specific spatial trigger in a sentence. We manually examine the sentences in Rad-SpDev to build the LFs, and the LFs are developed and refined iteratively by evaluating on the annotated Rad-SpDev set. We develop 19 LFs in total. The heuristics used in all LFs are included in Supplementary Table S1.
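To make this concrete, here is a sketch of what one such LF might look like in Snorkel, patterned on the Ground heuristics in Table 2 (the label constants, candidate fields, and tiny anatomy dictionary are our illustrative assumptions):

```python
from snorkel.labeling import labeling_function

GROUND, ABSTAIN = 1, -1
anatomy_dict = {"lung", "carina", "right atrium"}  # tiny illustrative sample

@labeling_function()
def lf_ground_anatomy_after_trigger(x):
    # Vote "Ground" when the candidate entity is an anatomical term that
    # lies to the right of the spatial trigger in the sentence.
    if x.entity_text.lower() in anatomy_dict and x.entity_start > x.trigger_end:
        return GROUND
    return ABSTAIN
```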
Applying the generative model and creating weakly labeled data
We use Snorkel's generative model (known as the Label Model), which combines the noisy label outputs from all LFs for a {trigger, entity} pair, estimating the unobserved accuracy of each LF, to assign a single probabilistic label to that pair. This generates probabilistic training labels (or "weak" labels) for all candidate pairs extracted from the unlabeled report sentences. Since our task is to identify the FEs at the level of each spatial trigger in a sentence, we create a separate instance for each trigger by combining all the RadEnts from the {trigger, entity} pairs for which an FE label has been predicted. These modified trigger-level instances are used for further processing.
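A sketch of this step using Snorkel's label model API, with the hyperparameter values reported under Evaluation (`lfs` and `df_candidates` are placeholders for our 19 LFs and the candidate-pair dataframe):

```python
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

applier = PandasLFApplier(lfs=lfs)       # the LFs described above
L_train = applier.apply(df_candidates)   # LF votes per {trigger, entity} pair

# Fit the generative Label Model on the LF vote matrix and produce one
# probabilistic FE label (or abstention) per candidate pair.
label_model = LabelModel(cardinality=9)  # 9 spatial FE labels
label_model.fit(L_train, n_epochs=100, lr=1e-4, l2=0.01, prec_init=0.7)
weak_labels = label_model.predict(L_train, tie_break_policy="abstain")
```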
Filtering weak labels
We apply 2 additional constraints to filter the weak labels generated by the Label Model and produce a sizable, higher-quality weakly labeled training set. First, since "Figure" and "Ground" constitute the 2 fundamental FEs of a spatial frame, we check for the presence of both these FEs among the weak FE label predictions in the trigger-level instances from the above step. Second, we check for the presence of certain frequent phrases surrounding common spatial triggers such as of, with, and in, and ensure that no such phrase is found to the left or right of the trigger. This rule eliminates false positive (FP) triggers (the frequent phrase sets are taken from previous work37). Only if both constraints are satisfied do we include that trigger-level instance in our final weakly labeled training set.
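A minimal sketch of these 2 filtering constraints (the trigger-level instance structure and the `freq_phrases` lookup, which would be built from prior work,37 are our assumptions):

```python
def keep_instance(instance, freq_phrases):
    # Constraint 1: both Figure and Ground FEs must have been predicted.
    fe_labels = {fe.label for fe in instance.elements}
    if not {"Figure", "Ground"} <= fe_labels:
        return False
    # Constraint 2: no frequent false-positive phrase around the trigger.
    context = instance.left_context + " " + instance.right_context
    return not any(p in context
                   for p in freq_phrases.get(instance.trigger_text, []))
```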
Weakly supervised model—BERT
We use the final weakly labeled training data to fine-tune BERTLARGE-MIMIC (pretrained on MIMIC notes for 300K steps38). We formulate this as a sequence labeling task in which we extract spatial FEs given a spatial trigger in a sentence. For this, we convert each trigger-level instance produced by the Label Model to a sequence of Beginning-B, Inside-I, and Outside-O (BIO) labels over the words in the sentence. The process of filtering the weak labels and transforming them into BIO tag sequences is shown in Figure 2. Each sentence, represented using the standard input sequence format ([CLS] sentence [SEP]), is then fed into BERT. As there can be multiple triggers in a sentence, we mask the words corresponding to a spatial trigger with an identifier $sptrg$ to encode the position of a specific trigger. The contextual representations from the BERT encoder output are fed into a linear classification layer to predict a label for each word.
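A sketch of the input construction, assuming word-level tokens, a (start, end) trigger span, and a Hugging Face fast tokenizer; `bert-large-cased` stands in for the BERTLARGE-MIMIC checkpoint, which is not publicly hosted under that name:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")
tokenizer.add_tokens(["$sptrg$"])  # model embeddings must be resized to match

def build_input(words, trigger_span, max_len=128):
    # Replace the target trigger's words with the $sptrg$ identifier so the
    # model knows which trigger's frame elements to label in this pass.
    start, end = trigger_span
    masked = [w if not (start <= i < end) else "$sptrg$"
              for i, w in enumerate(words)]
    return tokenizer(masked, is_split_into_words=True, truncation=True,
                     max_length=max_len, return_tensors="pt")
```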
Figure 2.
Filtering weak labels and converting the labels to feed into the BERT model. All the candidate spatial triggers are shown in bold.
Experimental setups
Using no gold data
We use varying amounts of unlabeled MIMIC-III radiology reports to generate weak spatial labels and then use these labels to fine-tune the BERTLARGE-MIMIC model. Specifically, we use 10%, 25%, 50%, 75%, and 100% of the 50k selected MIMIC reports. We evaluate each variant on the 358 gold annotated reports in Rad-Sp400 that contain spatial relations.
Sequential fine-tuning
We perform sequential fine-tuning, where we first fine-tune BERT on weakly labeled reports and then on gold annotated reports (a similar approach proved effective in recent work by Smit et al39 for improving on automatic rule-based labelers like CheXpert40). Specifically, we take the best-performing of the 5 BERTLARGE-MIMIC variants trained on only weak labels to initialize the model parameters and further fine-tune on gold reports. A total of 358 gold annotated reports are used in this experiment. We report the average F1 measures of a 10-fold cross-validation in which 80% of the 358 reports are used for fine-tuning, 10% for validation, and 10% for testing. We use the results (average F1 using predicted triggers) of a fully supervised BERTLARGE-MIMIC sequence labeling model reported in previous work14 for direct comparison. Note that here we use the same 10-fold split settings as in the fully supervised experiment of our previous work.
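A sketch of the sequential fine-tuning setup with the Hugging Face `Trainer` (the checkpoint path and the `gold_train`/`gold_dev` token-classification datasets are placeholders):

```python
from transformers import BertForTokenClassification, Trainer, TrainingArguments

num_bio_labels = 19  # B- and I- tags for the 9 FEs, plus O

# Initialize from the checkpoint already fine-tuned on weak labels,
# then continue fine-tuning on the gold-annotated reports.
model = BertForTokenClassification.from_pretrained(
    "checkpoints/weak-37.5k", num_labels=num_bio_labels)

args = TrainingArguments(output_dir="checkpoints/sequential",
                         learning_rate=2e-5, num_train_epochs=4)
trainer = Trainer(model=model, args=args,
                  train_dataset=gold_train, eval_dataset=gold_dev)
trainer.train()
```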
Using varying gold data
We also experiment with increasing amounts of gold annotated reports for sequential fine-tuning. As in the sequential fine-tuning experiment, the best model trained on weak labels is further fine-tuned on the gold reports. We hold out 10% of the 358 annotated reports each for testing and development and use the remaining 80% (ie, 288 reports) for training. Specifically, we use 10%, 25%, 50%, 75%, and 100% of the 288 gold reports for sequential fine-tuning.
Evaluation
Since our approach involves a pipeline of 2 models (spatial triggers and FEs), to evaluate our weakly supervised BERT model's end-to-end performance, we consider the spatial triggers predicted by the Label Model on the test data (Rad-Sp400). We apply the 2 constraints (see the "Filtering weak labels" section) to form the model input to BERT. Here, we apply the first constraint with a slight modification (ie, we retain the trigger-level instances for which a Ground FE is predicted by the Label Model) in order to increase the recall of spatial triggers, as the fine-tuning task uses the trigger positions to predict the associated FEs. We take into account the precision loss from FEs predicted for FP spatial triggers and the recall loss from FEs missed for false negative (FN) triggers. These FP and FN triggers are based on the predictions of the Label Model on Rad-Sp400.
The hyperparameters for both the Label Model and BERTLARGE-MIMIC are tuned on the 60 annotated reports in Rad-SpDev through grid search. For the Label Model, the number of training epochs, learning rate, L2 regularization, and precision initialization are set to 100, 0.0001, 0.01, and 0.7, respectively. For BERTLARGE-MIMIC fine-tuning, the maximum sequence length, learning rate, and number of training epochs are set to 128, 2e−5, and 4, respectively, and we use the cased version of the model.
RESULTS
The coverage of the candidate generation phase is 78.3%, or in other words, our candidate generator identifies 78.3% of the total gold {spatial trigger, radiological entity} pairs from Rad-Sp400. The performance measures of the weakly supervised BERTLARGE-MIMIC model using increasing amounts of weakly labeled reports are presented in Table 3. We see that the best overall F1 on Rad-Sp400 is obtained when 37.5k weakly labeled reports are used. The precision, recall, and F1 measure for identifying the spatial triggers on Rad-Sp400 are 70.84%, 75.07%, and 72.89, respectively. We additionally present the results for 500 and 1k weakly labeled reports (in Table 3) to highlight the performance trend even when less than 10% of reports are used in fine-tuning.
Table 3.
F1 measures of the weakly supervised BERTLARGE-MIMIC model on Rad-Sp400, by number of weakly labeled reports used for training
| Frame elements | 500 | 1k | 5k | 12.5k | 25k | 37.5k | 50k |
|---|---|---|---|---|---|---|---|
| Ground | 54.16 | 61.90 | 62.62 | 63.25 | 63.53 | 63.07 | 63.06 |
| Figure | 32.56 | 41.12 | 45.47 | 45.18 | 44.66 | 45.93 | 45.37 |
| Relative position | 50.36 | 50.15 | 48.34 | 48.88 | 49.13 | 49.22 | 49.17 |
| Hedge | 42.75 | 41.34 | 53.65 | 50.51 | 51.84 | 55.00 | 56.38 |
| Diagnosis | 20.77 | 25.05 | 29.62 | 31.23 | 33.55 | 33.26 | 34.12 |
| Position status | 43.75 | 45.08 | 42.66 | 38.32 | 39.88 | 40.92 | 41.04 |
| Distance | 70.48 | 64.46 | 65.49 | 63.79 | 66.67 | 66.67 | 67.31 |
| Reason | 0 | 0 | 19.51 | 27.69 | 28.12 | 30.77 | 31.43 |
| Overall | 43.40 | 49.01 | 51.64 | 51.64 | 51.92 | 52.47 | 52.43 |
Note: All the values for the “Associated Process” frame element are zero.
Table 4 shows the sequential fine-tuning results when the model checkpoint corresponding to 37.5k weakly labeled reports is used for further fine-tuning on gold reports. We note that sequential fine-tuning improves the performance for all spatial FEs except the "Relative Position" FE when compared to the fully supervised variant. We observe that this improvement is higher for the less frequent FEs (as can be seen from the Count column in Table 4). To demonstrate the effect of increasing the size of gold annotated data on the model's performance, we present the sequential fine-tuning results with varying amounts of gold reports in Table 5, evaluated on 35 randomly selected annotated test reports. As expected, the performance of the BERTLARGE-MIMIC model improves as the gold data size is increased; however, the highest or a comparable overall F1 measure is achieved using 75% (ie, 213 reports) of the total annotated reports available for training. Also note that the results for the "Reason" and "Associated Process" FEs are zero in many cases, as these occur very infrequently in the dataset.
Table 4.
Average F1 measures of BERTLARGE-MIMIC model over 10-fold CV through sequential fine-tuning (using the model checkpoint obtained after fine-tuning on weak labels of 37.5k reports)
| Frame elements | Precision (%) | Recall (%) | F1 | FS–F1 | Count |
|---|---|---|---|---|---|
| Ground | 74.64 | 69.33 | 71.74 | 71.51 | 1537 |
| Figure | 71.25 | 64.90 | 67.69 | 65.12 | 1491 |
| Relative position | 65.99 | 65.18 | 64.96 | 66.33 | 398 |
| Hedge | 67.60 | 61.87 | 64.04 | 57.82 | 249 |
| Diagnosis | 58.12 | 56.14 | 56.54 | 50.76 | 190 |
| Position status | 66.18 | 71.22 | 68.05 | 60.37 | 167 |
| Distance | 88.00 | 90.83 | 88.36 | 88.05 | 45 |
| Reason | 53.33 | 51.29 | 45.19 | 0 | 33 |
| Associated process | 60.00 | 45.00 | 50.00 | 0 | 21 |
| Overall | 71.26 | 66.71 | 68.76 | 66.25 | 3131 |
FS–F1: fully supervised F1 measures.
Table 5.
Sequential fine-tuning results (F1 measures) of BERTLARGE-MIMIC (using the model checkpoint obtained after fine-tuning on weak labels of 37.5k reports) on 35 randomly selected test reports, by percentage of gold reports used for fine-tuning
| Frame elements | 10% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|
| Ground | 64.52 | 68.09 | 66.32 | 66.31 | 67.38 |
| Figure | 51.72 | 54.75 | 56.82 | 59.55 | 57.78 |
| Relative position | 53.66 | 60.47 | 58.54 | 55.00 | 55.81 |
| Hedge | 55.00 | 68.18 | 76.92 | 76.92 | 73.17 |
| Diagnosis | 48.48 | 54.55 | 53.33 | 70.97 | 62.50 |
| Position status | 54.55 | 61.54 | 61.54 | 57.14 | 66.67 |
| Distance | 40.00 | 33.33 | 66.67 | 57.14 | 75.00 |
| Associated process | 0 | 0 | 100 | 100 | 100 |
| Overall | 56.05 | 60.43 | 62.03 | 63.62 | 63.01 |
Note: All the values for the “Reason” frame element are zero. One hundred percent corresponds to 288 gold reports.
DISCUSSION
We develop a weakly supervised pipeline based on the data programming technique to extract spatial relations from radiology text. This is an early attempt to automatically create weak labels in the radiology domain covering detailed spatial information of clinical importance that could be used for various clinical informatics applications, unlike the 3 previous studies that employed weak supervision for simpler binary classification and anatomy tagging.23,31,32 The results in Table 3 demonstrate that our proposed pipeline performs reasonably well given the complexity of the information extracted and without any reliance on the time-consuming and expensive manual labeling process. Although these results do not surpass the fully supervised model's performance, they show that a variety of important clinical information can be identified without using any hand-labeled training data. Further, our findings on sequential fine-tuning (Table 4) reflect the advantage of taking a MIMIC-pretrained model, fine-tuning it first on domain- and task-specific data (weakly labeled radiology reports) and then on gold annotated data, instead of fine-tuning on the gold data alone. This is in line with the findings demonstrated for the CheXbert model, where combining the annotations of a rule-based labeler with expert annotations resulted in better performance.39 Additionally, Table 4 indicates that leveraging weak supervision boosts performance in identifying rare FEs.
In general, there is a major trend in NLP (both open-domain and clinical) toward models that work with small amounts of labeled data. Few-shot learning (eg, with prompts) is one option and weak supervision is another, each with its own pros and cons (weak supervision allows for more direct injection of domain knowledge). It is scientifically important to flesh out these options, since different methods may be more appropriate under different conditions.
Our weak supervision approach provides sufficient flexibility, as the LFs can be fairly easily modified to accommodate changes in reporting style and reports from other institutions.
For example, let us consider the following 2 reporting styles:
There are calcified atherosclerotic changes in the brain parenchyma.
Brain parenchyma: There are calcified atherosclerotic changes.
While the first style is more commonly encountered in radiology reports to describe findings (eg, atherosclerotic changes) and their locations (eg, brain parenchyma), some institutions or radiologists may prefer the second format (ie, location: findings). Such changes in reporting format may necessitate some updates to the LFs, which can be incorporated as and when needed, as shown in the sketch after this paragraph. Moreover, as new FEs are added to the representation schema for different downstream use cases, we can add LFs to cover them. Additionally, as the LFs are developed using general language characteristics of radiology text and the finding/anatomy dictionaries are primarily based on standard medical ontologies (eg, FMA), they are largely generalizable, that is, they hold potential for identifying spatial FEs across imaging modalities and human anatomies beyond the 3 report types used in this work. For instance, our LFs should be able to identify spatial information from a pelvic ultrasound report sentence as well (eg, identifying the finding leiomyomas and its location uterus from the sentence "Multiple leiomyomas in the uterus."). This is not explored in this work; however, we plan a thorough analysis of our weak supervision pipeline's performance when applied to reports from multiple institutions, modalities, and anatomies in the future.
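For instance, a hedged sketch of one additional LF for the second style, voting Ground for an anatomical phrase that prefixes the sentence before a colon (the candidate fields follow the earlier sketches, and the dictionary is illustrative):

```python
from snorkel.labeling import labeling_function

GROUND, ABSTAIN = 1, -1
anatomy_dict = {"brain parenchyma", "uterus"}  # illustrative sample

@labeling_function()
def lf_ground_colon_prefix(x):
    # Vote Ground when the candidate entity is an anatomical term appearing
    # in a sentence-initial "location:" prefix, e.g. "Brain parenchyma: ...".
    head, sep, _ = x.sentence.partition(":")
    if sep and x.entity_text.lower() in anatomy_dict \
            and x.entity_text.lower() in head.lower():
        return GROUND
    return ABSTAIN
```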
Although the dictionaries we developed in this work are reasonably comprehensive (at least for the 3 radiology subareas considered in this article), we intend to further improve the coverage of the candidate generation step, which generates the candidate {spatial trigger, radiological entity} pairs, by expanding the dictionaries to detect more variations of radiological entities. The coverage is mainly affected by misspellings and by the dictionaries lacking less common phrase variations representing findings and anatomies. Some such phrases include findings such as gastric distention, gross formational abnormalities, low attenuation structure, signal gap, and mesenteric stranding, and anatomical locations such as portal vein, deep venous system, mediastinal margin, cavernous carotid, and antecubital fossa. This also reflects the challenges involved in creating more robust dictionaries, as there are many possible ways of expressing radiological entities; we leave this to future work.
CONCLUSION
We propose a data programming-based weak supervision method to identify spatial triggers and spatial FEs from radiology reports. The performance in extracting the triggers and elements is satisfactory, with F1 measures of 72.89 and 52.47, respectively. This is achieved without using any manually labeled reports for training. Our results also indicate that sequential fine-tuning of a MIMIC-pretrained BERT model, first on weakly labeled reports and then on gold reports, results in better performance compared to a fully supervised MIMIC BERT model.
Supplementary Material
Contributor Information
Surabhi Datta, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA.
Kirk Roberts, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA.
FUNDING
This work was supported in part by the National Institute of Biomedical Imaging and Bioengineering (NIBIB: R21EB029575), the Patient-Centered Outcomes Research Institute (PCORI: ME-2018C1-10963), and the Cancer Prevention and Research Institute of Texas Predoctoral Fellowship (CPRIT RP210045).
AUTHOR CONTRIBUTIONS
SD and KR conceived the methodology. SD implemented the pipeline, carried out the experiments, and drafted the initial manuscript. KR supervised the study and acquired funding. SD and KR edited the manuscript and approved the final version.
SUPPLEMENTARY MATERIAL
Supplementary material is available at JAMIA Open online.
CONFLICT OF INTEREST STATEMENT
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.
DATA AVAILABILITY
The instructions to access the weakly labeled dataset can be found at: https://github.com/krobertslab/datasets/tree/master/rad-weak-supervision.
REFERENCES
1. Syeda-Mahmood T, Wong KCL, Wu JT, et al. Extracting and learning fine-grained labels from chest radiographs. AMIA Annu Symp Proc 2020; 2020: 1190–9.
2. Bradshaw T, Weisman A, Perlman S, et al. Automatic image classification using labels from radiology text reports: predicting Deauville scores. J Nucl Med 2020; 61: 1410.
3. Wheater E, Mair G, Sudlow C, et al. A validated natural language processing algorithm for brain imaging phenotypes from radiology reports in UK electronic health records. BMC Med Inform Decis Mak 2019; 19 (1): 184.
4. Rubin DL, Willrett D, O'Connor MJ, et al. Automated tracking of quantitative assessments of tumor burden in clinical trials. Transl Oncol 2014; 7 (1): 23–35.
5. Ratner A, Bach SH, Ehrenberg H, et al. Snorkel: rapid training data creation with weak supervision. VLDB J 2020; 29 (2): 709–30.
6. Fries J, Wu S, Ratner A, et al. SwellShark: a generative model for biomedical named entity recognition without labeled data. 2017. doi: 10.48550/arXiv.1704.06360.
7. Shang J, Liu L, Gu X, et al. Learning named entity tagger using domain-specific dictionary. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics; 2018: 2054–64. doi: 10.18653/v1/D18-1230.
8. Safranchik E, Luo S, Bach SH. Weakly supervised sequence tagging from noisy rules. In: Proceedings of the AAAI Conference on Artificial Intelligence. New York, NY: AAAI Press; 2020: 5570–8. doi: 10.1609/aaai.v34i04.6009.
9. Fries JA, Steinberg E, Khattar S, et al. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat Commun 2021; 12 (1): 2017.
10. Pattisapu N, Anand V, Patil S, et al. Distant supervision for medical concept normalization. J Biomed Inform 2020; 109: 103522.
11. Peterson KJ, Jiang G, Liu H. A corpus-driven standardization framework for encoding clinical problems with HL7 FHIR. J Biomed Inform 2020; 110: 103541.
12. Callahan A, Fries JA, Ré C, et al. Medical device surveillance with electronic health records. NPJ Digit Med 2019; 2 (1): 1–10.
13. Banerjee I, Li K, Seneviratne M, et al. Weakly supervised natural language processing for assessing patient-centered outcome following prostate cancer treatment. JAMIA Open 2019; 2 (1): 150–9.
14. Datta S, Ulinski M, Godfrey-Stovall J, et al. Rad-SpatialNet: a frame-based resource for fine-grained spatial relations in radiology reports. In: Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association; 2020: 2251–60. https://www.aclweb.org/anthology/2020.lrec-1.274.
15. Langlotz CP. RadLex: a new method for indexing online educational materials. Radiographics 2006; 26 (6): 1595–7.
16. Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, MN: Association for Computational Linguistics; 2019: 4171–86. doi: 10.18653/v1/N19-1423.
17. Li Y, Shetty P, Liu L, et al. BERTifying the Hidden Markov Model for multi-source weakly supervised named entity recognition. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics; 2021: 6178–90. doi: 10.18653/v1/2021.acl-long.482.
18. Lison P, Barnes J, Hubin A, et al. Named entity recognition without labelled data: a weak supervision approach. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics; 2020: 1518–33. doi: 10.18653/v1/2020.acl-main.139.
19. Zeng X, Li Y, Zhai Y, et al. Counterfactual generator: a weakly-supervised method for named entity recognition. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics; 2020: 7270–80. doi: 10.18653/v1/2020.emnlp-main.590.
20. Zhao X, Ding H, Feng Z. GLaRA: graph-based labeling rule augmentation for weakly supervised named entity recognition. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Computational Linguistics; 2021: 3636–49. doi: 10.18653/v1/2021.eacl-main.318.
21. Chang E, Demberg V, Marin A. Jointly improving language understanding and generation with quality-weighted weak supervision of automatic labeling. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Computational Linguistics; 2021: 818–29. doi: 10.18653/v1/2021.eacl-main.69.
22. Badene S, Thompson K, Lorré J-P, et al. Data programming for learning discourse structure. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics; 2019: 640–5. doi: 10.18653/v1/P19-1061.
23. Wang Y, Sohn S, Liu S, et al. A clinical text classification paradigm using weak supervision and deep representation. BMC Med Inform Decis Mak 2019; 19 (1): 1.
24. Cusick M, Adekkanattu P, Campion TR, et al. Using weak supervision and deep learning to classify clinical notes for identification of current suicidal ideation. J Psychiatr Res 2021; 136: 95–102.
25. Dong H, Suárez-Paniagua V, Zhang H, et al. Rare disease identification from clinical notes with ontologies and weak supervision. In: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC); 2021: 2294–8. doi: 10.1109/EMBC46164.2021.9630043.
26. Shen Z, Yi Y, Bompelli A, et al. Extracting lifestyle factors for Alzheimer's disease from clinical notes using deep learning with weak supervision. 2021. doi: 10.48550/arXiv.2101.09244.
27. Humbert-Droz M, Mukherjee P, Gevaert O. Strategies to address the lack of labeled data for supervised machine learning training with electronic health records: case study for the extraction of symptoms from clinical notes. JMIR Med Inform 2022; 10 (3): e32903.
28. Mallory EK, de Rochemonteix M, Ratner A, et al. Extracting chemical reactions from text using Snorkel. BMC Bioinformatics 2020; 21 (1): 217.
29. Krasakis AM, Kanoulas E, Tsatsaronis G. Semi-supervised ensemble learning with weak supervision for biomedical relationship extraction. In: Automated Knowledge Base Construction (AKBC). Amherst, MA; 2019. doi: 10.24432/C57P4G.
30. Dua S, Baldini I, Katz-Rogozhnikov DA, et al. Biomedical corpus filtering: a weak supervision paradigm with infused domain expertise. In: Scientific Document Understanding (SDU) @ AAAI Conference on Artificial Intelligence; 2021. https://ceur-ws.org/Vol-2831/paper14.pdf.
31. Dunnmon JA, Ratner AJ, Saab K, et al. Cross-modal data programming enables rapid medical machine learning. Patterns (N Y) 2020; 1 (2): 100019.
32. Eyuboglu S, Angus G, Patel BN, et al. Multi-task weak supervision enables anatomically-resolved abnormality detection in whole-body FDG-PET/CT. Nat Commun 2021; 12 (1): 1880.
33. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3: 160035.
34. Rosse C, Mejino JLV. The foundational model of anatomy ontology. In: Burger A, Davidson D, Baldock R, eds. Anatomy Ontologies for Bioinformatics: Principles and Practice. London: Springer; 2008: 59–117. doi: 10.1007/978-1-84628-885-2_4.
35. Bell D, Weerakkody Y, Glick Y, et al. Medical abbreviations and acronyms (A). Reference article. Radiopaedia.org. doi: 10.53347/rID-54589. Accessed March 15, 2023.
36. Bozkurt S, Alkim E, Banerjee I, et al. Automated detection of measurements and their descriptors in radiology reports using a hybrid natural language processing algorithm. J Digit Imaging 2019; 32 (4): 544–53.
37. Datta S, Roberts K. A hybrid deep learning approach for spatial trigger extraction from radiology reports. Proc Conf Empir Methods Nat Lang Process 2020; 2020: 50–5.
38. Si Y, Wang J, Xu H, et al. Enhancing clinical concept extraction with contextual embeddings. J Am Med Inform Assoc 2019; 26 (11): 1297–304.
39. Smit A, Jain S, Rajpurkar P, et al. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics; 2020: 1500–19. https://www.aclweb.org/anthology/2020.emnlp-main.117.
40. Irvin J, Rajpurkar P, Ko M, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2019: 590–7. doi: 10.1609/aaai.v33i01.3301590.