AMIA Annual Symposium Proceedings. 2021 Jan 25;2020:1305–1314.

AI Accelerated Human-in-the-loop Structuring of Radiology Reports

Joy T Wu 1, Ali Syed 1,2, Hassan Ahmad 1, Anup Pillai 1, Yaniv Gur 1, Ashutosh Jadhav 1, Daniel Gruhl 1, Linda Kato 1, Mehdi Moradi 1, Tanveer Syeda-Mahmood 1
PMCID: PMC8075499  PMID: 33936507

Abstract

Rule-based Natural Language Processing (NLP) pipelines depend on robust domain knowledge. Given the long tail of important terminology in radiology reports, it is not uncommon for standard approaches to miss items critical for understanding the image. AI techniques can accelerate the concept expansion and phrasal grouping tasks to efficiently create a domain specific lexicon ontology for structuring reports. Using Chest X-ray (CXR) reports as an example, we demonstrate that with robust vocabulary, even a simple NLP pipeline can extract 83 directly mentioned abnormalities (Ave. recall=93.83%, precision=94.87%) and 47 abnormality/normality descriptions of key anatomies. The richer vocabulary enables identification of additional label mentions in 10 out of 13 labels (compared to baseline methods). Furthermore, it captures expert insight into critical differences between observed and inferred descriptions, and image quality issues in reports. Finally, we show how the CXR ontology can be used to anatomically structure labeled output.

INTRODUCTION

The radiology community is looking for automatic solutions that would enhance their reporting workflow to tackle problems such as delayed reporting and eye fatigue.1 Automatic structured "preliminary reporting" of imaging exams is one avenue to revolutionize radiology workflow by helping radiologists produce and finalize exam reports faster. Recent report generation work has been directed towards the Chest X-ray (CXR) modality, where large CXR datasets, totaling 1 million images, have become publicly available for machine learning research. These include the Indiana University CXR dataset,2 NIH Chest-Xray8 dataset,3 Stanford CheXpert dataset,4 and MIMIC-CXR.5

Most recent report generation studies use a variety of Convolutional Neural Network-Recurrent Neural Network (CNN-RNN) based approaches, where raw text reports along with images are directly used as input for model training.6-11 Although less manual, unsupervised text approaches such as these have some disadvantages. Firstly, they do not easily allow radiologists to introduce workflow-related needs, such as structured reporting (since most raw training reports come unstructured). Of greater clinical concern, however, is that there is no guarantee that report generation models trained this way will automatically learn the key clinical findings in the image.

Since both the input and output stay "unstructured", robust evaluation of the "correctness" of the generated reports remains challenging for these CNN-RNN based models. Instead, most of the report generation algorithms reported in recent literature rely on performance metrics, such as CIDEr-D12, ROUGE-L13 and BLEU14, that measure the linguistic similarity between two text sequences. These statistical measures rely on n-grams, which can fail to capture disease states where terms that cue the context (e.g. negation) and the target abnormality terms are far apart in a sentence. As such, these metrics can fail to capture the overall accuracy and quality of the generated report.

Another approach to train report generation models is to first derive a richly labeled dataset from radiology reports using natural language processing (NLP). Existing work on labeling CXRs from CheXpert shows that training labels can be efficiently and accurately extracted from reports with rule-based NLP.4 Rule-based NLP depends on robust vocabulary knowledge of the radiology domain (e.g. the CXR modality) to accurately extract target labels. Since knowledge curation is usually a bottleneck for these NLP pipelines, they can be restricted to a narrow selection of labels. For example, the CheXpert NLP pipeline is currently limited to extracting 14 labels from CXR reports. These labels are important CXR "triage" findings but are insufficient for populating preliminary reports. There is also no guarantee that, for each label, radiologists would be able to manually produce a comprehensive list of vocabulary and patterns that reflects the wide range of documenting practices in free text reports.

In our work, we describe a systematic methodology in which domain experts (radiologists) used a human-in-the-loop unsupervised NLP tool15,16 to efficiently construct an anatomically organized, report-language specific, lexicon-defined ontology. We show that, with the knowledge curated from one large corpus of CXR reports, even a simple rule-based NLP pipeline can extract accurate and rich labels from two different corpora of CXR reports. Additionally, we provide a detailed comparative analysis of the CXR reporting label landscape against existing work (CheXpert). We utilized reports from an open source CXR report dataset for the latter analysis, and argue that the label set we extracted is much richer for training report generation imaging networks. Finally, we show that, using the CXR lexicon ontology, our label extraction pipeline effectively structures the finding and impression sections of CXR reports in compliance with the anatomically organized CXR reporting template published by the Radiological Society of North America (RSNA).17

METHODS

I. Premise - driven by clinical and machine learning needs

One key goal for any NLP pipeline that extracts labels from radiology reports for training deep learning imaging models is to derive the "visual" presence or absence of target observations in the imaging exam. To do so, we observe that the knowledge created needs to meet several practical criteria:

  1. It should follow recommended terminologies (e.g. the CXR part of the Fleischner Glossary18) where possible.

  2. However, more importantly, it needs to capture the large variation in terminology used by average radiologists in their routine reporting practice. Terms such as "infiltrate" may not be recommended by the Fleischner Society, but are used routinely by many radiologists to describe visual findings that could look like "consolidation" (the recommended term). The reality is that terminologies not in professional guidelines are present in reports from any PACS system and need to be accounted for, based on the most likely visual similarity, for computer vision tasks.

  3. The vocabularies need to be grouped so that the extracted labels relate to visually similar images.

  4. To extract labels that are visually similar, a semantic distinction must be made between the "observed" and "inferred" types of label descriptions in reports.

  5. It should support the variation in how radiologists describe abnormality and normality in free text reports. Radiology reporting practice boils down to describing whether an anatomy or an anatomical location (e.g. angle, border) is normal or abnormal (and if abnormal, what specific abnormality it is). For example, "cardiomegaly" is an abnormality ("enlargement") of the heart anatomy, and "rib fracture" is a "fracture" of the rib anatomy. Therefore, to allow robustness of anatomy-pattern-based label extraction, we aim to structure and capture the anatomical relations between abnormalities and the related anatomies in the CXR ontology.

  6. There need to be additional NLP context considerations for "visual label presence" in the labeled image. For example, X should actually be affirmed in "There is no increase in X", Y should be negated in "There is resolution of Y.", and Z may or may not be present in "Assess for Z". Just as there is variation in label-related terminology in reports, there is also a large variation in context-determining language in radiology reports (see the sketch after this list).

  7. The curated vocabulary and grouping need to be validated in context by multiple radiologists.
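
As a minimal sketch of criterion 6, the snippet below classifies the visual presence of a spotted label phrase given its sentence context. The cue lists and the function are purely illustrative assumptions; the actual pipeline relies on the bottom-up curated "NLP contexts" lexicons rather than these hard-coded patterns.

```python
import re

# Illustrative context cues only; the curated ontology contains far richer lexicons.
NEGATION_CUES = [r"\bno\b", r"\bwithout\b", r"\bresolution of\b", r"\bresolved\b"]
# Phrases that look like negation but actually affirm the finding is still present.
PSEUDO_NEGATION_CUES = [r"\bno (increase|change|interval change) in\b"]
UNCERTAINTY_CUES = [r"\bassess for\b", r"\bcannot exclude\b", r"\bmay represent\b"]

def visual_presence(sentence: str, label_phrase: str) -> str:
    """Classify a spotted label phrase as affirmed / negated / uncertain."""
    s = sentence.lower()
    if label_phrase.lower() not in s:
        return "not mentioned"
    if any(re.search(p, s) for p in PSEUDO_NEGATION_CUES):
        return "affirmed"      # e.g. "There is no increase in X" -> X still present
    if any(re.search(p, s) for p in UNCERTAINTY_CUES):
        return "uncertain"     # e.g. "Assess for Z"
    if any(re.search(p, s) for p in NEGATION_CUES):
        return "negated"       # e.g. "There is resolution of Y"
    return "affirmed"

print(visual_presence("There is no increase in pleural effusion.", "pleural effusion"))  # affirmed
print(visual_presence("There is resolution of the pneumothorax.", "pneumothorax"))       # negated
```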

II. Building an exhaustive CXR lexicon-defined ontology with bottom-up curated vocabularies

Top-down literature search:

The Fleischner Glossary contains recommended "proper" Chest X-ray (CXR) and Chest CT reporting terminologies from the North American thoracic radiology society.18 We kept only the terminologies related to CXR reporting as the starting "concept level" backbone of the CXR ontology, and referred back to Fleischner's finding definitions when grouping the bottom-up curated phrases from raw reports. In addition, we reviewed several radiology educational resources to determine rarer potential CXR labels less likely to appear in a limited selection of radiology reports.18-20

We also investigated the radiology terminologies from RadLex.21 However, we found that it has a relatively small vocabulary for potential training label coverage in CXR reports. Its anatomical knowledge organization is also difficult to apply to real CXR reports for anatomy-pattern-based finding extraction or structured reporting. The closest existing CXR knowledge work suitable for labeling images using reports comes from the CheXpert group, which focuses on only 14 CXR findings, but does have a starting (implicit) anatomical organization (e.g. lungs, pleura) for some of them.4

Bottom-up vocabulary curation - Assisted vocabulary expansion:

Dictionary expansion with the Domain Learning Assistant (DLA) tool is a human-in-the-loop method of rapidly developing more complete lists of terms around an "entity".15,16 It uses an ensemble of expansion techniques, ranging from deep learning systems to pattern extraction to linked data, to quickly and efficiently produce high quality candidates. However, in any real-world application the definition of an entity type is quite subtle; thus, having a human domain expert (e.g. a radiologist) quickly review the suggestions keeps the lexicon 'on topic', since they only accept good terms. These accepts and rejects allow the underlying system to focus (or refocus) on the important concepts.

For example, one major type of "entity" in CXRs is "opacity", which is the radiology terminology used to describe an area of increased whiteness (from attenuation of the X-ray beam) within the lungs on a typical CXR. Our expert gave a few seed terminologies used to describe lung opacity related CXR observations in reports. The DLA expansion engine (Figure 1) uses the seed examples to propose additional likely candidates that occurred in similar textual contexts in the CXR reports we used for vocabulary curation. The experts then accepted all abbreviations, misspelled words, and any other ways of describing "opacity" in the candidate list, and rejected the rest. For unclear cases, the experts were able to examine the most common contexts using the DLA tool before accepting or rejecting a proposed candidate. They were also able to accept just part of a proposed phrase. The newly accepted and rejected examples became additional training data to propose more candidates in future iterations. We were able to expand the lung opacity related terms to over 200 semantic equivalents in an hour.

Figure 1. Bottom-up assisted vocabulary expansion with Domain Learning Assistant (DLA) tooling. The "Accepted" column contains the expert's seed and accepted phrases. The "Candidate" column presents the phrases proposed by the DLA tool. "Rejected" phrases appear in the rightmost column. The experts can choose to view the report context for any phrase in the left panel.
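
The DLA ensemble itself is not reproduced here, but the sketch below illustrates one of the simpler signals such a tool could use: proposing candidates whose distributional embeddings are close to the expert-accepted seeds. The toy corpus, hyperparameters and function name are assumptions for illustration only.

```python
from gensim.models import Word2Vec

# Assumed input: report sentences already tokenized into lists of words.
tokenized_reports = [
    ["there", "is", "a", "hazy", "opacity", "in", "the", "left", "lower", "lobe"],
    ["patchy", "airspace", "opacification", "at", "the", "right", "base"],
    # ... the real curation corpus contained 200,000 CXR reports
]

# Train distributional embeddings on the curation corpus (hyperparameters illustrative).
model = Word2Vec(tokenized_reports, vector_size=100, window=5, min_count=1, workers=4)

accepted = {"opacity"}   # expert seed terms
rejected = set()

def propose_candidates(n=20):
    """Rank unreviewed terms by embedding similarity to the accepted set."""
    seeds = [t for t in accepted if t in model.wv]
    candidates = model.wv.most_similar(positive=seeds, topn=n + len(rejected))
    return [(term, score) for term, score in candidates
            if term not in accepted and term not in rejected][:n]

# The expert accepts/rejects the proposals; each decision refines the next round.
for term, score in propose_candidates():
    print(f"candidate: {term}  (similarity {score:.2f})")
```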


Assisted finer grouping of curated vocabulary into "lexicons" for label extraction:

After expanding a broad parent type of CXR observations, such as "lung opacity", we used the "Grouper" function of DLA to assist our experts in grouping the pool of phrases into finer children "lexicons". The tool makes lexicon grouping up to 5 times faster than manual methods.22 The DLA "Grouper" engine suggests potential groups to the radiologists for bottom-up lexicon group creation. The expert can use the suggested groups as starting points to develop, name, expand and approve groups to form a final grouping. The system uses what the human approves and rejects to learn what is important in this grouping. Using the lung opacity example again, the assisted grouping tool allowed our radiologists to group the bottom-up curated phrases into more specific types of opacity observations, such as consolidation, lobar/segmental collapse, pulmonary edema/hazy opacity, etc. The radiologists grouped any abbreviations, synonyms and semantically equivalent phrases under the same lexicon based on their understanding of the visual similarity of the observation that the accepted phrases describe (Table 1).

Table 1: Semantically grouped lexicon examples.
Label name | Related UMLS CUI | Synonyms | Semantic equivalents
Consolidation | C0702116 (or) C0239027 | consolidated, conslidation, etc. | alveolar opacity, airspace infiltrate, etc.
Lobar/segmental collapse | C1522010 (and) C0004144 | lobar collapse, lung collapse, etc. | volume loss, flat waist sign, etc.

After the finer groupings were agreed upon by two radiologists, the experts re-imported the groups as concepts into the DLA engine for further vocabulary expansion until exhaustion (no more interesting candidates). Similar cycles of broad concept expansion, finer grouping and further expansion of the grouped concepts were also performed for lung lesion (mass and/or nodule), technical assessment, medical device and anatomy related labels.
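
The DLA "Grouper" engine is proprietary, so the sketch below only illustrates the kind of group suggestion step described above, here approximated with k-means over character n-gram TF-IDF vectors; the representation, cluster count and phrase list are assumptions, and in the real workflow the experts' accept/reject decisions drive the suggestions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Expert-accepted "lung opacity" phrases awaiting finer grouping (toy subset).
phrases = [
    "alveolar opacity", "airspace infiltrate", "consolidated",
    "lobar collapse", "volume loss", "flat waist sign",
    "hazy opacity", "vascular congestion",
]

# Character n-gram TF-IDF is an assumed representation for suggestion purposes only.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(phrases)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

suggested_groups = {}
for phrase, cluster in zip(phrases, clusters):
    suggested_groups.setdefault(cluster, []).append(phrase)

# The radiologist reviews each suggested group, names it (e.g. "consolidation",
# "lobar/segmental collapse"), moves misplaced phrases, and re-exports the
# approved groups as concepts for another round of expansion.
for cluster, members in suggested_groups.items():
    print(cluster, members)
```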

Organization of CXR lexicons:

The clinical experts assigned each bottom-up curated CXR lexicon concept a radiology-specific semantic category (Table 2 and Table 3). Additionally, each CXR lexicon concept is manually mapped to the closest Concept Unique Identifier(s) (CUI) in the widely accepted Unified Medical Language System (UMLS) ontology for bioinformatics research.23 The UMLS CUIs are selected based on semantic, lexical, ontological and clinical definition similarity for each CXR lexicon concept. Any uncharted concepts are represented by combining one or more UMLS CUIs (Table 1). This mapping effort grafts the curated CXR lexicon ontology onto the UMLS ontology, which would allow it to leverage the wider medical knowledge base in future reasoning tasks.

Table 2: Semantic categories of labels in the bottom-up curated CXR lexicon ontology.
Semantic Category | Description and rationale | In CheXpert
Technical assessment | Describes the quality of the image and whether there are any resulting diagnostic limitations. For example, ascertaining heart size abnormality is more difficult if the patient is rotated in the image. | No
View point | E.g. AP, PA or lateral view. Some findings are better or only visible from a certain view. PA CXR images are taken under more controlled conditions and tend to have better quality than AP images. We categorized view point labels separately from technical assessment since this information is inconsistently described in reports, but is available in structured DICOM meta-data. | No
Tubes and lines | Special support devices (e.g. ij line, endotracheal tubes) that are placed inside patients for treatment or monitoring purposes. Assessment of tubes and lines placement and ruling out placement related complications (e.g. pneumothorax) are common reasons why CXRs are taken. | No
Tubes and lines finding | Describes abnormal tubes and lines placement issues in CXR reports. | No
Device | Other medical devices visible in CXR images but where placement issues are irrelevant (e.g. ekg leads and stickers). Devices are included if they might possibly affect the diagnostic quality of the exam visually (e.g. by obscuring parts of anatomies). | No
Finding (anatomical) | These labels contain terminologies used by radiologists when they more objectively describe what they visually see in the imaging exam (e.g. lung opacity). | Yes
Disease | Disease (or diagnosis) concepts tend to incorporate radiologists' own clinical judgment, inference, impressions or conclusions after incorporating additional knowledge (e.g. patient clinical information, known prior studies, clinical experience) outside of what is objectively visible within the particular CXR image. For the same CXR, different diseases (e.g. pneumonia, heart failure) can be reported if the clinical setting or patient information is different. | No
Table 3: Semantic categories of CXR lexicons useful for context, pattern or anatomy recognition.
Semantic Category | Description and rationale | In CheXpert
Major structure | Closely follows the recommended anatomical fields from the RSNA structured reporting template to facilitate structured reporting.17 | No
Subanatomy | More granular location terminologies that radiologists use in reports to describe where the anatomical abnormalities are seen. The subanatomy labels are grouped under the major structure labels to delineate parent-to-child anatomical hierarchies. | No
Location | Describes the placement locations of various tubes and line devices. | No
Laterality | Identifies whether a finding, disease, tubes and lines or device label is visually present on the left, right or both sides (bilaterally) of the image. | No
NLP contexts | Contains bottom-up curated terms useful for radiology-report-specific context recognition. They help identify whether a mentioned label is likely visually present in the image, e.g. "no change in opacity" does not imply negation of opacity. Terms for describing normal or abnormal anatomies and other visual modifiers (e.g. uncertainty and severity) of findings are also in this semantic category. | No

We semantically differentiated between "observed" type concepts (a.k.a. labels), which we call "finding" labels, and "inferred" type concepts, which we call "disease" labels. This helps distinguish which report-derived labels may have more objective visual signatures in images. Existing work from CheXpert has not made this distinction. However, this difference is clinically important for automatic image labeling with radiology reports. Report language contains both what radiologists visually observed in images and what they inferred from that observation with additional clinical knowledge outside the image. For example, "lung opacity" is the most objective observation, "consolidation" is often used to describe an opacity that possibly looks like pneumonia, and "pneumonia" is most clearly a clinical inference (disease diagnosis) based also on knowing that the patient has a fever and cough, etc. However, the same opacity might be documented as some other disease process (e.g. fluid overload/heart failure) if the clinical information had been different.

Lastly, to support structured label extraction, we organized and related every finding and disease label to the anatomical structure labels in which they can be expected to be found. The tubes and lines related labels are related to the location labels for placement extraction. For example, "consolidation" is a finding in the "lungs" (major structure), and the tip of an "IJ line" can be described as being found in the "svc", "cavoatrial junction", "right atrium", "brachiocephalic vein", "internal jugular vein", etc. For ease of working closely with the clinical experts, we represented the curated CXR lexicon ontology in a flat tabular format, where each column has pre-specified relations to other columns.
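
A minimal sketch of such a flat tabular representation is shown below. The column names, the phrase lists and all CUIs except the consolidation example from Table 1 are placeholders, not the actual ontology file.

```python
import pandas as pd

# Illustrative rows only; the curated ontology has 329 concepts and 408 relations.
# CUIs other than the consolidation example are "<cui>" placeholders.
cxr_ontology = pd.DataFrame([
    {"label": "consolidation", "semantic_category": "finding",
     "major_structure": "lungs", "umls_cui": "C0702116",
     "phrases": "consolidation|consolidated|alveolar opacity|airspace infiltrate"},
    {"label": "enlarged cardiac silhouette", "semantic_category": "finding",
     "major_structure": "heart", "umls_cui": "<cui>",
     "phrases": "cardiomegaly|enlarged heart"},
    {"label": "ij line", "semantic_category": "tubes and lines",
     "major_structure": "mediastinum", "umls_cui": "<cui>",
     "phrases": "ij line|internal jugular line|ij catheter"},
])

# Anatomy-pattern lookups become simple column filters.
lung_findings = cxr_ontology[
    (cxr_ontology.semantic_category == "finding")
    & (cxr_ontology.major_structure == "lungs")
]
print(lung_findings.label.tolist())   # ['consolidation']
```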

III. Study objectives and setup for comparing CXR label coverage and report understanding

In this study, we demonstrate the breadth of label coverage of our systematically curated CXR lexicon ontology for concepts in free text CXR reports. We compare this to the extraction results from CheXpert, the most commonly used CXR labeling ontology and NLP pipeline in the medical computer vision community.

A clinical knowledge driven, simple NLP pipeline:

Firstly, we need to establish that the simple rule-based NLP pipeline we developed with the CXR lexicon ontology performs comparably with CheXpert's, so that we can more accurately attribute differences in labeling output to vocabulary differences. CheXpert's NLP pipeline was written with some hand-crafted rules around its vocabulary and labels, which means we cannot simply substitute our vocabulary into CheXpert's pipeline without affecting its performance. We illustrate below (Figure 2) the steps implemented in our NLP pipeline.

Figure 2. A simple rule-based NLP pipeline that utilizes the CXR lexicon ontology.

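Since Figure 2 is not reproduced in this text, the sketch below is only an assumed outline of what an ontology-driven pipeline of this kind could look like: dictionary spotting of lexicon phrases, simple context detection, and grouping of results by major anatomical structure. The lexicon entries, cue list and function names are illustrative, not the pipeline's actual implementation.

```python
from typing import Dict, List

# Hypothetical, simplified data; the real pipeline is driven by the full
# CXR lexicon ontology and its richer context lexicons.
LEXICON = {  # phrase -> (label, major structure)
    "airspace infiltrate": ("consolidation", "lungs"),
    "consolidation": ("consolidation", "lungs"),
    "pleural effusion": ("pleural effusion", "pleura"),
    "enlarged heart": ("enlarged cardiac silhouette", "heart"),
}
NEGATION_CUES = ["no ", "without ", "resolution of "]

def label_report(report: str) -> Dict[str, List[dict]]:
    """Spot lexicon phrases sentence by sentence, assign context, and
    group the results by RSNA-style major structure fields."""
    structured: Dict[str, List[dict]] = {}
    for sentence in report.lower().split("."):
        for phrase, (label, structure) in LEXICON.items():
            if phrase in sentence:
                negated = any(cue in sentence for cue in NEGATION_CUES)
                structured.setdefault(structure, []).append(
                    {"label": label, "negated": negated, "evidence": sentence.strip()}
                )
    return structured

report = "There is a right basal airspace infiltrate. No pleural effusion."
print(label_report(report))
# {'lungs': [{'label': 'consolidation', 'negated': False, ...}],
#  'pleura': [{'label': 'pleural effusion', 'negated': True, ...}]}
```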

Labeled output comparison with CheXpert's pipeline:

Secondly, we aim to understand the extraction gains from the additional vocabulary, as compared with the CheXpert vocabulary alone. For this, we first mapped the relevant CXR lexicon labels we curated to 13 of the existing CheXpert labels using their phrasal grouping definitions (Table 4). The "no finding" CheXpert label is dropped in this analysis since it is a less meaningful catch-all label. Since some CheXpert label definitions sit very high in our ontological hierarchy (e.g. support devices, fracture), we included all the relevant child labels that would be extracted by our pipeline. We then ran the original CheXpert pipeline through NegBio on the same set of tokenized (and quality-checked) sentences to generate results on their 13 labels for comparison. Using the same sentences removes sentence tokenization errors as a variable. A minimal comparison sketch follows Table 4.

Table 4: Mapping between CheXpert's labels and the CXR lexicon ontology.
CheXpert label | CXR lexicon label (same level/meaning) | CXR lexicon labels (related by how CheXpert groups/uses its phrases)
Support devices | device present | ekg leads and stickers, other external medical devices, cardiac pacer and wires, msk or spinal hardware, other internal post-surgical material, sternotomy wires, tubes or lines present (nos), central venous line (not otherwise specified), chest port, dialysis/pheresis line, ij line, picc, subclavian line, swan-ganz catheter, chest tube, mediastinal drain, pigtail catheter, enteric tube, endotracheal tube, tracheostomy tube
Pleural effusion | pleural effusion | -
Pleural other | pleural/parenchymal scarring | -
Edema | pulmonary edema/hazy opacity | vascular congestion, fluid overload/heart failure
Consolidation | consolidation | -
Lung opacity | not otherwise specified opacity (pleural/parenchymal opacity) | infiltration, increased reticular markings/ild pattern, pleural/parenchymal scarring
Atelectasis | linear/patchy atelectasis | lobar/segmental collapse
Enlarged cardiomediastinum | mediastinal widening | -
Lung lesion | mass/nodule (not otherwise specified) | multiple masses/nodules, cavitary mass/nodule
Pneumonia | pneumonia | -
Cardiomegaly | enlarged cardiac silhouette | -
Fracture | fracture | new fractures, old fractures, rib fracture, spinal fracture, clavicle fracture, humerus fracture, scapula fracture, sternal fracture
Pneumothorax | pneumothorax | hydropneumothorax (CheXpert does partial matching, so it treats this finding the same as pneumothorax)
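
As a minimal sketch of the comparison, the snippet below counts, per mapped label, how many reports each labeler marks as mentioned at all. The file names, column names and value vocabulary are assumptions for illustration.

```python
import pandas as pd

# Assumed inputs: one row per (report, label) with a "value" column in
# {"affirmed", "negated", "uncertain", "not mentioned"} for each labeler.
ours = pd.read_csv("cxr_lexicon_labels_mapped_to_chexpert.csv")   # hypothetical file
chexpert = pd.read_csv("chexpert_negbio_labels.csv")              # hypothetical file

def mention_counts(df: pd.DataFrame) -> pd.Series:
    """Count how many reports mention each label at all (any context)."""
    return (df.value != "not mentioned").groupby(df.label).sum()

comparison = pd.DataFrame({
    "chexpert_vocab_only": mention_counts(chexpert),
    "with_curated_vocab": mention_counts(ours),
})
comparison["additional_mentions"] = (
    comparison.with_curated_vocab - comparison.chexpert_vocab_only
)
print(comparison.sort_values("additional_mentions", ascending=False))
```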

Lastly, we demonstrate the additional label coverage/distribution, semantic richness, and an added anatomical structuring use case by labeling with the CXR lexicon ontology (without mapping to CheXpert's labels).

IV. CXR report datasets

Dataset for CXR vocabulary curation:

We queried 200,000 de-identified CXR reports from the "noteevents" table in the MIMIC-III dataset for vocabulary curation and CXR lexicon ontology development. MIMIC-III is a restricted access Intensive Care Unit dataset sourced from one tertiary Boston Hospital.24 We picked this corpus for its accessibility and range of reporting practice from different hospital departments over 10 years.

Datasets for establishing our NLP pipeline's label extraction performance:

1. 500 Indiana Hospital CXR reports2 - A random 500 of the 3851 unique Indiana CXR reports were dual-annotated to evaluate the precision, recall and F1 scores for all the CXR-relevant labels described in this set of 500 reports (83 specifically mentioned and 47 normal/abnormal anatomy labels). Spotted label trigger phrases are marked where possible, and all sentences are reviewed twice for correctness of context detection (negation or affirmation, by a non-MD and an MD) and radiology semantics (MD), and to assess recall. Disagreements are resolved by a third annotator (MD). A minimal per-label scoring sketch follows this list.

2. 3000 NIH CXR reports - 3000 Anterior-Posterior (AP) CXR images sampled from the NIH Chest-Xray8 dataset3 were internally reported de novo by crowd-sourced radiologists. All 3000 free text reports were single-annotated (by non-MDs or MDs) to provide ground truth for 45 finding category labels.
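
As a minimal sketch of the per-label scoring against these ground truths, the helper below computes precision, recall and F1 from sets of (report, label) pairs; the data layout is an assumption, and only the affirmed case is shown (the negated case is analogous).

```python
def per_label_scores(gold: list, predicted: list):
    """gold / predicted: (report_id, label) pairs marked affirmed for one label."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example for the "pleural effusion" label:
gold = [("r1", "pleural effusion"), ("r2", "pleural effusion")]
pred = [("r1", "pleural effusion"), ("r3", "pleural effusion")]
print(per_label_scores(gold, pred))   # (0.5, 0.5, 0.5)
```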

Dataset for exploring coverage and statistics of extracted labels:

For reproducibility (we intend to open source the labeled dataset), we ran our CXR labeling pipeline on the Indiana Hospital CXR dataset, which has 3851 unique de-identified free text CXR reports, available from the Open-i website.2

RESULTS

I. Contribution to CXR report labeling vocabulary and ontology

Overall, the CXR lexicon ontology's largest contribution is the addition of 27 labels for identifying different devices or tubes and lines, and 26 labels describing technical assessment problems, areas missing or poorly covered by CheXpert. Compared to the CheXpert baseline (13 comparable CXR concepts, 93 associated unique phrases and 15 relations), we identified 329 unique CXR concepts described in CXR reports, 8752 associated unique raw phrases (1792 unique phrases after removing partial matches), 12 semantic categories and 408 ontological relations from this process. The curation process took 3 experts (2 radiologists and 1 clinician) two weeks to complete (part time). All the accepted phrases for each lexicon have been validated by both radiologists through the DLA tool interface.

Our experts accepted many additional terms proposed by the DLA tool, many of which have partial matches in CheXpert's vocabulary. However, the longer phrases in the curated vocabulary tend to be more visually descriptive of the target finding. For example, "infiltrate" alone is less specific than "alveolar infiltrate" and "interstitial infiltrate", which denote different visual finding labels and so were grouped under different lexicons. Therefore, we did not remove lexical redundancies in the rich vocabularies that the radiologists deemed relevant to include for each lexicon. However, for comparison purposes, we show the actual number of phrases for each label (compared at the CheXpert level) that are not partial matches of each other, as well as all the additional phrases that the curation process has added to existing knowledge as compared to CheXpert (Figure 3).

Figure 3. Comparison of curated vocabulary against 13 CheXpert labels according to CheXpert's phrase grouping definitions.


II. Briefly: CXR label detection performance via a simple NLP pipeline

We assessed the affirmation and negation performance of the NLP pipeline used for comparison with CheXpert's labeling coverage in the following sections. A snapshot of our results (Table 5) is calculated on the two ground truth datasets described in the methods section. More detailed per-label F1 score data is presented for the Indiana report dataset in a later section (Figure 5). Overall, the simple NLP pipeline achieved > 90% average precision and recall over a large set of target labels. The 45 labels evaluated for the NIH reports are all finding semantic labels, where straightforward spotting of phrases in the CXR lexicon ontology followed by negation worked well (average precision 99.00%, recall 96.37%). The 83 specifically mentioned labels for Indiana also include technical assessment and detailed tubes and lines placement labels, which are harder to detect without more complex pattern recognition.

Table 5: Negation detection performance of a simple NLP pipeline using the CXR lexicon ontology.

Dataset | Number and types of labels validated | Average precision | Average recall
Indiana (500 reports) | 83 specifically mentioned labels | 94.87% | 93.83%
Indiana (500 reports) | 47 abnormal/normal anatomy description labels | 99.51% | 92.63%
NIH (3000 reports) | 45 specifically mentioned finding labels | 99.00% | 96.37%

Figure 5. Label detection using the CXR lexicon ontology on the Indiana Hospital CXR dataset. Top and middle figures: x-axis labels are blue if covered by CheXpert. Middle and bottom figures: bars are color-coded by each label's F1 score (on 500 reports). Bottom figure: the axis label name is black if the label is affirmed, or red if the label is negated.


III. Comparing extracted label results using CheXpert's label definitions

To assess whether and how much additional extraction our CXR lexicon vocabulary adds when labeling CXR images from reports, we compare the output extracted with the CheXpert vocabulary alone and with the added bottom-up curated vocabulary. For the label mapping and rationale, see the methods section (Table 4).

Overall, the added vocabulary did increase the number of detected mentions for most of the labels (Figure 4), though to varying degrees, which should still help reduce the problem of missing data in the final labeled dataset for imaging classifier training. In addition, the increased detection did not come at the expense of per-label accuracy (Figure 5).

Figure 4. Label detection using different vocabularies on the Indiana Hospital CXR dataset. Mentioned (top-left), affirmative (top-right), negated (bottom-left) and uncertain (bottom-right) cases extracted from reports.


The only label that did not show improved detection is "enlarged cardiomediastinum". Reviewing the CheXpert NLP pipeline, we found that almost all of its hand-crafted rules deal with extracting negated and uncertain cases for this label. The affirmed cases of "enlarged cardiomediastinum" typically only mean that the "cardiomediastinum" anatomy has been mentioned, which by itself does not entail the abnormality described by the label name, "enlarged cardiomediastinum". This suggests there is some semantic inconsistency in CheXpert's vocabulary.

IV. Additional report insight gained with granular CXR labels and radiology-specific semantic grouping

We used the CXR lexicon ontology and our NLP pipeline to label the whole Indiana Hospital CXR report dataset and show the prevalence statistics for the most common and clinically important CXR labels extracted. The semantically categorized results show the complex label landscape that radiologists describe in CXR reports (Figure 5, top graph).

Overall, we show a fuller spectrum of different labels useful for reporting, extracted using the CXR lexicon ontology (Figure 5, middle & bottom), compared to baseline. In addition, for CXR reporting, it is also key for radiologists to document whether various anatomies are observed to be normal or not (Figure 5, bottom graph). Therefore, besides labels that specifically name the abnormalities (Figure 5, middle graph), label patterns that describe the normality/abnormality of key anatomies are also important for producing anatomy descriptors to train an automated system.

Finally, we illustrate some clinically known correlations between selected finding labels and labels from other semantic categories (disease, technical assessment, and tubes and lines). For example, low lung volumes (e.g. from poor technique) can be associated with descriptions of a vascular congestion appearance of the lungs in CXRs (Figure 6). Additionally, as color-coded in the same figure, all labels extracted with the CXR lexicon ontology can be related to one or more key major structure concepts (e.g. "consolidation" [is a] "finding" of the "lungs" "major structure"). Since the major structure concepts follow the RSNA structured reporting template for CXRs,17 the labels extracted (and their source sentences) can be categorized into these structured reporting fields.

Figure 6. Correlation heatmap for selected labels. X-axis has finding labels for: lungs (blue), pleura (black), mediastinum (red), bones (blue), other (brown). Y-axis has labels for tubes and lines (black), technical assessment (purple), disease (red).

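As a sketch of how a co-occurrence heatmap like Figure 6 can be derived from the labeled output, the snippet below correlates a binary report-by-label matrix; the file name, column names and plotting choices are assumptions for illustration.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed input: a binary report-by-label matrix (1 = label affirmed in the report),
# e.g. produced by pivoting the labeler's per-sentence output.
labels = pd.read_csv("indiana_labeled_reports.csv", index_col="report_id")  # hypothetical file

finding_cols = ["low lung volumes", "vascular congestion", "consolidation"]  # assumed column names
other_cols = ["fluid overload/heart failure", "endotracheal tube"]           # assumed column names

# Correlation between selected finding labels and labels from other
# semantic categories (disease, technical assessment, tubes and lines).
corr = labels[finding_cols + other_cols].corr().loc[other_cols, finding_cols]

sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
```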

DISCUSSION

The systematic methodology presented utilized a human-in-the-loop NLP tool for the bottom-up curation of a domain-specific ontology for structuring CXR reports. The approach has the advantage of harnessing unsupervised NLP techniques to suggest candidate vocabulary that may not have been considered by the human domain experts, whilst keeping the experts in the loop to make sure only vocabulary suited to the task gets included. In particular, by purposefully starting with broader vocabulary inclusion criteria and then grouping into finer concepts (which also helps to qualitatively educate the domain experts on what actually gets described in real-life reports), we aimed to execute a less biased bottom-up curation process. However, a limitation of our current work is the lack of wider validation of the ontology by the broader radiology community. For this reason, we hope to contribute the curated CXR lexicon ontology to the community (pending internal legal approvals) to facilitate wider expert validation and improvement.

CONCLUSION

A rich CXR lexicon ontology was efficiently constructed within two weeks with a human-in-the-loop NLP tool, where 3 domain experts curated 329 report concepts (over 23 times more than baseline), 8752 unique raw phrases, 12 semantic categories, and 408 relations. We show that with a robust context detection vocabulary and richness of recognizable concepts, even a simple NLP pipeline can extract a rich set of clinically reliable labels (average recall and precision > 90% on two datasets) for training supervised report generation imaging networks. The process allows the experts to semantically group and anatomically organize labels, contributing to a better understanding of CXR language and reporting needs. Further work is needed to evaluate whether introducing expert knowledge into labeling datasets can indeed train more reliable report generation systems than the existing CNN-RNN based networks.


References

  [1]. Zha Nanxi, Patlas Michael N, Duszak Richard. Radiologist burnout is not just isolated to the United States: perspectives from Canada. Journal of the American College of Radiology. 2019;16(1):121–123. doi: 10.1016/j.jacr.2018.07.010.
  [2]. Demner-Fushman Dina, Kohli Marc D, Rosenman Marc B, et al. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association. 2016;23(2):304–310. doi: 10.1093/jamia/ocv080.
  [3]. Wang Xiaosong, Peng Yifan, Lu Le, Lu Zhiyong, Bagheri Mohammadhadi, Summers Ronald M. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on CVPR. 2017. pp. 2097–2106.
  [4]. Irvin Jeremy, Rajpurkar Pranav, Ko Michael, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence. 2019;volume 33:590–597.
  [5]. Johnson Alistair EW, Pollard Tom J, Berkowitz Seth, Greenbaum Nathaniel R, Lungren Matthew P, Deng Chih-ying, Mark Roger G, Horng Steven. MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. 2019;1(2). doi: 10.1038/s41597-019-0322-0.
  [6]. Han Zhongyi, Wei Benzheng, Leung Stephanie, et al. Towards automatic report generation in spine radiology using weakly supervised framework. MICCAI. Springer; 2018. pp. 185–193.
  [7]. Gale William, Oakden-Rayner Luke, Carneiro Gustavo, Bradley Andrew P, Palmer Lyle J. Producing radiologist-quality reports for interpretable artificial intelligence. arXiv preprint arXiv:1806.00340. 2018.
  [8]. Wang Xiaosong, Peng Yifan, Lu Le, Lu Zhiyong, Summers Ronald M. TieNet: Text-image embedding network for common thorax disease classification and reporting in chest x-rays. In Proceedings of the IEEE conference on CVPR. 2018. pp. 9049–9058.
  [9]. Yuan Li, Liang Xiaodan, Hu Zhiting, Xing Eric P. Hybrid retrieval-generation reinforced agent for medical image report generation. In Advances in neural information processing systems. 2018. pp. 1530–1540.
  [10]. Liu Guanxiong, Hsu Tzu-Ming Harry, McDermott Matthew, et al. Clinically accurate chest x-ray report generation. In Proceedings of the 4th MLHC, volume 106 of Proceedings of Machine Learning Research. PMLR; 2019. pp. 249–269.
  [11]. Zhang Yuhao, Ding Daisy Yi, Qian Tianpei, et al. Learning to summarize radiology findings. arXiv preprint arXiv:1809.04698. 2018.
  [12]. Vedantam Ramakrishna, Zitnick C Lawrence, Parikh Devi. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on CVPR. 2015. pp. 4566–4575.
  [13]. Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollar Piotr, Zitnick C Lawrence. Microsoft COCO: Common objects in context. European conference on computer vision. Springer; 2014. pp. 740–755.
  [14]. Papineni Kishore, Roukos Salim, Ward Todd, Zhu Wei-Jing. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. 2002. pp. 311–318.
  [15]. Coden Anni, Gruhl Daniel, Lewis Neal, et al. Spot the drug! An unsupervised pattern matching method to extract drug names from very large clinical corpora. 2012 IEEE Second International Conference on HISB. IEEE; 2012. pp. 33–39.
  [16]. Alba Alfredo, Gruhl Daniel, Ristoski Petar, Welch Steve. Interactive dictionary expansion using neural language models. In HumL@ISWC. 2018. pp. 7–15.
  [17]. Schmidt Teri Sippel. Rad Chest 2 Views. RSNA Radreport 2017 (accessed March 11 2020). https://radreport.org/home/50271/2017-11-14%2017:22:09.
  [18]. Hansell David M, Bankier Alexander A, MacMahon Heber, et al. Fleischner Society: glossary of terms for thoracic imaging. Radiology. 2008;246(3):697–722. doi: 10.1148/radiol.2462070712.
  [19]. Folio Les R. Chest imaging: an algorithmic approach to learning. Springer Science & Business Media; 2012.
  [20]. Shepard Jo-Anne O. Thoracic Imaging: The Requisites E-Book. Elsevier Health Sciences; 2018.
  [21]. Langlotz Curtis P. RadLex: a new method for indexing online educational materials. 2006.
  [22]. Coden Anni, Danilevsky Marina, Gruhl Daniel, Kato Linda, Nagarajan Meena. A method to accelerate human in the loop clustering. Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM; 2017. pp. 237–245.
  [23]. Bodenreider Olivier. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research. 2004;32(suppl_1):D267–D270. doi: 10.1093/nar/gkh061.
  [24]. Johnson Alistair EW, Pollard Tom J, Lu Shen, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3(1):1–9. doi: 10.1038/sdata.2016.35.

