Abstract
The application of deep learning algorithms in medical imaging analysis is a steadily growing research area. While deep learning methods are thriving in the medical domain, they seldom utilize the rich knowledge associated with connected radiology reports. The knowledge derived from these reports can be utilized to enhance the performance of deep learning models. In this work, we used a comprehensive chest X-ray findings vocabulary to automatically annotate an extensive collection of chest X-rays using associated radiology reports and a vocabulary-driven concept annotation algorithm. The annotated X-rays are used to train a deep neural network classifier for finding detection. Finally, we developed a knowledge-driven reasoning algorithm that leverages knowledge learned from X-ray reports to improve upon the deep learning module's performance on finding detection. Our results suggest that combining deep learning and knowledge from radiology reports in a hybrid framework can significantly enhance overall performance in CXR finding detection.
Introduction
Applications of deep learning (DL) in healthcare cover a broad range of problems, ranging from cancer screening and disease monitoring to personalized treatment suggestions. In the next few years, DL-based applications could become a significant part of routine clinical workflow, assisting healthcare providers in predicting diagnoses, prescribing medications, and suggesting treatments and patient management strategies1. The notion of applying deep learning-based algorithms to medical imaging data is a growing research area. With fast-improving computational power and the availability of enormous amounts of data, deep learning has become the de facto standard for a wide variety of computer vision problems. Currently, in the clinical setting, a large portion of the interpretation of medical images is done by medical experts. The number of medical images that emergency room radiologists have to analyze can be overwhelming. DL techniques can help radiologists sift through the data and analyze medical exams more efficiently. In medical imaging, X-rays are the most common imaging exam conducted in emergency/urgent care. Moreover, several large CXR (chest X-ray) datasets (such as ChestX-ray142, CheXpert3, and MIMIC III4) have recently been made available to the scientific community. Thus, in recent years, we have seen a surge of DL-based algorithms for CXR finding detection.
While DL methods are thriving in the medical domain, they seldom utilize the rich knowledge associated with connected radiology reports. For example, DL models developed for the detection of findings from radiology images such as X-rays do not take into account the statistical correlation between the detected findings. CXR reports are a great source of knowledge, and by analyzing an extensive collection of the reports we can identify patterns between reported findings based on their co-occurrences across the report collection. Such knowledge can improve DL models’ performance by recovering missed findings as well as flagging overcalled ones. Consider the following scenario: by analyzing CXR reports, we have identified that “pleural effusion” and “opacity” have a very high probability of co-occurrence across the report collection. If an image-based deep learning model running on the associated CXR detected only “pleural effusion” and missed “opacity”, we could boost the predictive score for “opacity” by a delta and turn a missed finding into a detected finding. Similarly, we can potentially improve the deep learning model’s performance by utilizing knowledge to decrease the number of false-positive findings.
In this work, we focus on CXR findings. The interpretation of a CXR can diagnose many conditions such as pleural effusion, pneumonia, infiltration, nodule, atelectasis, pulmonary edema, cardiomegaly, pneumothorax, fractures, and many others. We used 473,057 CXR images from 63,480 unique patients and the associated 206,574 radiology reports. We used a comprehensive CXR finding vocabulary to automatically annotate an extensive collection of CXRs using the associated radiology reports and a vocabulary-driven concept annotation algorithm5. The annotated X-rays are then used to train a multi-label deep neural network classifier for CXR finding detection. In the knowledge-driven approach, the correlation between different radiology findings is learned by leveraging the radiology reports to compute the statistical correlation between findings. The reasoning algorithm modifies the prediction scores from the DL models based on the label correlations, using hyperparameter optimization via an exhaustive grid search in a 4-dimensional space. This approach improves the overall performance of DL models, with a 9% relative improvement in the F1-score. The approach of combining deep learning and knowledge-driven reasoning in a hybrid framework will potentially transform existing methods for building computational models.
Related work
One main challenge in DL-driven medical image analysis is the availability of large datasets with reliable ground-truth annotation. Therefore, transfer learning approaches, as proposed by Bar et al.6, were often considered to overcome such problems. Bar et al.6 applied a pre-trained Decaf convolutional neural network model7 to detect chest pathology in X-rays. ChestX-ray142 is one of the largest publicly available CXR datasets, consisting of 112,120 frontal CXR images from 30,805 unique patients. Due to its size, it has received considerable attention in the deep learning community and has been used by several researchers to address the CXR findings classification problem. Wang et al.2 evaluated several state-of-the-art convolutional neural network architectures and reported an average area under the ROC curve (AUC) of 0.75. Gündel et al.8 proposed location-aware dense networks to classify pathologies in CXR images. Yao et al.9 developed a model to exploit dependencies among the disease labels. They used Densenet10 as an encoder for the image and a long short-term memory (LSTM)11 network as a decoder to generate labels. Rajpurkar et al. proposed transfer learning with fine-tuning, using a DenseNet-12112, which raised the AUC results on ChestX-ray14 for multi-label classification even higher. Guan et al.13 considered the problem of multi-label thorax disease classification on CXR images by proposing a Category-wise Residual Attention Learning (CRAL) framework. Wang et al.14 developed the Text-Image Embedding network (TieNet) architecture with the integration of a CNN and an RNN. The TieNet architecture is used for thorax disease classification and reporting in CXR.
Materials and Methods
Dataset
For this work, we have primarily used the CXR collection from MIMIC III4 dataset from the Laboratory for Computational Physiology at MIT. The dataset consists of 63,480 unique patients, 206,574 radiology reports, and 473,057 DICOMs (X-ray images). Each report is associated with one or more images in DICOM format of different views. For a patient, there can be multiple reports including follow-up examinations.
2.1. Preliminaries
In this subsection, we describe our prior work, which is essential for the completeness of this paper.
CXR findings vocabulary
Currently, there is no vocabulary in UMLS14 or outside UMLS that captures all possible findings that radiologists consider in the real clinical setting. To address this problem, we developed a custom, comprehensive CXR vocabulary with the help of clinical experts and radiologists. The vocabulary is a collection of all major radiology findings in CXRs and the modifiers used to describe their anatomical locations, laterality, size, and severity. The vocabulary captures various abbreviations, misspellings, and semantically equivalent ways of describing the same radiology concepts (synonyms and alternate forms), as shown in Table 1. A team of clinicians consisting of three radiologists and one internal medicine doctor was assembled to assist in the vocabulary generation and validation process. We utilized a combination of top-down knowledge (from radiologists and clinical textbooks) and bottom-up knowledge (from a large-scale collection of CXR reports) to identify CXR-related core findings. With this process, we identified about 1500 terms useful for the findings vocabulary. These terms were reviewed and validated by four radiologists. The terms retained in the vocabulary after radiologist review are hereafter referred to as the core findings of the CXR vocabulary.
Table 1.
CXR findings and their mapped related concepts from the findings’ vocabulary
| CXR findings | Associated CXR findings |
| pleural effusion | Pleural effusion, pleura fluid, layering effusions, pleural fluid/thickening, loculating fluid, intrafissural fluid, hemothorax, etc. |
| consolidation | consolidation, conslidation, consolidated, airspace opacity, alveolar opacity, alveolar infiltrate, etc. |
| pneumothorax | pneumothorax, pneumothoraces, ptx, pneumoptx, deep sulcus sign, etc. |
| multiple masses/nodules | multiple masses/nodules, masses, nodules, lesions, rounded densities, miliary pattern, etc. |
CXR report annotation
Deep learning models require a large amount of training data, as a model’s performance is mainly dependent on the quality and size of the dataset. However, the unavailability of annotated datasets is one of the most significant barriers to the success of deep learning in medical imaging. The development of large annotated medical imaging datasets is quite challenging, as the annotation process is time consuming and requires clinical expertise. To address this problem, we developed an auto-annotation methodology that annotates a vast collection of radiology reports with CXR findings using a vocabulary-driven concept annotation algorithm5. Specifically, by selecting all terms from the CXR vocabulary as potential concepts, the concept annotation algorithm was used to spot their occurrence in the sentences extracted from selected sections of the report (‘Findings’ or ‘Overall impression’). Negation is often seen in radiology reports, and it is essential to detect negations accurately to facilitate high-performance core finding detection from the reports. The negation pattern detection algorithm iteratively identifies words within the scope of negation based on dependency parsing. We evaluated the performance of the concept annotation algorithm on 2771 radiology reports with two radiologists. The radiologists manually validated 10,842 findings in the flagged sentences and found only 84 semantically inaccurate detections, leading to an overall finding annotation precision of 96.2%.
Using this algorithm, we annotated the 206K CXR reports from MIMIC III with the core findings from the CXR vocabulary. After the annotation step, we represented each report as a vector of unique annotated core findings. For core findings mentioned in a negative context, the term “no” is added before the core finding. For example, here is a sample core findings vector that represents a report in the format of positive and negated core findings: [“no pulmonary edema”, “aspiration”, “pneumonia”, “no pneumothorax”]. Each report is associated with one or more CXR images. The core findings vectors generated from the 206K reports were used to annotate the associated 473K CXR images from the MIMIC III dataset with CXR findings.
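The report-to-vector step can be sketched as follows. Here, `annotations` (a list of finding/negation-flag pairs) is a hypothetical stand-in for the output of the concept annotation algorithm5:

```python
def to_core_findings_vector(annotations):
    """Represent a report as a list of unique core findings,
    prefixing negated findings with "no"."""
    vector = []
    for finding, is_negated in annotations:
        term = f"no {finding}" if is_negated else finding
        if term not in vector:  # keep core findings unique per report
            vector.append(term)
    return vector

# example annotations corresponding to the sample vector in the text
annotations = [
    ("pulmonary edema", True),
    ("aspiration", False),
    ("pneumonia", False),
    ("pneumothorax", True),
]
print(to_core_findings_vector(annotations))
# ['no pulmonary edema', 'aspiration', 'pneumonia', 'no pneumothorax']
```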
2.2. Experimental set-up
Our objective in this experiment is to predict a large set of CXR findings using a deep neural network classifier and to improve the prediction outcomes using a knowledge-driven reasoning algorithm. We selected a set of 54 labels (Table 2) that represent clinically essential CXR findings and for which we had enough training data (at least 1000 images per label). These labels were reviewed by our clinicians to define their semantics and scope. For the experiment, we used the annotated reports and CXR images mentioned in the previous section. To generate an experimental dataset, we selected the 339,558 CXR images (out of 473,057) from the MIMIC III dataset that are annotated with at least one selected label. The dataset is divided into training (70%), validation (10%), and testing (20%) sets. The data sampling algorithm maintains a similar frequency distribution of labels within the training, validation, and test data. The splitting algorithm sorts the labels by their frequency of occurrence and iteratively assigns the images from distinct patients to the three datasets, taking care to maintain the 70%/10%/20% ratio for the training, validation, and test sets. This dataset is used for the training and testing of the deep neural network classifier. The outcome of the DL module (i.e., the deep neural network classifier’s label predictions on the testing dataset) is given as input to the reasoning module. The reasoning module splits the output from the DL module into training (80%) and testing (20%) datasets. Next, we describe our methodology for computing statistical correlations between the labels and the reasoning algorithm.
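A minimal sketch of a patient-level split (all images of a patient go to exactly one set) that keeps each set near its target fraction. This is a simplification of the described algorithm, which additionally balances per-label frequencies; all names here are illustrative:

```python
import random
from collections import defaultdict

def patient_level_split(image_patient_pairs, fractions=None, seed=0):
    """Assign all images of a patient to exactly one split, greedily
    choosing the split that is furthest below its target size."""
    fractions = fractions or {"train": 0.7, "val": 0.1, "test": 0.2}
    by_patient = defaultdict(list)
    for image_id, patient_id in image_patient_pairs:
        by_patient[patient_id].append(image_id)
    patients = list(by_patient)
    random.Random(seed).shuffle(patients)
    total = len(image_patient_pairs)
    splits = {name: [] for name in fractions}
    for pid in patients:
        # pick the split with the largest remaining deficit vs. its target size
        name = max(fractions, key=lambda s: fractions[s] * total - len(splits[s]))
        splits[name].extend(by_patient[pid])
    return splits

# 50 patients with 2 images each -> exact 70/10/20 image split
pairs = [(f"img{i}", f"pat{i // 2}") for i in range(100)]
splits = patient_level_split(pairs)
```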
Table 2.
List of selected 54 CXR findings labels
| azygous fissure (benign), bone lesion, bullet/foreign bodies, calcified nodule, clavicle fracture, consolidation, contrast in the gi or gu tract, cyst/bullae, degenerative changes, diffuse osseous irregularity, dilated bowel, dislocation, elevated hemidiaphragm, elevated humeral head, enlarged cardiac silhouette, enlarged hilum, fracture, hernia, humerus fracture, hydropneumothorax, hyperaeration, increased reticular markings/ild pattern, linear/patchy atelectasis, lobar/segmental collapse, lobectomy, lymph node calcification, mass/nodule (not otherwise specified), mediastinal displacement, multiple masses/nodules, new fractures (acute fractures), normal anatomically, not otherwise specified calcification, not otherwise specified opacity (pleural/parenchymal opacity), old fractures, osteotomy changes, other soft tissue abnormalities, pleural effusion or thickening, pneumomediastinum, pneumothorax, post-surgical changes, pulmonary edema/hazy opacity, rib fracture, scapula fracture, scoliosis, shoulder osteoarthritis, spinal degenerative changes, spinal fracture, sternal fracture, sub-diaphragmatic air, subcutaneous air, superior mediastinal mass/enlargement, tortuous aorta, vascular calcification, vascular redistribution |
2.3. Compute statistical correlations between labels
In this step, we compute the statistical correlation between labels based on their co-occurrence with other labels across the entire report collection (206K reports). A label can be mapped to multiple core findings (a one-to-many mapping) from the CXR vocabulary. The mapping between labels and core findings is identified and reviewed in the CXR vocabulary generation process (Section 2.1). Clinicians reexamined the mapping between the selected 54 labels and the core findings to ensure its correctness. As mentioned in Section 2.1, each report from the report collection is annotated with core findings and represented as a core findings vector. For the label correlation computation, we took the 206K core findings vectors as input. Using the label to core findings mapping, we transformed all the core findings vectors into label vectors by replacing each core finding with its respective mapped label. Since only a subset of the CXR findings are mapped to the selected labels and not all reports contain findings related to the selected labels, some label vectors were empty after the vector transformation. As we are computing co-occurrence, we retained only the vectors that have at least two labels.
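The vector transformation can be sketched as follows; the mapping entries shown are illustrative examples, not the full one-to-many mapping from the vocabulary:

```python
# hypothetical fragment of the core finding -> label mapping
CORE_FINDING_TO_LABEL = {
    "pleural effusion": "pleural effusion or thickening",
    "pleural fluid": "pleural effusion or thickening",
    "airspace opacity": "consolidation",
    "consolidation": "consolidation",
}

def to_label_vector(core_findings):
    """Replace each mapped core finding with its label; drop unmapped
    findings and duplicate labels."""
    labels = []
    for finding in core_findings:
        label = CORE_FINDING_TO_LABEL.get(finding)
        if label and label not in labels:
            labels.append(label)
    return labels

report_vectors = [
    ["pleural effusion", "airspace opacity"],
    ["pleural fluid"],          # maps to a single label
    ["fever"],                  # unmapped finding -> empty label vector
]
label_vectors = [to_label_vector(v) for v in report_vectors]
# keep only vectors with at least two labels (needed for co-occurrence)
label_vectors = [v for v in label_vectors if len(v) >= 2]
```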
In the next step, for each selected label, we scan all the report label vectors and keep a count of the number of times the label co-occurs with each of the other labels in the label set. Here is a label co-occurrence count example (Table 3) showing that "pleural effusion or thickening" co-occurred with "opacity" in 53,338 label vectors and with "linear/patchy atelectasis" in 20,389 label vectors.
Table 3.
Label co-occurrence example
| Label | Co-occurred labels | Co-occurrence count |
| pleural effusion or thickening | opacity | 53,338 |
|  | linear/patchy atelectasis | 20,389 |
|  | not otherwise specified opacity (pleural/parenchymal opacity) | 16,766 |
|  | consolidation | 7,734 |
|  | lobar/segmental collapse | 4,114 |
|  | chest tube | 2,466 |
Finally, we normalized the co-occurrence counts, rescaling them into the range [0, 1], where 1 is the highest co-occurrence and 0 the lowest. We used the feature scaling (min-max) normalization formula Xnorm = (X − Xmin) / (Xmax − Xmin), where X is the original co-occurrence frequency, Xnorm is the normalized co-occurrence value, and Xmin and Xmax are the minimum and maximum co-occurrence frequencies.
An important point to note is that the normalized co-occurrence between two labels is a function of each label’s co-occurrence frequencies with all other labels. Thus, two labels (A, B) may have different normalized co-occurrence values for (label A, label B) and (label B, label A). To illustrate this fact, consider the following 7 documents, represented as label vectors over labels L1–L5.
Doc 1: {L1, L3, L5}
Doc 2: {L1, L3}
Doc 3: {L1, L3, L4}
Doc 4: {L1, L4, L5}
Doc 5: {L1, L4}
Doc 6: {L1, L3}
Doc 7: {L3, L4}
We calculated the normalized co-occurrence between labels using the feature scaling normalization formula (Table 4). In this example, normalized co-occurrence values for L1 and L3 are the same i.e. (L1, L3) = (L3, L1) while for L3 and L4 they are different i.e. (L3, L4) ≠ (L4, L3).
Table 4.
Normalized label co-occurrence using feature scaling
| Metric | Values |
| Co-occurrence frequency L1 | {L2 = 0, L3 = 4, L4 = 3, L5 =2} |
| Normalized co-occurrence values for L1 | (L1, L2) = 0, (L1, L3) = 1, (L1, L4) = 0.75, and (L1, L5) = 0.50 |
| Co-occurrence frequency L3 | {L1 = 4, L2 = 0, L4 = 2, L5 =1} |
| Normalized co-occurrence values for L3 | (L3, L1) = 1, (L3, L2) = 0, (L3, L4) = 0.50, and (L3, L5) = 0.25 |
| Co-occurrence frequency L4 | {L1 = 3, L2 = 0, L3 = 2, L5 =1} |
| Normalized co-occurrence values for L4 | (L4, L1) = 1, (L4, L2) = 0, (L4, L3) = 0.66, and (L4, L5) = 0.33 |
We followed the same steps to compute the normalized co-occurrence between all label pairs for the selected 54 labels using the report label vectors.
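The counting and feature-scaling steps can be reproduced on the toy documents above with a short script (a sketch; the real computation runs over the 206K report label vectors):

```python
from collections import Counter

# the seven toy documents from the example above, as label sets
docs = [
    {"L1", "L3", "L5"}, {"L1", "L3"}, {"L1", "L3", "L4"},
    {"L1", "L4", "L5"}, {"L1", "L4"}, {"L1", "L3"}, {"L3", "L4"},
]
labels = ["L1", "L2", "L3", "L4", "L5"]

def normalized_cooccurrence(label):
    """Feature-scaled co-occurrence of `label` with every other label."""
    counts = Counter({other: sum(1 for d in docs if label in d and other in d)
                      for other in labels if other != label})
    lo, hi = min(counts.values()), max(counts.values())
    # guard against a zero range when all counts are equal
    return {other: (c - lo) / ((hi - lo) or 1) for other, c in counts.items()}

print(normalized_cooccurrence("L4"))
# reproduces Table 4: (L4, L1) = 1.0, (L4, L2) = 0.0, (L4, L3) ~ 0.66, (L4, L5) ~ 0.33
```

Note the asymmetry: (L3, L4) = 0.50 while (L4, L3) is about 0.66, because each label is normalized against its own minimum and maximum counts.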
2.4. Deep learning Module
As mentioned in the experimental set-up (Section 2.2), the DL module used the dataset of 339,558 CXR images labeled with the selected 54 labels. The architecture of our DL model (deep neural network classifier)15 is shown in Figure 1. It combines the advantages of pre-trained features with multi-resolution image analysis through a feature pyramid network16 for fine-grained classification. Specifically, VGGNet (16 layers)17 and ResNet (50 layers)18 were used as the initial feature extractors, trained on the multi-million image collection from ImageNet19. Dilated blocks composed of multi-scale features20 and skip connections21 were used to improve convergence, while spatial dropout was used to reduce overfitting. Group normalization (16 groups)22 is used, along with a Rectified Linear Unit (ReLU) as the activation function. Dilated blocks with different feature channels are cascaded with max pooling to learn more abstract features, and bilinear pooling is used for effective fine-grained classification23.
Once trained, the neural network can be used to predict the likelihood of a label in a given image. To report as few irrelevant findings as possible while still detecting all critical findings, we select operating points (thresholds) on the ROC curve per label such that an objective function reflecting this tradeoff is optimized. Specifically, we form an objective function by averaging the F1 score per image i across all n images of a validation set. The neural network architecture shown in Figure 1 was used to train and predict the 54 findings on the dataset. The images were of high resolution (1024x1024), so training completed in approximately 8 days using multi-GPU machines. We used the F1-score for optimizing the per-label thresholds to maximize the DL module’s performance.
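Per-label threshold selection can be sketched as follows (illustrative only; the paper optimizes an average of per-image F1 scores over the validation set, and the grid granularity here is an assumption):

```python
def f1_score(pred, truth):
    """F1 for two equal-length binary sequences."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, truth, grid=None):
    """Choose the operating point for one label that maximizes validation F1."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    best = max(grid, key=lambda t: f1_score([s >= t for s in scores], truth))
    return best, f1_score([s >= best for s in scores], truth)

# toy validation data for one label: two positives score high, two negatives low
threshold, f1 = best_threshold([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0])
```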
Figure 1.
Deep learning module’s neural network architecture for CXR finding detection
2.5. Knowledge-driven Reasoning
The objective of the reasoning algorithm is to improve upon the DL model’s label prediction outcome by leveraging the knowledge about label co-occurrences. The outcome of the DL module (i.e., the neural network classifier’s label prediction for each image in the testing dataset) is given as input to the reasoning module. The label prediction vector contains the DL module’s probabilistic prediction values (between 0 and 1) for all 54 labels. The DL module also provides a threshold for each label, chosen to maximize the model’s performance. The threshold determines whether a predicted label is true (1) or false (0): if the predicted value for the label is greater than or equal to the threshold, the label is true; otherwise, it is false. The reasoning module splits the output from the DL module into training (80%) and testing (20%) datasets. The training dataset is used for the reasoning algorithm development, while the testing dataset is used to test the performance of the reasoning algorithm on unseen data.
Following are the major steps involved in the algorithm:
- For each image in the development set, transform the DL module’s output prediction vector into a binary vector (0/1) using the labels’ threshold values.
- Compare the DL module’s predicted binary vector (from the previous step) with the ground truth binary vector. Based on this comparison, compute the DL module’s performance (precision, recall, and F1-score). This is the baseline performance for the reasoning algorithm.
- The reasoning algorithm modifies the label prediction values from the DL module based on label co-occurrence scores. The four algorithm parameters are:
- Label co-occurrence boosting threshold: This parameter signifies the threshold for normalized label co-occurrence score above which the algorithm boosts (increases) the predictive value of other label classes. In other words, for a given true label_a, the algorithm increases the predictive value of other label classes if they have a normalized co-occurrence score with the label_a above the boosting threshold.
- Label Co-occurrence discounting threshold: This parameter signifies the threshold for normalized label co-occurrence score below which the algorithm discounts (decreases) the predictive value of other label classes. In other words, for a given true label_a, the algorithm decreases the predictive value of other label classes if they have a normalized co-occurrence score with the label_a below the discounting threshold.
- Boosting delta: This parameter signifies the value by which the algorithm increases the predictive value of co-occurred label classes.
- Discounting delta: This parameter signifies the value by which the algorithm decreases the predictive value of co-occurred label classes.
- In a nutshell, given a class label (radiology finding) is true, the algorithm
- boosts the predictive value of other label classes that have the normalized label co-occurrence score above the Label co-occurrence boosting threshold
- discounts the predictive value of other label classes that have the normalized label co-occurrence score below the Label Co-occurrence discounting threshold.
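The boosting/discounting step above can be sketched as follows (a simplified sketch; the variable names, clipping to [0, 1], and strict-inequality handling at the thresholds are assumptions):

```python
def adjust_predictions(pred, thresholds, cooc, boost_thr, disc_thr,
                       boost_delta, disc_delta):
    """pred: label -> DL score in [0, 1]; thresholds: label -> operating point;
    cooc[a][b]: normalized co-occurrence score of label b given label a."""
    adjusted = dict(pred)
    # labels the DL module already calls true at their operating points
    true_labels = [l for l, score in pred.items() if score >= thresholds[l]]
    for a in true_labels:
        for b, score in cooc.get(a, {}).items():
            if b == a:
                continue
            if score > boost_thr:        # strong co-occurrence: boost
                adjusted[b] = min(1.0, adjusted[b] + boost_delta)
            elif score < disc_thr:       # weak co-occurrence: discount
                adjusted[b] = max(0.0, adjusted[b] - disc_delta)
    return adjusted

# toy example: "pleural effusion or thickening" is true, so the strongly
# co-occurring "consolidation" is boosted and the rarely co-occurring
# "fracture" is discounted
pred = {"pleural effusion or thickening": 0.8, "consolidation": 0.45, "fracture": 0.40}
thresholds = dict.fromkeys(pred, 0.5)
cooc = {"pleural effusion or thickening": {"consolidation": 0.9, "fracture": 0.05}}
out = adjust_predictions(pred, thresholds, cooc, 0.7, 0.1, 0.1, 0.1)
```

With these parameters, "consolidation" crosses its 0.5 operating point (0.45 to 0.55), turning a missed finding into a detected one, as in the scenario described in the introduction.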
The objective function is to find optimal values for these 4 parameters that will maximize the performance of the algorithm. To do so, we use a hyperparameter optimization technique by performing an exhaustive grid search in a 4-(parameter) dimensional space.
In order to identify the optimal values for the 4 parameters, the algorithm iterates over multiple combinations of possible values of each parameter. Following are (Table 5) the ranges and delta by which these parameters are increased in each iteration.
Table 5.
Parameter ranges for the exhaustive grid search (total number of algorithm iterations: 60 × 20 × 50 × 50 = 3 million; algorithm complexity n^4)
| Parameter | Range | Delta increase in each iteration | Iterations |
| Label co-occurrence boosting threshold | 0.4–1.0 | 0.01 | 60 |
| Label co-occurrence discounting threshold | 0–0.2 | 0.01 | 20 |
| Boosting delta | 0.01–0.5 | 0.01 | 50 |
| Discounting delta | 0.01–0.5 | 0.01 | 50 |
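The grid in Table 5 can be generated as below; the exact range endpoints are assumptions chosen to reproduce the reported 60/20/50/50 iteration counts:

```python
import itertools

def frange(start, stop, step):
    """Inclusive float range, rounded to avoid float accumulation error."""
    n = round((stop - start) / step)
    return [round(start + i * step, 4) for i in range(n + 1)]

axes = {
    "boost_thr":   frange(0.40, 0.99, 0.01),  # 60 values
    "disc_thr":    frange(0.00, 0.19, 0.01),  # 20 values
    "boost_delta": frange(0.01, 0.50, 0.01),  # 50 values
    "disc_delta":  frange(0.01, 0.50, 0.01),  # 50 values
}
n_combinations = 1
for values in axes.values():
    n_combinations *= len(values)
print(n_combinations)  # 3000000

# the exhaustive search evaluates every combination on the training split:
first = next(itertools.product(*axes.values()))
```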
Results
We used a vocabulary-driven concept annotation algorithm, enhanced with natural language processing, to annotate the 206K CXR reports from the MIMIC III dataset with the core findings from the vocabulary. We represented these reports as core findings vectors. To compute the statistical correlation between the selected 54 labels, we transformed the core findings vectors into label vectors based on the label to core findings mapping. We calculated the normalized co-occurrence score between labels based on their co-occurrence with other labels across the entire report collection. We used a deep neural network architecture for finding detection and optimized the DL module’s performance on the F1-score. The outcome of the DL model is given as input to the reasoning module. The reasoning algorithm modifies the prediction scores from the DL model based on the label correlations, using hyperparameter optimization via an exhaustive grid search in a 4-dimensional space. We calculated the mean average label-based precision, recall, and F1-score of the detected labels for both the DL module and the reasoning module. The reasoning algorithm takes the DL module’s performance as the baseline and improves upon it by investigating different co-occurrence configurations through hyperparameter optimization. The algorithm performs 3 million iterations to find the optimal values of its four parameters that maximize overall performance (F1-score).
In the “boosting only” configuration, for a true label, the algorithm boosts (increases) the predictive value of other label classes (by the boosting delta) that have a normalized label co-occurrence score above the label co-occurrence boosting threshold. In the “discounting only” configuration, for a true label, the algorithm discounts (decreases) the predictive value of other label classes (by the discounting delta) that have a normalized label co-occurrence score below the label co-occurrence discounting threshold. In the “boosting and discounting” configuration, both boosting and discounting are performed. The results of the experiments are shown in Table 6. The last column of Table 6 shows the relative percentage improvement over the baseline F1-score, calculated using the percent increase formula ((new value − old value)/(old value)) × 100. The algorithm improved the baseline F1-score by 1.51% with the “boosting only” configuration. We observed a larger improvement in the F1-score (7.30%) with the “discounting only” configuration. With the combined “boosting and discounting” configuration, the reasoning algorithm significantly (9.09%) improves upon the baseline deep learning module’s performance.
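The relative-improvement column can be reproduced directly from the F1 values in Table 6:

```python
def percent_increase(new, old):
    """((new - old) / old) * 100, as used for the last column of Table 6."""
    return (new - old) / old * 100

baseline_f1 = 0.7733
print(round(percent_increase(0.7850, baseline_f1), 2))  # boosting only: 1.51
print(round(percent_increase(0.8436, baseline_f1), 2))  # boosting and discounting: 9.09
```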
Table 6.
Performance of deep learning module as a baseline and performance of reasoning algorithm with different configurations of the reasoning algorithm
| Configuration | Algorithm | Precision | Recall | F1-Score | % improvement over baseline F1-Score |
| Baseline | DL Module | 0.7822 | 0.7647 | 0.7733 | - |
| With boosting only | Reasoning | 0.7999 | 0.7708 | 0.7850 | 1.51% |
| With discounting only | Reasoning | 0.8407 | 0.8193 | 0.8298 | 7.30% |
| With boosting and discounting (Overall) | Reasoning | 0.8540 | 0.8335 | 0.8436 | 9.09% |
Discussion and Conclusion
Currently, radiologists are overwhelmed with the rising number of medical images that they need to interpret per day. With the promise of AI and the exponential growth of deep learning in the medical image analysis field, there is hope that, in the near future, some of the radiologists’ workload can be alleviated by DL-based radiologist assistants. In this work, we focus on CXRs, as they are the most common imaging modality read by radiologists in hospitals and teleradiology practices. DL methods are state-of-the-art techniques for CXR finding detection. While there are many papers published in the area of CXR finding detection using DL techniques, there is not much focus on leveraging knowledge from the associated radiology reports to improve DL models’ performance. CXR reports are a great source of knowledge. Tapping into such knowledge and using it in conjunction with DL models can further advance the success of DL techniques in the medical image analysis field. In this paper, we presented a knowledge-driven reasoning algorithm that leverages knowledge learned from X-ray reports to improve upon the DL module’s performance for CXR finding detection. The reasoning algorithm takes the DL module’s performance as the baseline and improves upon it by investigating different co-occurrence configurations through hyperparameter optimization. The reasoning algorithm significantly improves upon the baseline deep learning module’s F1-score, by 7.30% with the “discounting only” configuration and by 9.09% with the combined “boosting and discounting” configuration. These results indicate that the reasoning algorithm improves overall performance by considerably reducing false positives.
Our results suggest that combining deep learning with knowledge from radiology reports as extra information in a hybrid framework can further enhance the overall system’s performance in finding detection. Building upon prior observations is an essential source of knowledge in learning. “Knowledge-Assisted Learning”, the blending of deep/machine learning with knowledge, can be very powerful, as shown in this study. This work can be further extended in multiple directions. One possible direction is to investigate the utilization of other information features from the radiology reports that can be incorporated into the knowledge-assisted deep learning framework, such as demographic information (gender, age groups), the reason for the visit, and mentioned signs and symptoms. DL algorithms are like a black box, and they often ignore domain knowledge and structure in favor of massive datasets. The features discovered by deep learning exhibit complexity and subtlety that make them challenging to analyze and understand. The utilization of relevant knowledge facilitates explainability. This hybrid framework will potentially promote the use of knowledge in building computational models. While the impact of this approach is transferable to (and replicable in) other domains, the implications are potentially immense for the healthcare domain. We hope that this work provides a step towards incorporating knowledge from radiology reports into traditional image-only deep learning frameworks.
Figures & Table
References
- 1.Razzak MI, Naz S, Zaib A. In Classification in BioApps. Springer: Cham; 2018. Deep learning for medical image processing: Overview, challenges and the future; pp. 323–350. [Google Scholar]
- 2.Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. pp. 2097–2106.
- 3.Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, Marklund H, Haghgoo B, Ball R, Shpanskaya K, Seekins J. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence. 2019 Jul 17;33:590–597. [Google Scholar]
- 4.Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Scientific data. 2016 May 24;3:160035. doi: 10.1038/sdata.2016.35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Guo Y, Kakrania D, Baldwin T, Syeda-Mahmood T. Efficient clinical concept extraction in electronic medical records. In Thirty-First AAAI Conference on Artificial Intelligence. 2017 Feb 12.
- 6.Bar Y, Diamant I, Wolf L, Lieberman S, Konen E, Greenspan H. 2015 IEEE 12th international symposium on biomedical imaging (ISBI) IEEE; 2015 Apr 16. Chest pathology detection using deep learning with non-medical training; pp. 294–297. [Google Scholar]
- 7.Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning. 2014 Jan 27. pp. 647–655.
- 8.Guendel S, Grbic S, Georgescu B, Liu S, Maier A, Comaniciu D. Iberoamerican Congress on Pattern Recognition. Springer: Cham; 2018 Nov 19. Learning to recognize abnormalities in chest x-rays with location-aware dense networks; pp. 757–765. [Google Scholar]
- 9.Yao L, Poblenz E, Dagunts D, Covington B, Bernard D, Lyman K. Learning to diagnose from scratch by exploiting dependencies among labels. arXiv preprint arXiv:1710.10501. 2017 Oct 28. [Google Scholar]
- 10.Iandola F, Moskewicz M, Karayev S, Girshick R, Darrell T, Keutzer K. Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869. 2014 Apr 7. [Google Scholar]
- 11.Graves A. InSupervised sequence labelling with recurrent neural networks. Springer, Berlin, Heidelberg: 2012. Long short-term memory; pp. 37–45. [Google Scholar]
- 12.Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. InProceedings of the IEEE conference on computer vision and pattern recognition. 2017:4700–4708. [Google Scholar]
- 13.Guan Q, Huang Y. Multi-label chest X-ray image classification via category-wise residual attention learning. Pattern Recognition Letters. 2018 Oct 23.
- 14.Unified Medical Language System (UMLS) https://www.nlm.nih.gov/research/umls/index.html .
- 15.Wang C, Moradi M, Wu J, Pillai A, Sharma A, et al. A Robust Network Architecture to Detect Normal Chest X- Ray Radiographs. IEEE International Symposium on Biomedical Imaging; 2020. [Google Scholar]
- 16.Lin TY, Dollar P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. pp. 2117–2125.
- 17.Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014 Sep 4. [Google Scholar]
- 18.He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. pp. 770–778.
- 19.Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. In2009 IEEE conference on computer vision and pattern recognition. Ieee; 2009 Jun 20. Imagenet: A large-scale hierarchical image database; pp. 248–255. [Google Scholar]
- 20.Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. 2015 Nov 23. [Google Scholar]
- 21.He K, Zhang X, Ren S, Sun J. European conference on computer vision. Springer: Cham; 2016 Oct 8. Identity mappings in deep residual networks; pp. 630–645. [Google Scholar]
- 22.Wu Y, He K. Group normalization. InProceedings of the European Conference on Computer Vision (ECCV) 2018. pp. 3–19.
- 23.Lin TY, Maji S. Improved bilinear pooling with CNNs. arXiv preprint arXiv:1707.06772. 2017 Jul 21. [Google Scholar]

