[Preprint]. 2023 Oct 10:arXiv:2306.08749v2. Originally published 2023 Jun 14. [Version 2]

Utilizing Longitudinal Chest X-Rays and Reports to Pre-Fill Radiology Reports*

Qingqing Zhu 1, Tejas Sudharshan Mathai 2, Pritam Mukherjee 2, Yifan Peng 3, Ronald M Summers 2, Zhiyong Lu 1
PMCID: PMC10370215  PMID: 37502627

Abstract

Despite the reduction in turn-around times in radiology reporting with the use of speech recognition software, persistent communication errors can significantly impact the interpretation of radiology reports. Pre-filling a radiology report holds promise for mitigating reporting errors, yet despite multiple efforts in the literature to generate comprehensive medical reports, few approaches exploit the longitudinal nature of patient visit records in the MIMIC-CXR dataset. To address this gap, we propose to use longitudinal multi-modal data, i.e., the previous visit CXR, the current visit CXR, and the previous visit report, to pre-fill the “findings” section of the patient’s current visit report. We first gathered the longitudinal visit information for 26,625 patients from the MIMIC-CXR dataset and created a new dataset called Longitudinal-MIMIC. With this new dataset, a transformer-based model was trained to capture the multi-modal longitudinal information from patient visit records (CXR images + reports) via a cross-attention-based multi-modal fusion module and a hierarchical memory-driven decoder. In contrast to previous works that use only current visit data as input to train a model, our work exploits the longitudinal information available to pre-fill the “findings” section of radiology reports. Experiments show that our approach outperforms several recent approaches. Code will be published at https://github.com/CelestialShine/Longitudinal-Chest-X-Ray

Keywords: Chest X-Rays, Radiology reports, Longitudinal data, Report Pre-Filling, Report Generation

1. Introduction

In current radiology practice, a signed report is often the primary means of communicating the results of a radiological imaging exam. Speech recognition software (SRS), which converts dictated words or sentences into text in a report, is widely used by radiologists. Although SRS reduces the turn-around times for radiology reports, the task of correcting any transcription errors falls to the radiologists themselves. Persistent report communication errors due to SRS can significantly impact report interpretation and can also have dire consequences for radiologists in terms of medical malpractice [1]. These errors are most common for cross-sectional imaging exams (e.g., CT, MR) and chest radiography [2]. Problems also arise when re-examining the results of external examinations and in interventional radiology procedural reports. Such errors stem from many factors, including SRS substituting the nearest match for a dictated word, the lack of natural language processing (NLP) for real-time recognition and dictation conversion [2], and unnoticed typographical mistakes. To mitigate these errors, a promising alternative is to automatically pre-fill a radiology report with salient information for a radiologist to review. This also enables standardized reporting via structured reporting.

A number of methods to generate radiology reports have been proposed previously, with significant focus on CXR images [3,4,5,6,7,8,9,10,11]. Various attention mechanisms were proposed [4,12,6] to drive the encoder and decoder to emphasize more informative words in the report, or visual regions in the CXR, and thereby improve generation accuracy. Other approaches [8,9,10] effectively used Transformer-based models with memory matrices to store salient information and enhance report generation quality. Despite these advances, there has been scarce research into harnessing the potential of longitudinal patient visits for improved patient care. In practice, CXR images from multiple patient visits are usually examined simultaneously to find interval changes; e.g., a radiologist may compare a patient’s current CXR to a previous CXR and identify deterioration or improvement in the lungs due to pneumonia. Reports from longitudinal visits contain valuable information regarding the patient’s history, and harnessing this longitudinal multi-modal data is vital for the automated pre-filling of a comprehensive “findings” section in the report.

In this work, we propose to use longitudinal multi-modal data, i.e., the previous visit CXR, the current visit CXR, and the previous visit report, to pre-fill the “findings” section of the patient’s current visit report. To do so, we first gathered the longitudinal visit information for 26,625 patients from the MIMIC-CXR dataset [13] and created a new dataset called Longitudinal-MIMIC. Using this new dataset, we trained a transformer-based model containing a cross-attention-based multi-modal fusion module and a hierarchical memory-driven decoder to capture the features of longitudinal multi-modal data (CXR images + reports). In contrast to current approaches that use only the current visit data as input, our model exploits the longitudinal information available to pre-fill the “findings” section of reports with accurate content. Experiments conducted with the proposed dataset and model validate the utility of our approach. Our main contribution in this work is a transformer-based model that fully exploits longitudinal multi-modal patient visit data to pre-fill the “findings” section of reports.

2. Methods

Dataset.

The construction of the Longitudinal-MIMIC dataset involved several steps, starting with the MIMIC-CXR dataset, a large publicly available dataset of 377,110 chest X-ray images corresponding to 227,835 radiographic reports from 65,379 patients [13]. The first step was to pre-process MIMIC-CXR to ensure consistency with prior works [8,9]. Specifically, patient visits whose report did not contain a “findings” section were excluded. Each remaining patient visit had at least one chest X-ray image (frontal, lateral, or other view) and a corresponding medical report. In our work, we only pre-filled the “findings” section of reports.

Next, the pre-processed dataset was partitioned into training, validation, and test sets using the official split provided with the MIMIC-CXR dataset. Table 1 shows that 26,625 patients in MIMIC-CXR had ≥2 visit records, providing a large cohort of patients with longitudinal study data suited to our goal of pre-filling radiology reports. For patients with ≥2 visits, consecutive pairs of visits were used to capture richer longitudinal information. The dataset was then arranged chronologically based on the “StudyDate” attribute present in MIMIC-CXR. “StudyDate” denotes a de-identified date linked to a radiographic study; although the dates are anonymized, they maintain a consistent chronological sequence for each individual patient.

Table 1:

A breakdown of the MIMIC-CXR dataset to show the number of patients with a specific number of visit records.

# visit records    1        2        3       4       5       >5
# patients         33,922   10,490   5,079   3,021   1,968   6,067

Following this, patients with ≥2 visit records were selected, resulting in 26,625 patients in the final Longitudinal-MIMIC dataset with a total of 94,169 samples. Each sample used during model training consisted of the current visit CXR, current visit report, previous visit CXR, and the previous visit report. The final dataset was divided into training (26,156 patients and 92,374 samples), validation (203 patients and 737 samples), and test (266 patients and 2,058 samples) splits. We aimed to create the Longitudinal-MIMIC dataset to enable the development and evaluation of models leveraging multi-modal data (CXR + reports) from longitudinal patient visits.
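To make the pairing step concrete, the sketch below shows one hypothetical way of assembling consecutive-visit samples from per-visit metadata with pandas. The column names (subject_id, StudyDate, image_path, findings) and the build_longitudinal_pairs helper are illustrative assumptions, not the released preprocessing code.

```python
# Hypothetical sketch of how consecutive-visit pairs could be assembled from
# MIMIC-CXR metadata; column names are assumptions, not the authors' code.
import pandas as pd

def build_longitudinal_pairs(records: pd.DataFrame) -> pd.DataFrame:
    """Pair each visit with the immediately preceding visit of the same patient."""
    records = records.sort_values(["subject_id", "StudyDate"])
    pairs = []
    for _, visits in records.groupby("subject_id"):
        if len(visits) < 2:
            continue  # patients with a single visit are excluded
        visits = visits.reset_index(drop=True)
        for i in range(1, len(visits)):
            prev, curr = visits.iloc[i - 1], visits.iloc[i]
            pairs.append({
                "prev_image": prev["image_path"],
                "prev_report": prev["findings"],
                "curr_image": curr["image_path"],
                "curr_report": curr["findings"],  # target to pre-fill
            })
    return pd.DataFrame(pairs)
```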

Model Architecture.

Figure 1 shows the pipeline to generate a pre-filled “findings” section in the current visit report $R_C$, given the current visit CXR image $I_C$, the previous visit CXR image $I_P$, and the previous visit report $R_P$. Mathematically, we can write: $p(R_C \mid I_C, I_P, R_P) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1}, I_C, I_P, R_P)$, where $w_t$ is the $t$-th word in the current report and $T$ is its length.

Fig. 1:

Our proposed approach uses the CXR image and report from a previous patient visit and the current visit CXR image to pre-fill the “findings” section of the current visit report. The transformer-based model uses a cross-attention-based multi-modal fusion module and a hierarchical memory-driven decoder to generate the required text.
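To illustrate the factorization above, the following minimal sketch performs greedy autoregressive decoding conditioned on the two CXR images and the previous report. The model interface and helper name are assumptions for illustration, not the authors' released code; `model` is assumed to return next-word logits for each position.

```python
# Minimal greedy-decoding sketch of p(R_C | I_C, I_P, R_P): words of the current
# "findings" section are generated one at a time, each conditioned on the two
# CXR images, the previous report, and the words generated so far.
import torch

@torch.no_grad()
def greedy_prefill(model, I_C, I_P, R_P, bos_id, eos_id, max_len=100):
    words = [bos_id]                                   # pre-determined starting symbol
    for _ in range(max_len):
        logits = model(I_C, I_P, R_P, torch.tensor([words]))  # (1, t, vocab) next-word logits
        next_word = int(logits[0, -1].argmax())
        if next_word == eos_id:
            break
        words.append(next_word)
    return words[1:]                                   # generated "findings" tokens
```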

Encoder.

Our model uses an Image Encoder and a Text Encoder to process the CXR images and text input separately. Both encoders were based on transformers. First, a pre-trained ResNet-101 [14] extracted image features $F = [f_1, \ldots, f_S]$ from the CXR images, where $S$ is the number of patch features. These were then passed to the Image Encoder, which consisted of a stack of transformer blocks. The encoded output was a list of encoded hidden states $H = [h_1, \ldots, h_S]$. The CXR images from the previous and current visits were encoded in the same manner, and their hidden states are denoted by $H_{I_P}$ and $H_{I_C}$, respectively.
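A minimal PyTorch sketch of such an image branch is shown below, assuming a ResNet-101 backbone whose spatial grid provides the $S$ patch features and a standard Transformer encoder on top. The ImageEncoder class and its layer sizes are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the image branch: ResNet-101 patch features refined by a Transformer encoder.
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):
    def __init__(self, d_model=512, num_layers=3, nhead=8):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep the spatial grid
        self.proj = nn.Linear(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.cnn(images)                    # (B, 2048, h, w)
        feats = feats.flatten(2).transpose(1, 2)    # (B, S, 2048), S = h*w patch features
        return self.encoder(self.proj(feats))       # hidden states H = [h_1, ..., h_S]
```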

The Text Encoder encoded the text input into language feature embeddings using a previously published method [15]. First, the previous visit report $R_P$ was tokenized into a sequence of $M$ tokens, which were then transformed into vector representations $V = [v_1, \ldots, v_M]$ using a lookup table [16]. These were fed to the Text Encoder, which had the same architecture as the Image Encoder but with distinct network parameters. The final text feature embedding was defined as $H_{R_P} = \theta_{RE}(V)$, where $\theta_{RE}$ refers to the parameters of the report text encoder.
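A companion sketch of the text branch, under the same assumptions, maps tokens to vectors with an embedding lookup table and encodes them with a separately parameterized Transformer encoder. The TextEncoder class and its sizes are illustrative.

```python
# Sketch of the text branch: lookup-table embeddings encoded by a Transformer encoder
# with its own parameters (theta_RE).
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_layers=3, nhead=8):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, d_model)            # token -> vector v_m
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids):                                  # (B, M) token indices
        V = self.lookup(token_ids)                                 # (B, M, d_model)
        return self.encoder(V)                                     # H_{R_P} = theta_RE(V)
```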

Cross-Attention Fusion Module.

A multi-modal fusion module integrated the longitudinal representations of images and texts using a cross-attention mechanism [17], defined as: $H_{I_P}^{*} = \mathrm{softmax}\!\left(\frac{q(H_{I_P})\, k(H_{R_P})^{\top}}{\sqrt{d_k}}\right) v(H_{R_P})$ and $H_{R_P}^{*} = \mathrm{softmax}\!\left(\frac{q(H_{R_P})\, k(H_{I_P})^{\top}}{\sqrt{d_k}}\right) v(H_{I_P})$, where $q(\cdot)$, $k(\cdot)$, and $v(\cdot)$ are linear transformation layers applied to the input features, and $d_k$ is the number of attention heads used for normalization. Finally, $H_{I_P}^{*}$ and $H_{R_P}^{*}$ were concatenated to obtain the multi-modal longitudinal representation $H_L$.
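The sketch below is one way to realize this cross-attention fusion in PyTorch, using nn.MultiheadAttention in place of explicit q/k/v projections; it is an interpretation of the formulas above rather than the authors' implementation.

```python
# Cross-attention fusion: each modality attends to the other, then the two
# attended representations are concatenated to form the longitudinal embedding H_L.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, H_IP, H_RP):
        # H_IP*: previous-image features attending to previous-report features
        H_IP_star, _ = self.img_to_txt(query=H_IP, key=H_RP, value=H_RP)
        # H_RP*: previous-report features attending to previous-image features
        H_RP_star, _ = self.txt_to_img(query=H_RP, key=H_IP, value=H_IP)
        # concatenate along the sequence dimension to obtain H_L
        return torch.cat([H_IP_star, H_RP_star], dim=1)
```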

Hierarchical Decoder with Memory.

Our model’s backbone decoder is a Transformer decoder with multiple blocks (the architecture of an example block is shown in the supplementary material). The first block takes the partial output embedding $H_O$ as input during training, and a pre-determined starting symbol during testing. Subsequent blocks use the output of the previous block as input. To incorporate the encoded $H_L$ and $H_{I_C}$, each block has a hierarchical structure that divides it into two sub-blocks: $D_I$ and $D_L$.

Sub-block-1 ($D_I$) uses $H_{I_C}$ and consists of a self-attention layer, an encoder-decoder attention layer, and feed-forward layers. It also employs residual connections and conditional layer normalization [8]. The encoder-decoder attention layer performs multi-head attention over $H_{I_C}$. Following [8], we adopted a memory matrix $M$ to store the output over multiple generation steps and record important pattern information; the memory representations store not only the information of the generated current report over time in the decoder, but also information across the different encoders. We then enhance $M$ by aligning it with $H_{I_C}$ to create an attention-aligned memory matrix $M_{I_C}$. Different from [8], we use $M_{I_C}$, rather than $M$, when transforming the normalized data. The decoding process of sub-block-1 $D_I$ is formalized as $H_{dec,b,I} = D_I(H_O, H_{I_C}, M_{I_C})$, where $b$ denotes the block index. The output of sub-block-1 is combined with $H_O$ through a fusion layer: $H_{dec,b} = (1 - \beta) H_O + \beta H_{dec,b,I}$, where $\beta$ is a hyper-parameter that balances $H_O$ and $H_{dec,b,I}$. In our experiments, we set it to 0.2.

The input to sub-block-2 ($D_L$) is $H_{dec,b}$. Its structure is similar to sub-block-1, but it interacts with $H_L$ instead of $H_{I_C}$. Its output, $H_{dec,b,L}$, is combined with $H_{dec,b,I}$ by adding them together. After fusing these embeddings and applying conventional layer normalization, the result is used as the output of the block. The output of each block serves as the input to the next block. After $N$ blocks, the final hidden states are passed through a Linear and Softmax layer to obtain the target report probability distributions.
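The following simplified sketch illustrates the two-stage structure of one decoder block: attention over $H_{I_C}$ in sub-block-1, $\beta$-weighted fusion with the block input, and attention over $H_L$ in sub-block-2. For brevity it omits the memory matrix and memory-driven conditional layer normalization of [8], so it should be read as an approximation of the described block rather than the full model.

```python
# Simplified hierarchical decoder block: two stacked decoder sub-blocks, one
# conditioned on the current-image states H_IC and one on the longitudinal
# states H_L, fused with a hyper-parameter beta (0.2 in the paper).
import torch.nn as nn

class HierarchicalDecoderBlock(nn.Module):
    def __init__(self, d_model=512, nhead=8, beta=0.2):
        super().__init__()
        self.beta = beta
        self.sub_block_I = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.sub_block_L = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, H_O, H_IC, H_L):
        # sub-block-1: decode against the current-visit image states
        H_dec_I = self.sub_block_I(tgt=H_O, memory=H_IC)
        # fusion layer: H_dec = (1 - beta) * H_O + beta * H_dec_I
        H_dec = (1 - self.beta) * H_O + self.beta * H_dec_I
        # sub-block-2: decode against the longitudinal representation H_L
        H_dec_L = self.sub_block_L(tgt=H_dec, memory=H_L)
        # add the two sub-block outputs and apply layer normalization
        return self.norm(H_dec_I + H_dec_L)
```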

3. Experiments and Results

Baseline comparisons.

We compared our proposed method against prior image captioning and medical report generation works. The same Longitudinal-MIMIC dataset was used to train all baseline models: AoANet [18], CNN+Trans [16], Transformer [15], R2Gen [8], and R2CMN [9]. Implementation details of these methods are provided in the supplementary material.

Evaluation Metrics.

Conventional natural language generation (NLG) metrics, such as BLEU [19], METEOR [20], and Rouge-L [21], were used to evaluate the utility of our approach against the baseline methods. Similar to prior work [8,16], the CheXpert labeler [22] classified the predicted report for the presence of 14 disease conditions5 and compared them against the labels of the ground-truth report. Clinical efficacy (CE) metrics, namely accuracy, precision, recall, and F-1 score, were used to evaluate model performance.
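As a hedged illustration of how such scores could be computed, the sketch below evaluates BLEU-4 with NLTK and the clinical efficacy metrics with scikit-learn, assuming binary CheXpert labels have already been extracted for the predicted and ground-truth reports; the paper does not specify this exact tooling.

```python
# Illustrative scoring helpers, not the paper's evaluation script.
from nltk.translate.bleu_score import corpus_bleu
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def nlg_bleu4(references, hypotheses):
    # references / hypotheses: lists of token lists, one entry per report
    return corpus_bleu([[r] for r in references], hypotheses,
                       weights=(0.25, 0.25, 0.25, 0.25))

def clinical_efficacy(y_true, y_pred):
    # y_true / y_pred: flattened binary CheXpert labels over the 14 observations
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                  average="binary", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": p, "recall": r, "f1": f1}
```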

Results.

Table 2 shows the summary of the NLG metrics and CE metrics for the 14 disease observations for our proposed approach when compared against prior baseline approaches. In particular, our model achieves the best performance over previous baselines across all NLG and CE metrics.

Table 2:

Results of the NLG metrics (BLEU (BL), METEOR (M), Rouge-L (R-L)) and clinical efficacy (CE) metrics (accuracy (A), precision (P), recall (R), and F-1 score) on the Longitudinal-MIMIC dataset. Best results are highlighted in bold.

Method          NLG metrics                                   CE metrics
                BL-1   BL-2   BL-3   BL-4   M      R-L        A      P      R      F-1
AoANet          0.272  0.168  0.112  0.080  0.115  0.249      0.798  0.437  0.249  0.317
CNN+Trans       0.299  0.186  0.124  0.088  0.120  0.263      0.799  0.445  0.258  0.326
Transformer     0.294  0.178  0.119  0.085  0.123  0.256      0.811  0.500  0.320  0.390
R2Gen           0.302  0.183  0.122  0.087  0.124  0.259      0.812  0.500  0.305  0.379
R2CMN           0.305  0.184  0.122  0.085  0.126  0.265      0.817  0.521  0.396  0.449
Ours            0.343  0.210  0.140  0.099  0.137  0.271      0.823  0.538  0.434  0.480
Baseline        0.294  0.178  0.119  0.085  0.123  0.256      0.811  0.500  0.320  0.390
+ report        0.333  0.201  0.133  0.094  0.135  0.268      0.823  0.539  0.411  0.466
+ image         0.320  0.195  0.130  0.092  0.130  0.268      0.817  0.522  0.340  0.412
simple fusion   0.317  0.193  0.128  0.090  0.130  0.266      0.818  0.521  0.396  0.450

Generic image captioning approaches like AoANet yielded unsatisfactory performance on the Longitudinal-MIMIC dataset, as they failed to capture specific disease observations. Moreover, our approach outperforms the previous report generation methods R2Gen and R2CMN, which also use memory-based models, due to the added longitudinal context arising from the use of longitudinal multi-modal study data (CXR images + reports). In our results, the BLEU scores show a substantial improvement, particularly BLEU-4, where we achieve a 1.4% increase over the previous method R2CMN. BLEU scores measure how many contiguous sequences of words in the predicted reports appear in the references, while Rouge-L evaluates the fluency and sufficiency of the predicted reports. The highest Rouge-L score demonstrates the ability of our approach to generate accurate reports rather than meaningless word combinations. We also use METEOR for evaluation, which takes into account the precision, recall, and alignment of words and phrases between the generated reports and the ground truth. Our METEOR score shows a 1.1% improvement over the previous best method, further supporting the effectiveness of our approach. Meanwhile, our model exhibits a significant improvement in clinical efficacy metrics compared to the other baselines. Notably, F-1 is the most important metric, as it provides a balanced measure of both precision and recall; our approach outperforms the best-performing baseline by 3.1% in terms of F-1 score. These observations are particularly significant because higher NLG scores do not necessarily correspond to higher clinical scores [8], confirming the effectiveness of our proposed method.

Effect of Model Components.

We also studied the contribution of different model components, with results detailed in Table 2. The Baseline experiment refers to a basic Transformer model trained to generate a pre-filled report given only the current CXR image, without any additional longitudinal information. Its NLG and CE metrics are poor compared to our proposed approach. We also analyzed the contributions of the previous CXR image (+ image) and the previous visit report (+ report) when added to the model separately; these two experiments included memory-enhanced conditional normalization. We observed that each added feature enhanced the pre-filled report quality compared to the baseline, but the previous visit report had a higher impact than the previous CXR image. We hypothesize that the previous visit reports contain more text that can be directly transferred to the current visit reports. In the simple fusion experiment, we removed the cross-attention module and concatenated the encoded embeddings of the previous CXR image and previous visit report into one longitudinal embedding, while retaining the rest of the model. We saw a performance drop compared to our full approach, and also noticed that the results were worse than using the images or reports alone. These experiments demonstrate the utility of the cross-attention module in our proposed work.

4. Discussion and Conclusion

Case Study.

We also ran a qualitative evaluation of our proposed approach on two cases, as seen in Fig. 2. In these cases, we compare our generated report with the report generated by R2Gen. In the first case, certain words highlighted in purple in the predicted current visit report, such as “status post”, “aortic valve”, and “cardiac silhouette”, also appear in the previous visit report. The CheXpert-classified “Labels” also show that the pre-filled “findings” section generated by our approach is highly consistent with the ground-truth report, in contrast to the R2Gen model. For example, “cardiac silhouette enlarged” was not generated by the R2Gen model, but our prediction contains it and is consistent with the word “cardiomegaly” in the ground-truth report. In the second case, our generated report is also superior: not only does it contain more of the same content as the ground truth, but the positive diagnosis labels classified by CheXpert in our report are completely consistent with those in the ground truth. We provide more cases in the supplementary material.

Fig. 2:

Two examples of pre-filled “findings” sections of reports. Gray highlighted text indicates the same words or words with similar meaning that appear in the current reports and other reports. Purple highlighted text represents similar words in the current visit report generated by our approach, previous visit reports, and groundtruth current visit report. The red highlighted text indicates similar words that only exist in the report generated by our approach and the current ground truth report. R2Gen was the baseline method that generated the report. The “Labels” array shows the CheXpert classification of 14 disease observations (see text for details) as positive (1), negative (−1), uncertain (0) or unmentioned (×).

Error Analysis.

To analyze errors from our model, we examined generated reports alongside their ground truths and longitudinal information. We found that the label accuracy of the observations in the generated reports is greatly affected by the previous visit information. For example, over time, the label for the same observation “pneumothorax” can change from “positive” to “negative”, and such changing examples are more difficult to generate accurately. According to our statistics, when the labels of the current and previous reports are the same, 88.96% of the generated results match them; however, when the same observation is mentioned but the labels of the current and previous reports differ, there is an 84.42% probability that the generated result is incorrect. Thus, accurately tracking and generating the labels of such examples is possible future work to improve the generated radiology reports. One possible way to address this issue is to use active learning [23] or curriculum learning [24] methods to differentiate the different types of samples and better train the models.
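A small illustrative helper for this kind of agreement analysis is sketched below: it categorizes each observation by whether its label changed between visits and whether the generated label matches the current ground truth. The function and label encoding are assumptions for illustration, not the authors' analysis script.

```python
# Categorize a single observation for the error analysis described above.
# Labels are assumed to come from the CheXpert labeler for the previous report,
# the current ground-truth report, and the generated report.
def label_agreement(prev_label, curr_label, gen_label):
    """Return one of: 'stable-correct', 'stable-wrong', 'changed-correct', 'changed-wrong'."""
    changed = prev_label != curr_label        # did the label flip between visits?
    correct = gen_label == curr_label         # does the generated label match the ground truth?
    return ("changed-" if changed else "stable-") + ("correct" if correct else "wrong")
```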

Conclusion.

In this paper, we proposed to pre-fill the “findings” section of chest X-ray radiology reports by considering the longitudinal multi-modal (CXR images + reports) information available in the MIMIC-CXR dataset. We gathered 26,625 patients with multiple visits to constitute the new Longitudinal-MIMIC dataset, and proposed a model that fuses encoded embeddings of multi-modal data with a hierarchical memory-driven decoder. The model generated a pre-filled “findings” section of the report, and we evaluated the generated results against prior image captioning and medical report generation works. Our model yielded a ≥3% improvement in the clinical efficacy F-1 score on the Longitudinal-MIMIC dataset. Moreover, experiments that evaluated the utility of the different components of our model confirmed their effectiveness for the task of pre-filling the “findings” section of the report.

Acknowledgements

This research was supported by the Intramural Research Program of the National Library of Medicine and Clinical Center at the NIH. The authors thank Qingyu Chen and Xiuying Chen for their time and effort in providing thoughtful comments and suggestions to revise this paper. This work was also supported by the National Institutes of Health under Award No. 4R00LM013001 (Peng), NSF CAREER Award No. 2145640 (Peng), and an Amazon Research Award (Peng).

Footnotes

5. No Finding, Enlarged Cardiomediastinum, Cardiomegaly, Lung Lesion, Airspace Opacity, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, and Support Devices.

References

1. Smith John J. and Berlin Leonard. Signing a colleague’s radiology report. American Journal of Roentgenology, 176(1):27–30, 2001.
2. Ringler Michael D, Goss Brian C, and Bartholmai Brian J. Syntactic and semantic errors in radiology reports associated with speech recognition software. Health Informatics Journal, 23(1):3–13, 2017.
3. Shin Hoo-Chang, Roberts Kirk, Lu Le, Demner-Fushman Dina, Yao Jianhua, and Summers Ronald M. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2497–2506, 2016.
4. Jing Baoyu, Xie Pengtao, and Xing Eric. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2577–2586, 2018.
5. Li Yuan, Liang Xiaodan, Hu Zhiting, and Xing Eric P. Hybrid retrieval-generation reinforced agent for medical image report generation. Advances in Neural Information Processing Systems, 31, 2018.
6. Wang Xiaosong, Peng Yifan, Lu Le, Lu Zhiyong, and Summers Ronald M. TieNet: Text-image embedding network for common thorax disease classification and reporting in chest x-rays. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9049–9058, 2018.
7. Jing Baoyu, Wang Zeya, and Xing Eric. Show, describe and conclude: On exploiting the structure information of chest x-ray reports. arXiv preprint arXiv:2004.12274, 2020.
8. Chen Zhihong, Song Yan, Chang Tsung-Hui, and Wan Xiang. Generating radiology reports via memory-driven transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1439–1449, 2020.
9. Chen Zhihong, Shen Yaling, Song Yan, and Wan Xiang. Cross-modal memory networks for radiology report generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5904–5914, 2021.
10. Wang Jun, Bhalerao Abhir, and He Yulan. Cross-modal prototype driven network for radiology report generation. In European Conference on Computer Vision, pages 563–579. Springer, 2022.
11. Liu Fenglin, Wu Xian, Ge Shen, Fan Wei, and Zou Yuexian. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13753–13762, 2021.
12. Xue Yuan, Xu Tao, Long L. Rodney, Xue Zhiyun, Antani Sameer Kiran, Thoma George R., and Huang Xiaolei. Multimodal recurrent model with attention for automated radiology report generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018.
13. Johnson Alistair EW, Pollard Tom J, Greenbaum Nathaniel R, Lungren Matthew P, Deng Chih-ying, Peng Yifan, Lu Zhiyong, Mark Roger G, Berkowitz Seth J, and Horng Steven. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019.
14. Simonyan Karen and Zisserman Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
15. Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Lukasz, and Polosukhin Illia. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
16. Moon Jong Hak, Lee Hyungyung, Shin Woncheol, and Choi Edward. Multi-modal understanding and generation for medical images and text via vision-language pretraining. arXiv preprint arXiv:2105.11333, 2021.
17. Nagrani Arsha, Yang Shan, Arnab Anurag, Jansen Aren, Schmid Cordelia, and Sun Chen. Attention bottlenecks for multimodal fusion. Advances in Neural Information Processing Systems, 34:14200–14213, 2021.
18. Huang Lun, Wang Wenmin, Chen Jie, and Wei Xiao-Yong. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4634–4643, 2019.
19. Papineni Kishore, Roukos Salim, Ward Todd, and Zhu Wei-Jing. BLEU: a method for automatic evaluation of machine translation. In ACL 2002, pages 311–318.
20. Denkowski Michael and Lavie Alon. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, 2014.
21. Lin Chin-Yew. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
22. Irvin Jeremy, Rajpurkar Pranav, Ko Michael, Yu Yifan, Ciurea-Ilcus Silviana, Chute Chris, Marklund Henrik, Haghgoo Behzad, Ball Robyn, Shpanskaya Katie, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 590–597, 2019.
23. Settles Burr. Active Learning, volume 6 of Synthesis Lectures on Artificial Intelligence and Machine Learning, pages 1–114, 2012.
24. Bengio Yoshua, Louradour Jérôme, Collobert Ronan, and Weston Jason. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
