Author manuscript; available in PMC: 2025 Apr 14.
Published in final edited form as: Proc IEEE Southwest Symp Image Anal Interpret. 2024 Apr 29;2024:21–24. doi: 10.1109/ssiai59505.2024.10508671

COVID-19 PNEUMONIA CHEST X-RAY PATTERN SYNTHESIS BY STABLE DIFFUSION

Zhaohui Liang 1, Zhiyun Xue 1, Sivaramakrishnan Rajaraman 1, Sameer Antani 1
PMCID: PMC11995846  NIHMSID: NIHMS2068900  PMID: 40231012

Abstract

In this study, we fine-tuned a stable diffusion model to synthesize high-resolution (512×512) chest X-ray images with bilateral lung edema caused by COVID-19 pneumonia, using a class-specific prior preservation strategy. 300 positive images were selected from the MIDRC dataset as subject instances, with an additional 400 negative images for class prior preservation. For comparison, we synthesized images using both the new technique and conventional techniques. The synthetic images from the stable diffusion model fine-tuned with the prior preservation technique achieve a Fréchet inception distance (FID) of 9.2158 and a kernel inception distance (KID) of 0.0818 computed against the real positive images, superior to synthetic images from conventional methods such as WGAN and DDIM. When the synthetic positive images and real negative images were classified by a trained vision transformer (ViT), the classification accuracy was 0.9975 with a precision of 1.0 and a recall of 0.9950. We conclude that the stable diffusion model can synthesize high-quality, high-resolution chest X-ray images using the prior preservation strategy, with a small number of real images as subject instances and a text prompt as guidance for the designated patterns.

Index Terms—: stable diffusion, prior preservation, chest x-ray, image synthesis, latent diffusion model

1. INTRODUCTION

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which emerged in late 2019, has caused massive loss of life and economic disruption globally. The World Health Organization (WHO) Coronavirus (COVID-19) Dashboard shows that the global death toll is approaching 7 million, with over 771 million confirmed cases [1]. At the same time, the COVID-19 pandemic expedited the innovation and application of new technologies for the detection, diagnosis, evaluation, and prognosis of COVID-19 disease. For example, early clinical observation of COVID-19 in Wuhan, China reported that unique bilateral or unilateral multiple mottling and ground-glass opacity patterns were identified on chest X-ray (CXR) and CT images [2]. Based on this finding, a variety of machine learning methods, particularly deep learning, were proposed to detect such visual patterns on the images. However, these new methods face multiple challenges regarding accuracy, reliability, and robustness in field use. A review of deep learning for COVID-19 image pattern detection pointed out that the main challenges include the lack of image quality assurance, imbalance and limited diversity in the image datasets, and the generalization and reproducibility of the available models and algorithms [3].

Generative models such as the generative adversarial network (GAN) and the diffusion model provide an effective way to enrich data diversity, which could help remedy some of the above challenges. Stable diffusion is a latent diffusion model (LDM) proposed in 2022 [4]. Instead of sampling from a pure Gaussian distribution, the LDM takes random noise from a latent space encoded by pretrained embedding models. Therefore, the image generation runtime of the LDM is significantly lower than that of pure diffusion models. In addition, a technique named DreamBooth was introduced to fine-tune the stable diffusion model to generate high-quality, high-fidelity, and high-resolution synthetic images with desired patterns from a small dataset [5]; however, it was demonstrated only on non-medical images. Motivated by its success, in this study we aim to encode medical visual patterns into a pretrained LDM and bind them with the corresponding medical knowledge.

In this paper, we first explain the importance of detecting lung edema on CXR images of COVID-19 pneumonia. Next, we briefly present the LDM and our experiments to fine-tune a stable diffusion model. The synthetic images are compared with images generated by a Wasserstein GAN with gradient penalty (WGAN-GP) and a denoising diffusion implicit model (DDIM), respectively. Finally, we summarize our findings in the conclusion section.

2. PULMONARY EDEMA DETECTION

Pulmonary edema in pneumonia is a key indicator of acute respiratory distress syndrome (ARDS) [6], and is used to estimate the severity of SARS-CoV-2 infection and the disease prognosis. A cohort study found that the duration of mechanical ventilation was quantitatively related to the lung edema score [7]. The radiographic assessment of lung edema (RALE) scoring system is a quantitative tool to measure the alveolar opacities on chest radiographs that reflect the degree of pulmonary edema [8]. A multicenter study found that a progressive increase in RALE score was associated with higher mortality and longer ventilator use [9]. The mRALE is a modified RALE scoring system whose score range is simplified to 0 to 24, where mRALE = 0 means no pulmonary edema and mRALE = 24 indicates severe bilateral lung edema. It offers a handy tool to assess the severity of COVID-19 pneumonia [10]. If we can use a generative model to synthesize CXR images with text guidance controlling the presence of pulmonary edema, it could provide an effective way to enrich the diversity and robustness of current datasets for automated diagnosis and pattern detection for COVID-19 or other new or rare diseases.

3. DIFFUSION MODEL FOR IMAGE SYNTHESIS

Diffusion models outperform existing generative image models such as the VAE (variational autoencoder), GAN, and flow models [11]. The adversarial optimization at the core of GAN training makes it vulnerable to mode collapse and low diversity [12]. In contrast, diffusion models use a Markovian process to model image generation: random noise sampled from Gaussian distributions is added during the forward diffusion process, and a neural network then predicts and removes the noise to generate new images during the reverse diffusion process.

3.1. Denoising diffusion models

The general architecture of a denoising diffusion probabilistic model (DDPM) [13] is illustrated in Figure 1.

The diffusion process is divided into $T$ steps, where Gaussian noise $\epsilon \sim \mathcal{N}(0, 1)$ is incrementally added to the real image $x_0$, weighted by a random $\beta \in (0, 1)$ constrained by $\alpha^2 + \beta^2 = 1$. Here $\alpha$ is a scalar that weights the original pixel values in the image at any step $t = T, T-1, \ldots, 1$. Using the reparameterization trick, the pixel values at any time step $t$ can be presented as:

$$x_t = \sqrt{1 - a_t a_{t-1} \cdots a_1}\,\epsilon + \sqrt{a_t a_{t-1} \cdots a_1}\,x_0 = \sqrt{1 - \bar{\alpha}_t}\,\epsilon + \sqrt{\bar{\alpha}_t}\,x_0, \quad \text{where } \bar{\alpha}_t = \prod_{i=1}^{t} a_i \tag{1}$$

Then the forward diffusion process at step $t \in (1, T)$ can be modeled by a function $q(x_t \mid x_0)$ defined by:

$$q(x_t \mid x_0) = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) \tag{2}$$

In Eq. 2, the term $a_t = 1 - \beta_t$, where $\beta_t$ follows a linear schedule with values between $1 \times 10^{-4}$ and $2 \times 10^{-2}$ [13]. We can use this closed-form solution of $q(x_t \mid x_0)$ to add noise $\epsilon$ to the original image $x_0$ and attain the noised image $x_t$ at step $t$ in a single step.
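As a minimal illustration of Eqs. 1 and 2 (our sketch, not the authors' code), the closed-form forward diffusion can be written in a few lines of PyTorch; the linear β schedule follows [13]:

```python
import torch

# Linear beta schedule from Ho et al. [13]: 1e-4 to 2e-2 over T steps.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas                       # a_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product in Eq. 1

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Closed-form forward diffusion (Eq. 2): noise x0 to step t in one shot."""
    eps = torch.randn_like(x0)                   # epsilon ~ N(0, I)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)      # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

# Example: noise a batch of 512x512 grayscale CXR-sized images to random steps.
x0 = torch.rand(4, 1, 512, 512)
t = torch.randint(0, T, (4,))
xt = q_sample(x0, t)
```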

In the reverse diffusion process, we use a neural network to model the transition from the noised image at any step $t$, defined by:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big) \tag{3}$$

where $\theta$ represents the learnable parameters of the neural network. To improve training stability, the variance term $\Sigma_\theta(x_t, t)$ can be fixed to $\beta_t$ [13], the noise variance at step $t$, so the network only needs to predict the noise $\epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\, t\big)$. To optimize the model parameters $\theta$, we use the Kullback–Leibler (KL) divergence as the objective, where $p_\theta(x_{t-1} \mid x_t)$ is the predicted distribution and $q(x_t \mid x_{t-1})$ is the ground-truth distribution; this reduces to the simplified loss:

$$L_{\mathrm{DDPM}} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\, t\right) \right\|^2\right] \tag{4}$$
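A hedged sketch of the simplified objective in Eq. 4, assuming a noise-prediction network `eps_model` with signature `(x_t, t) -> predicted noise` (a hypothetical stand-in for the UNet; `alpha_bars` is the schedule defined above):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alpha_bars):
    """Simplified DDPM objective (Eq. 4): MSE between true and predicted noise."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # Eq. 2 reused
    eps_pred = eps_model(x_t, t)                          # epsilon_theta(x_t, t)
    return F.mse_loss(eps_pred, eps)
```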

The DDPM succeeds in synthesizing high-quality, high-resolution images, but it must follow the Markovian process through many reverse steps (800–1000) to synthesize an image. In addition, the DDPM uses latent variables with the same high dimensionality as the input images. Thus, the DDPM consumes substantial runtime and memory for both training and image generation, which has constrained its application to medical imaging.

3.2. Stable diffusion

The stable diffusion method is based on the latent diffusion model (LDM) architecture with multiple pretrained embedding models to reduce the dimensionality of the feature maps, resulting in significantly improved computational resource use during model training and inference [4].

There are three main components in the LDM architecture: a VAE, a text encoder $\tau_\theta$, and a UNet with multi-head cross-attention layers. The VAE encoder converts the input images into a lower-dimensional latent representation as input to the forward diffusion process, and the decoder converts the latent representations back into real images during the reverse diffusion process. The text encoder takes the text prompts as guidance for image synthesis and converts them into embedding vectors with a language model $\tau_\theta$. The encoded information is mapped into the UNet via the multi-head cross-attention layers. In addition, other spatially aligned inputs such as semantic maps, representations, and images can be concatenated into the embedding as extra conditioning. Finally, the inner diffusion model (UNet) takes the encoded feature representations and acts as a conditional image generator by augmenting the denoising process with cross-attention.

The LDM is optimized similarly to general diffusion models. The main difference is that the pixel inputs are converted into a latent input $z_t$ through the encoder of the VAE, replacing the noised image $x_t$, with the conditioning $\tau_\theta(y)$ supplied to the UNet for the diffusion and denoising process. The loss objective is correspondingly revised as:

$$L_{\mathrm{LDM}} = \mathbb{E}_{t, z_0, \epsilon, y}\left[\left\| \epsilon - \epsilon_\theta\big(z_t, t, \tau_\theta(y)\big) \right\|^2\right] \tag{5}$$

Benefiting from the low dimensionality of the latent space, the denoising process of the LDM is significantly faster than that of pure diffusion models.
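To make Eq. 5 concrete, the following sketch wires up the three LDM components with Hugging Face diffusers. This is an illustration under assumed preprocessing, not the authors' implementation; the checkpoint name is the public Stable Diffusion 2.1 release.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "stabilityai/stable-diffusion-2-1"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

def ldm_loss(images, prompts):
    """One LDM training step (Eq. 5): diffuse in latent space, condition on text."""
    # VAE encoder: pixels -> lower-dimensional latents z0.
    z0 = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(z0)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (z0.shape[0],))
    zt = scheduler.add_noise(z0, noise, t)  # forward diffusion on latents
    # Text encoder tau_theta: prompt y -> embeddings for cross-attention.
    tokens = tokenizer(prompts, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    cond = text_encoder(tokens.input_ids)[0]
    # UNet predicts the noise conditioned on the text embeddings.
    eps_pred = unet(zt, t, encoder_hidden_states=cond).sample
    return F.mse_loss(eps_pred, noise)
```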

4. CONCEPT BINDING BY PRIOR PRESERVATION

4.1. Fine-tuning with class prior preservation

Another main challenge in using the stable diffusion model to synthesize CXR images with pulmonary edema patterns is the insufficient number of training examples. To address this issue, we used the prior preservation technique to fine-tune the stable diffusion 2.1 model pretrained on large image datasets including ImageNet-21K and MS-COCO [4]. The prior preservation technique was proposed as DreamBooth by Google Research in 2022 [5]. This fine-tuning process, illustrated in Figure 2, uses a class-specific prior preservation loss combined with the reconstruction loss as the loss objective, training the LDM to bind the concept of a new subject to the corresponding visual patterns via the text encoder and the VAE.

Figure 2. Fine-tuning with prior preservation

Accordingly, the loss objective is revised as

$$L = \mathbb{E}_{x, c, \epsilon, \epsilon', t}\left[ w_t \left\| \hat{x}_\theta(\alpha_t x + \sigma_t \epsilon, c) - x \right\|_2^2 + \lambda\, w_{t'} \left\| \hat{x}_\theta(\alpha_{t'} x_{\mathrm{pr}} + \sigma_{t'} \epsilon', c_{\mathrm{pr}}) - x_{\mathrm{pr}} \right\|_2^2 \right] \tag{6}$$

where the first term constrains the generator to synthesize images with the desired lung edema pattern via the subject identifier “bilateral lung edema mRALE 24” encoded by the text-prompt encoder. The second term serves as the prior preservation term, which keeps the generator from overfitting by using non-severe CXR images (mRALE < 18) bound to the encoded prior class identifier “chest x-ray”.
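A minimal sketch of the combined objective in Eq. 6 in its noise-prediction form, assuming a hypothetical helper `noise_pred(images, prompt)` that runs the latent UNet forward pass above and returns the predicted and true noise:

```python
import torch.nn.functional as F

def prior_preservation_loss(noise_pred, batch, lam=1.0):
    """DreamBooth-style loss (Eq. 6): subject reconstruction term plus a
    class-prior term that keeps the generic "chest x-ray" concept intact."""
    # Subject term: "bilateral lung edema mRALE 24" instances.
    eps_pred, eps = noise_pred(batch["instance_images"], batch["instance_prompt"])
    instance_loss = F.mse_loss(eps_pred, eps)
    # Prior term: non-severe "chest x-ray" class instances, weighted by lambda.
    eps_pred_pr, eps_pr = noise_pred(batch["class_images"], batch["class_prompt"])
    prior_loss = F.mse_loss(eps_pred_pr, eps_pr)
    return instance_loss + lam * prior_loss
```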

4.2. Experiments

We used the CXR image dataset from the 2023 MIDRC mRALE Mastermind Challenge [10], which contains 2,599 frontal CXR images annotated with mRALE scores by medical experts. Due to hardware limitations, we randomly selected 200 images with mRALE = 24 as severe lung edema subject instances and 400 images with mRALE < 18 as prior class instances.

The experiments were implemented in Amazon SageMaker on an ml.p3.8xlarge instance with four NVIDIA Tesla V100 GPUs and 64 GB of GPU RAM. The LDM was optimized by the Adam optimizer with an initial learning rate of 2 × 10⁻⁶ and a mini-batch size of 1 for 20 epochs.
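In outline, the reported configuration corresponds to a loop like the following (a sketch built on the loss above; `dataloader` and `noise_pred` are assumed, not described in the original paper):

```python
import torch

# Reported hyperparameters: Adam, lr = 2e-6, mini-batch size 1, 20 epochs.
optimizer = torch.optim.Adam(unet.parameters(), lr=2e-6)
for epoch in range(20):
    for batch in dataloader:  # assumed DataLoader yielding instance/class pairs
        loss = prior_preservation_loss(noise_pred, batch, lam=1.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```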

4.3. Assessment of synthetic image quality

The fine-tuned stable diffusion model can generate CXR images with severe bilateral edema patterns invoked by the controlled text prompt “bilateral lung edema mRALE 24 [pattern identifier] chest x-ray [class identifier]”, where the subject identifier “bilateral lung edema mRALE 24” controls the synthesis of the desired patterns and the prior class identifier “chest x-ray” improves image diversity and reduces overfitting. Synthesized image samples are shown in Figure 3, where the three CXR images labeled “clear” in the upper row are real non-severe CXR images, and the three labeled “edema” in the lower row are synthetic CXR images with severe lung edema.
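For illustration, sampling with the controlled prompt can be expressed with the diffusers text-to-image pipeline; the checkpoint path below is a placeholder for the fine-tuned model, not a published artifact:

```python
import torch
from diffusers import StableDiffusionPipeline

# "path/to/finetuned-sd2-1" is a placeholder for the fine-tuned weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/finetuned-sd2-1", torch_dtype=torch.float16).to("cuda")

# Subject identifier controls the pattern; class identifier preserves diversity.
prompt = "bilateral lung edema mRALE 24 chest x-ray"
image = pipe(prompt, height=512, width=512).images[0]
image.save("synthetic_cxr.png")
```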

Figure 3. Comparison of real (clear) and synthetic (edema) CXR images

For comparison, we trained a Wasserstein GAN with gradient penalty (WGAN-GP) and a denoising diffusion implicit model (DDIM) to synthesize CXR images with the lung edema pattern.

We computed the Fréchet inception distance (FID) and kernel inception distance (KID), the latter considered more reliable for small datasets [14], between the synthetic images and the real severe lung edema CXR images. These metrics quantify the fidelity and diversity of synthetic images; lower scores imply better image quality. The results are presented in Table I and indicate that the synthetic image quality of the stable diffusion model fine-tuned with prior preservation outperforms all the comparators.

Table I. Comparison of image quality (lower score is better)

Model | FID | KID
Stable Diffusion with prior preservation | 9.2158 | 0.0818
Stable Diffusion without prior preservation | 20.1367 | 0.1135
DDIM | 75.2278 | 0.2428
WGAN-GP | 77.3235 | 0.2254
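FID and KID can be computed with, for example, TorchMetrics; this sketch uses random placeholder tensors and is not the paper's evaluation code:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)  # small subsets for small datasets

# Placeholders for real and synthetic images: uint8 tensors (N, 3, H, W).
real_images = torch.randint(0, 255, (50, 3, 299, 299), dtype=torch.uint8)
synthetic_images = torch.randint(0, 255, (50, 3, 299, 299), dtype=torch.uint8)

for metric in (fid, kid):
    metric.update(real_images, real=True)
    metric.update(synthetic_images, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item())
```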

4.4. Classification accuracy

We also fine-tuned a pretrained vision transformer (ViT) on the whole MIDRC CXR dataset to verify that the images synthesized by the proposed stable diffusion model carry the desired visual patterns.

We used 200 synthetic images as positives and 200 non-severe real images as negatives for the classification test. The stable diffusion model fine-tuned with prior preservation yielded high accuracy (accuracy of 0.9975, precision of 1.0, and recall of 0.995). When the class prior preservation loss was removed, a large portion of the synthetic images (16.5%) were classified as negative, confirming that prior preservation achieves successful binding with the knowledge represented by the text.
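In outline, the classification check can be run with a fine-tuned ViT from Hugging Face transformers; the checkpoint path and label mapping are assumptions:

```python
import torch
from transformers import ViTForImageClassification, ViTImageProcessor

# Placeholder for a binary ViT classifier fine-tuned on the MIDRC CXR dataset.
ckpt = "path/to/vit-midrc-edema"
model = ViTForImageClassification.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)

@torch.no_grad()
def classify(pil_images):
    """Return predicted labels (assumed: 0 = non-severe, 1 = severe edema)."""
    inputs = processor(images=pil_images, return_tensors="pt")
    logits = model(**inputs).logits
    return logits.argmax(dim=-1)
```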

5. CONCLUSION

We present a new method to fine-tune the state-of-the-art stable diffusion model to synthesize CXR images with pulmonary edema patterns using a medical-knowledge-driven text prompt. We show that the class prior preservation technique can effectively control overfitting during training and enrich image diversity. It also helps synthesize high-quality, high-resolution medical images with rare pathological patterns, since normal medical images are more accessible than images with disease.

Compared to the DreamBooth fine-tuning method proposed by Google Research, our method uses strictly defined medical terms that are mapped to the targeted visual patterns on the CXR images. In our experiments, if the prior class identifier “chest x-ray” was removed, the synthetic images looked less authentic, as confirmed by the comparison in Table I. Similarly, if the subject identifier “bilateral lung edema mRALE 24” was omitted from the input text prompt, the synthetic images looked obviously artificial. These observations confirm that well-defined keyword control is an effective way to exert control not only over the semantic meaning of the input text prompt, but also over the visual patterns generated by the fine-tuned stable diffusion model. This finding suggests a new approach for using stable diffusion models to synthesize complex visual patterns in medical images.

Figure 1. General architecture of diffusion model

ACKNOWLEDGMENTS

This research was supported by the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health (NIH).

REFERENCES

[1] WHO, WHO Coronavirus (COVID-19) Dashboard, https://covid19.who.int, last accessed 2023/11/10.
[2] Chen N, Zhou M, Dong X, et al., “Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: A descriptive study,” Lancet, 395, pp. 507–513, Feb. 2020.
[3] Aggarwal P, Mishra NK, Fatimah B, Singh P, Gupta A, and Joshi SD, “COVID-19 image classification using deep learning: Advances, challenges and opportunities,” Comput Biol Med, 144, pp. 105350, Mar. 2022.
[4] Rombach R, Blattmann A, Lorenz D, Esser P, and Ommer B, “High-resolution image synthesis with latent diffusion models,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022.
[5] Ruiz N, Li Y, Jampani V, Pritch Y, Rubinstein M, and Aberman K, “DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2023.
[6] Matthay MA, Ware LB, and Zimmerman GA, “The acute respiratory distress syndrome,” J Clin Invest, 122(8), pp. 2731–40, Aug. 2012.
[7] Taniguchi H, Ohya A, Yamagata H, Iwashita M, Abe T, and Takeuchi I, “Prolonged mechanical ventilation in patients with severe COVID-19 is associated with serial modified-lung ultrasound scores: A single-centre cohort study,” PLOS One, 17(7), pp. e0271391, Jul. 2022.
[8] Warren MA, Zhao Z, Koyama T, et al., “Severity scoring of lung oedema on the chest radiograph is associated with clinical outcomes in ARDS,” Thorax, 73(9), pp. 840–6, Jun. 2018.
[9] Valk CMA, Zimatore C, Mazzinari G, et al., “The prognostic capacity of the radiographic assessment for lung edema score in patients with COVID-19 acute respiratory distress syndrome: An international multicenter observational study,” Front Med (Lausanne), 8, pp. 772056, Jan. 2022.
[10] MIDRC, MIDRC mRALE Mastermind Challenge: AI to predict COVID severity on chest radiographs, https://www.midrc.org/mrale-mastermind-2023, last accessed 2023/11/10.
[11] Papamakarios G, Nalisnick E, Rezende DJ, Mohamed S, and Lakshminarayanan B, “Normalizing flows for probabilistic modeling and inference,” Journal of Machine Learning Research, 22(1), pp. 2617–80, Jan. 2021.
[12] Dhariwal P and Nichol A, “Diffusion models beat GANs on image synthesis,” Advances in Neural Information Processing Systems, 34, pp. 8780–94, Dec. 2021.
[13] Ho J, Jain A, and Abbeel P, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, 33, pp. 6840–51, 2020.
[14] Borji A, “Pros and cons of GAN evaluation measures,” Computer Vision and Image Understanding, 179, pp. 41–65, 2019.
