BMC Medical Imaging. 2025 Sep 26;25:372. doi: 10.1186/s12880-025-01909-5

MedIENet: medical image enhancement network based on conditional latent diffusion model

Weizhen Yuan 1, Yue Feng 1,, Tiancai Wen 2,3, Guancong Luo 1, Jiexin Liang 1, Qianshuai Sun 1, Shufen Liang 1,
PMCID: PMC12465763  PMID: 41013286

Abstract

Background

Deep learning necessitates a substantial amount of data, yet obtaining sufficient medical images is difficult due to concerns about patient privacy and high collection costs.

Methods

To address this issue, we propose a medical image enhancement network based on a conditional latent diffusion model, referred to as MedIENet. To meet the rigorous standards required for image generation in the medical imaging field, a multi-attention module is incorporated into the encoder of the denoising U-Net backbone. Additionally, Rotary Position Embedding (RoPE) is integrated into the self-attention module to effectively capture positional information, while cross-attention is utilised to integrate class information into the diffusion process.

Results

MedIENet is evaluated on three datasets: Chest CT-Scan images, Chest X-Ray Images (Pneumonia), and Tongue dataset. Compared to existing methods, MedIENet demonstrates superior performance in both fidelity and diversity of the generated images. Experimental results indicate that for downstream classification tasks using ResNet50, the Area Under the Receiver Operating Characteristic curve (AUROC) achieved with real data alone is 0.76 for the Chest CT-Scan images dataset, 0.87 for the Chest X-Ray Images (Pneumonia) dataset, and 0.78 for the Tongue Dataset. When using mixed data consisting of real data and generated data, the AUROC improves to 0.82, 0.94, and 0.82, respectively, reflecting increases of approximately 6%, 7%, and 4%.

Conclusion

These findings indicate that the images generated by MedIENet can enhance the performance of downstream classification tasks, providing an effective solution to the scarcity of medical image training data.

Keywords: Medical image generation, Data enhancement, Diffusion model, Attention mechanism

Introduction

With the ongoing advancement of deep learning technology, the processing and analysis of medical images have become increasingly widespread. Due to its minimal dependency on specific tasks and high generalizability, deep learning is extensively used in medical image segmentation, classification, and generation, significantly supporting clinical decision-making and treatment planning [1]. However, deep learning still relies on large and representative datasets for training, testing, and validation. In medical imaging, obtaining sufficient datasets is more challenging than for natural images, due to factors such as imaging costs, annotation costs, and patient privacy, which complicate network training. Additionally, for some diseases, the data collected for certain categories may be too small, leading to imbalanced data categories. This imbalance can result in insufficient prediction outcomes or introduce bias into the model, which affects the fairness and reliability of intelligent systems [2]. Therefore, the “data hunger” of deep learning is especially pronounced in the medical field, making data augmentation particularly important. Early approaches to mitigate this issue include traditional data augmentation methods applied to existing datasets, such as scaling, rotation, and affine transformations. While these methods can increase the number of samples, the newly generated samples are often too similar to the original ones, offering only limited intrinsic diversity [3].

In recent years, to tackle the challenges posed by insufficient medical image data, Generative Adversarial Networks (GANs) have been widely applied in the medical field [4]. GANs have achieved significant success across various tasks. For example, Salehinejad et al. [5] utilized a Deep Convolutional GAN (DCGAN) to generate X-ray images, demonstrating that combining DCGAN-generated images with real ones can improve image detection and classification outcomes. In terms of medical image translation, Armanious et al. [6] proposed a GAN architecture for end-to-end image conversion, showing superior performance in tasks such as PET-CT conversion, MR motion artifact correction, and PET image denoising. Amirrajab et al. [7] introduced an image segmentation and generation framework based on Mask Conditional GAN, finding that even with considerably less training data, it is possible to generate highly realistic cardiovascular magnetic resonance (CMR) images with accurate and 3D-consistent anatomical structures. Despite their successes, GANs inevitably face challenges, such as training instability and mode collapse, due to inherent architectural limitations [8]. Training instability often arises when the generator and the discriminator fail to maintain an ideal balance during training, leading to inconsistencies in the quality of the generated images. Additionally, mode collapse refers to the generator’s tendency to produce a narrow set of similar images, neglecting the diversity within the dataset. These challenges hinder the effectiveness and broader adoption of GANs in practical applications.

Recently, Denoising Diffusion Probabilistic Models (DDPM) [9] and Latent Diffusion Models (LDM) [10] have outperformed GANs in natural images [11], and researchers have started exploring the application of diffusion models in medical imaging. However, since DDPM operates in pixel space, the model training and inference processes are computationally expensive. Müller-Franzes et al. [12] introduced a novel latent denoising diffusion probabilistic model, medfusion, and compared it with GANs across various modalities of medical image datasets. Their findings revealed that medfusion outperformed GANs in both the diversity and fidelity of generated images. However, this study did not compare medfusion with other diffusion models. Moghadam et al. [13] introduced a diffusion probabilistic model to generate high-quality histopathology images of brain cancer. While their approach produced impressive results, it was limited to histopathology images, reducing its generalizability. To address the aforementioned challenges, this paper proposes a Medical Image Enhancement Network (MedIENet) based on a conditional latent diffusion generative model. The main contributions of this paper are summarized as follows:

  1. To address the computational challenges of training diffusion models for image generation, this paper adopts a conditional latent diffusion model to build a medical image data augmentation network for class-conditional image generation. Furthermore, Perceptual Prioritized Weighting (P2W) is incorporated during training to enable the learning of richer visual concepts within fewer diffusion steps. With this model, the network captures data distributions from real data and produces high-quality medical image data, effectively increasing the number of training samples.

  2. In the denoising U-Net, a multi-attention module is introduced, leveraging channel attention, spatial attention, self-attention, and cross-attention to enhance the network’s feature learning capabilities. Additionally, Rotary Position Embedding (RoPE) is integrated into the self-attention component to facilitate the learning of positional encodings.

  3. To validate the effectiveness of the proposed model in enhancing classification tasks, we used mixed data composed of network-generated images and real images on ResNet50. The results show that this approach improves classification performance, effectively meeting the classification network’s demand for high-quality medical image data.

The remainder of this paper is organized as follows: Section 2 reviews related work on generative networks in the field of medical image generation. Section 3 describes our methodology in detail. Section 4 presents the experimental results and analysis. Additionally, Sect. 5 highlights the limitations of the work and discusses future directions. Finally, Sect. 6 provides a summary of the entire paper.

Related work

Data augmentation based on generative models

In the field of medical imaging, it is well recognised that data is often limited, and the performance of deep learning models heavily depends on the size of the training dataset. In recent years, with advancements in generative models, researchers have explored their use to generate additional training samples. By training on real-world data, generative models can produce images that reflect real-world variations and offer potentially unlimited image samples. The most commonly used generative networks include GANs and diffusion models.

GANs generate images through adversarial learning between a generator and a discriminator. However, during training, GANs face challenges such as instability and low image quality. In the medical field, various improved GAN variants have been proposed to address these issues. For instance, Javaid et al. [14] proposed a method based on Deep Convolutional GAN (DCGAN) to generate CT images, while Woodland et al. [15] employed StyleGAN-ADA for high-resolution medical image synthesis, demonstrating its effectiveness. For image classification tasks, Conditional GANs (CGANs) leverage data labels to generate images corresponding to specific categories. For example, Pang et al. [16] proposed a novel CGAN to generate images that enhance breast ultrasound images and used CNNs to classify breast lesions. To tackle class imbalance, Ding et al. [17] proposed a novel GAN that simultaneously focuses on intra-class and inter-class sample generation. Additionally, GANs have been applied to enhance minority class images through image-to-image translation. Shin et al. [18] used Pix2pixGAN [19] to transform normal MRI images into abnormal brain tumor MRI images, increasing data volume to improve segmentation performance. However, Pix2pixGAN requires paired data for training, which is often difficult to obtain in real life. In contrast, CycleGAN [20] only requires data from two domains without strict correspondence. Muramatsu et al. [21] explored generating breast tumor images from lung nodule CT images using CycleGAN under limited sample conditions, allowing the sharing of lesion samples across organs. The generated images were used to train a CNN for classifying breast masses in mammograms. Lopes et al. [22] investigated the use of CycleGAN to convert [11C]2β-carbomethoxy-3β-(4-fluorophenyl)tropane ([11C]CFT) PET images into [123I]2β-carbomethoxy-3β-(4-iodophenyl)-N-(3-fluoropropyl)nortropane ([123I]FP-CIT) SPECT images, in situations where there is insufficient SPECT image data but abundant PET image data. This approach assists doctors in diagnosing Parkinson’s disease.

Recently, diffusion models have gained significant attention due to their training stability, high quality of generated samples, and diversity. Sampling algorithms such as classifier guidance [11] and classifier-free guidance [23] have facilitated conditional generation in diffusion models. DDPM [9] achieved remarkable success on natural images by making training more stable through gradual noise addition and removal. In medical imaging, research into diffusion models has been increasing, showcasing their potential in this field. Image generation is one of the main objectives of diffusion models, which have been widely applied to generating various image types, including 2D and 3D images. Pinaya et al. [24] and Dorjsembe et al. [25] explored using latent DDPMs to generate high-quality 3D brain MRI images. Packhäuser et al. [26] employed latent diffusion models to generate high-quality chest X-ray datasets and evaluated the quality of the generated images and their feasibility as training data. Their work achieved competitive results in terms of the area under the receiver operating characteristic curve (AUROC) compared to classifiers trained on real data. Furthermore, Akrout et al. [27] demonstrated fine-grained control of the image generation process using text prompts, showing that diffusion models can effectively generate high-quality skin images, which in turn improved the accuracy of skin classifiers. Similarly, Wang et al. [28] proposed MINIM, a unified medical image–text generative model capable of synthesizing multi-organ, multi-modality medical images from textual instructions, achieving high clinician-rated quality and boosting performance across diverse medical tasks. In the field of image-to-image translation, diffusion models have primarily focused on cross-modal translation tasks. For example, [29] used conditional DDPM and conditional score-matching diffusion models [30] to convert between CT and MRI modalities. Their results outperformed GAN-based [31] and CNN-based methods [32]. To address missing modalities, Meng et al. [33] proposed a unified multimodal conditional score generation method (UMM-CSGM), where the available modality is used as a condition to generate the missing modality. Experimental results showed that it can generate missing modality images with higher fidelity compared to state-of-the-art methods. Friedrich et al. [34] proposed a simple and effective medical image synthesis framework (WDM), which applies a diffusion model on wavelet decomposition to synthesize 3D medical images.

Attention mechanism

In the field of computer vision, early attention mechanisms include channel attention and spatial attention mechanisms [35], which are commonly used for image classification and segmentation tasks. Channel attention assigns different weights to the various channels of an image, while spatial attention assigns different weights across the spatial dimensions of the image. Self-attention calculates weights using query and key matrices, normalizes these weights, and then performs a weighted sum with the corresponding values to obtain attention. Channel and spatial attention focus more on key features, while self-attention is mainly used to establish long-range dependencies between features. Some studies combine self-attention with channel and spatial attention mechanisms to achieve their objectives [36]. Additionally, cross-attention exchanges information across different feature sets by computing query and key matrices between them, capturing multidimensional correlations.

Both DDPM and IDDPM [11] applied the self-attention mechanism to the backbone U-Net. Rombach et al. [10] proposed LDM, which enhanced the backbone U-Net by incorporating a cross-attention mechanism, allowing multiple modalities to serve as conditions for conditional generation. Moreover, the application of Transformer architectures, which are based on self-attention, has also garnered interest in diffusion models. For instance, DiT [37] is a novel Transformer-based diffusion model that replaces the commonly used U-Net backbone with a Vision Transformer (ViT) in a latent diffusion model. Their proposed method achieved state-of-the-art Fréchet Inception Distance (FID) scores on the class-conditional ImageNet 256 × 256 benchmark. Concurrently, U-ViT [38] also introduced the idea of using ViT instead of U-Net, and incorporated long skip connections. In unconditional and class-conditional image generation tasks as well as text-to-image generation tasks, U-ViT performed comparably or better than U-Net.

Method

MedIENet comprises two components, as illustrated in Fig. 1. One part is the Variational Autoencoder (VAE), comprising an encoder and a decoder. The VAE compresses high-dimensional image data into a low-dimensional latent space and then reconstructs the high-dimensional image data from this latent space. The other component is a multi-attention diffusion model, featuring both a diffusion process and a reverse diffusion process. These processes take place in the latent space. Unlike DDPM, which operates in pixel space, conducting these operations in the latent space effectively reduces computational complexity and improves the model’s training efficiency. Additionally, the representations in the latent space are more compact and abstract, enabling the model to better capture the key features and structures of the data, thereby enhancing the quality of the generated images.

Fig. 1.

Fig. 1

An overview diagram of the proposed MedIENet framework, where the VAE transforms image data from pixel space into latent space, and a diffusion model generates data within this latent space. The latent space is split into two parts: the upper part represents the diffusion process, while the lower part represents the reverse diffusion process. In this framework, class represents the class label

Variational autoencoder (VAE)

A VAE consists of two components: an Encoder and a Decoder. The Encoder maps the input image from pixel space to a latent space, which typically has a lower dimensionality. Specifically, the Encoder encodes the input data distribution into a probability distribution of latent variables, usually a Gaussian distribution. A latent variable is then sampled from this latent distribution, and the Decoder maps this latent variable back to pixel space to reconstruct the original image. After passing through the Encoder, the original 64 × 64 image is transformed into a 16 × 16 latent vector. Common metrics used to evaluate the quality of reconstructed images against a real reference include Mean Squared Error (MSE), Structural Similarity Index Measure (SSIM) [39], and Learned Perceptual Image Patch Similarity (LPIPS) [40].

MSE is a commonly used loss function that measures the average squared difference between predicted and true values. The specific expression is:

$$\mathrm{MSE}=\frac{1}{m}\sum_{i=1}^{m}\left(x_i-y_i\right)^2 \quad (1)$$

where m represents the number of samples, $x_i$ is the true value of the i-th sample (taken from the original image), and $y_i$ is the value predicted by the model for the i-th sample (taken from the reconstructed image).

The Structural Similarity Index Measure (SSIM) is a metric used to quantify the structural similarity between two images. Unlike MSE, SSIM is designed based on the Human Visual System (HVS), making it more sensitive to local variations in the image and better suited for assessing structural similarity. The SSIM value ranges from −1 to 1, where a value closer to 1 indicates higher similarity between the two images, while a value closer to −1 signifies lower similarity. SSIM evaluates three key image properties: luminance, contrast, and structure. The formulas for calculating these properties between two images x and y are as follows:

The luminance is:

$$l(x,y)=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1} \quad (2)$$

The contrast is:

$$c(x,y)=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2} \quad (3)$$

The structure is:

$$s(x,y)=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3} \quad (4)$$

Where $\mu_x$ and $\mu_y$ represent the mean values of x and y, respectively; $\sigma_x^2$ and $\sigma_y^2$ represent the variances of x and y, respectively; and $\sigma_{xy}$ represents the covariance between x and y. The constants $C_1=(K_1L)^2$, $C_2=(K_2L)^2$, and $C_3$ (commonly set to $C_2/2$) are introduced to avoid division by zero, where $K_1$ and $K_2$ are typically set to 0.01 and 0.03, respectively, and L represents the dynamic range of the image pixel values.

Finally, the formula for calculating SSIM is shown in Eq. (5):

$$\mathrm{SSIM}(x,y)=\left[l(x,y)\right]^{\alpha}\left[c(x,y)\right]^{\beta}\left[s(x,y)\right]^{\gamma} \quad (5)$$

Where α, β, γ represent the relative importance of each metric.

LPIPS, also known as Perceptual Loss, is used to measure the difference between two images. Essentially, LPIPS calculates the distance between the feature maps of the reconstructed image and the original image using a pre-trained neural network, and applies learned parameters for weighting to better reflect human perceptual differences. A smaller LPIPS value indicates greater similarity between the images.

The formula for calculating LPIPS is shown in Eq. (6):

$$\mathrm{LPIPS}(x,y)=\sum_{l}\frac{1}{H_lW_l}\sum_{h,w}\left\|w_l\odot\left(\hat{x}_{hw}^{\,l}-\hat{y}_{hw}^{\,l}\right)\right\|_2^2 \quad (6)$$

Where x and y represent the original image and the reconstructed image, respectively. l denotes the layers of the deep neural network, and $H_l$ and $W_l$ represent the height and width of the feature maps at the l-th layer. $\hat{x}_{hw}^{\,l}$ and $\hat{y}_{hw}^{\,l}$ represent the normalized feature vectors of the two images at position $(h,w)$ on the l-th layer, and $w_l$ represents the weights of the l-th layer. $\odot$ denotes element-wise multiplication, and $\|\cdot\|_2^2$ denotes the squared Euclidean distance.

To ensure that the reconstructed image is as similar as possible to the original image and that the perceptual features in the latent space closely resemble those of the original image, the objective is achieved by minimizing the reconstruction loss function $L_{rec}$, which comprises three components: MSE, LPIPS, and SSIM. This is done by calculating the loss during forward propagation, using backpropagation to compute the gradients, and then adjusting the model parameters with an optimizer to gradually reduce the loss. This process ensures the reconstructed image closely matches the original image in terms of pixel values, perceptual features, and structural characteristics. The total loss function is shown in Eq. (7):

$$L_{rec}=A\cdot L_{\mathrm{MSE}}+B\cdot L_{\mathrm{LPIPS}}+C\cdot L_{\mathrm{SSIM}} \quad (7)$$

Where A, B, and C are the weight coefficients used to adjust the contribution of each component of the loss to the total loss.
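A minimal sketch of this combined reconstruction loss is given below, assuming the lpips package (VGG backbone) and torchmetrics for SSIM; the weights A, B, C and the use of 1 − SSIM as the structural term are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; pretrained perceptual-similarity network
from torchmetrics.functional import structural_similarity_index_measure as ssim

# Hypothetical weights; the paper does not report the values it uses.
A, B, C = 1.0, 0.1, 0.1
lpips_fn = lpips.LPIPS(net="vgg")  # any LPIPS backbone works

def reconstruction_loss(x_rec: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Eq. (7): weighted sum of MSE, LPIPS and an SSIM-based term.
    x and x_rec are batches in [-1, 1] with shape (N, 3, H, W); single-channel
    images should be repeated to three channels before the LPIPS term."""
    mse = F.mse_loss(x_rec, x)
    perceptual = lpips_fn(x_rec, x).mean()            # LPIPS expects inputs in [-1, 1]
    ssim_term = 1.0 - ssim(x_rec, x, data_range=2.0)  # turn similarity into a loss
    return A * mse + B * perceptual + C * ssim_term
```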

Multi-attention diffusion Model

The multi-attention diffusion model consists of two processes: the diffusion process and the reverse diffusion process. The diffusion process gradually adds Gaussian noise to the real data, eventually transforming it into pure noise samples. The reverse diffusion process uses a denoising model, such as a multi-attention denoising U-Net, to progressively remove the noise, gradually restoring the data from the pure noise samples.

Diffusion process

The diffusion process transforms a complex data distribution $q(x_0)$ into a simple, tractable distribution (approximately a standard Gaussian). During this process, data is progressively perturbed by predefined noise scales $\beta_1,\ldots,\beta_T$, indexed by time steps t. The perturbed data $x_t$ is sampled from $x_{t-1}$ through the diffusion process, with each step following a Gaussian transition process, as shown in Eq. (8).

$$q(x_t\mid x_{t-1})=\mathcal{N}\!\left(x_t;\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t\mathbf{I}\right) \quad (8)$$

The perturbed data $x_t$ can be directly obtained through reparameterization of $x_0$, as shown in Eq. (9):

$$x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon \quad (9)$$

Where $\alpha_t=1-\beta_t$, $\bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s$, and $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$.
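As a concrete illustration of Eq. (9), the sketch below samples a perturbed latent in closed form; the linear β schedule and T = 1000 are assumptions for the example, not settings taken from the paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Eq. (9): x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (N, C, H, W)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise
```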

Reverse diffusion process

The reverse diffusion process gradually denoises pure Gaussian noise into a clean image, i.e., from $x_T\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ to $x_0$. This process begins with $x_T$ and gradually denoises it into a clean image $x_0$ through a learned parametric model $p_\theta(x_{t-1}\mid x_t)$.

$$p_\theta(x_{0:T})=p(x_T)\prod_{t=1}^{T}p_\theta(x_{t-1}\mid x_t) \quad (10)$$
$$p(x_T)=\mathcal{N}(x_T;\mathbf{0},\mathbf{I}) \quad (11)$$
$$p_\theta(x_{t-1}\mid x_t)=\mathcal{N}\!\left(x_{t-1};\mu_\theta(x_t,t),\Sigma_\theta(x_t,t)\right) \quad (12)$$

This objective can be simplified to optimizing noise prediction. Furthermore, to learn a conditional diffusion model $p_\theta(x\mid c)$ (such as class-conditional image generation or text-to-image generation), extra conditions are incorporated into the noise prediction. In other words, a neural network $\epsilon_\theta$ (such as U-Net) is trained to predict the noise added to $x_0$ at a given time step t, and the loss function $L_{simple}$ of the U-Net is defined as shown in Eq. (13):

$$L_{simple}=\mathbb{E}_{x_0,\epsilon,t}\left[\left\|\epsilon-\epsilon_\theta(x_t,t,c)\right\|^2\right] \quad (13)$$

Where $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and c is the class condition.

The objective of the U-Net network is to learn the noise of images corrupted by different levels of noise. By incorporating Perceptual Prioritized Weighting (P2W) [41], the diffusion model’s convergence speed and performance can be improved by adjusting the weighting scheme of the objective function. The specific method is as follows:

$$\mathrm{SNR}(t)=\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t} \quad (14)$$

First, a signal-to-noise ratio $\mathrm{SNR}(t)$ is defined as shown in Eq. (14). $\mathrm{SNR}(t)$ is a monotonically decreasing function of t, and during the reverse diffusion process, based on the SNR value, the process can be divided into three stages. The early stage is the rough stage (when the SNR is low), where the image contains a lot of noise, and the model learns coarse features (such as global color structure). The middle stage is the content stage, where the model perceives rich content. The final stage is the cleanup stage; when the SNR is high, the image contains only a small amount of noise, and the model learns imperceptible details during this stage. Since medical images are content-sensitive and require more accurate feature learning, P2W assigns minimal weight to the unnecessary cleanup stage and larger weights to the other stages, encouraging the model to learn richer visual concepts. The weighting scheme is as follows:

$$\lambda_t=\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)} \quad (15)$$
$$\lambda_t'=\frac{\lambda_t}{\left(k+\mathrm{SNR}(t)\right)^{\gamma}} \quad (16)$$
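In practice the P2 factor is usually applied directly to the simple ε-prediction loss; the sketch below follows that common implementation rather than the paper's exact code, with illustrative γ and k values and with alpha_bars reused from the forward-process sketch above.

```python
def snr(t: torch.Tensor) -> torch.Tensor:
    """Eq. (14): signal-to-noise ratio at step t."""
    abar = alpha_bars[t]
    return abar / (1.0 - abar)

def p2_weighted_loss(eps_pred: torch.Tensor, eps: torch.Tensor, t: torch.Tensor,
                     gamma: float = 1.0, k: float = 1.0) -> torch.Tensor:
    """Weight each sample's noise-prediction error by 1 / (k + SNR(t))^gamma,
    down-weighting the high-SNR cleanup stage relative to L_simple."""
    weight = 1.0 / (k + snr(t)) ** gamma                      # (N,)
    per_sample = ((eps - eps_pred) ** 2).mean(dim=(1, 2, 3))  # (N,)
    return (weight * per_sample).mean()
```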

To achieve conditional guided generation, this paper adopts the Classifier-Free Guidance (CFG) technique. During training, CFG randomly drops the conditional information, enabling the model to learn both unconditional and conditional generation. During the generation process, CFG predicts the noise at each step using the following equation, as shown in Eq. (17):

$$\tilde{\epsilon}_\theta(x_t,t,c)=\epsilon_\theta(x_t,t)+\omega\left(\epsilon_\theta(x_t,t,c)-\epsilon_\theta(x_t,t)\right) \quad (17)$$

where $\epsilon_\theta(x_t,t,c)$ is the noise prediction for conditional generation, $\epsilon_\theta(x_t,t)$ is the noise prediction for unconditional generation, and ω is the balancing coefficient (guidance scale) between unconditional and conditional generation. Adjusting this coefficient allows control over the detail and fidelity of the generated results. It is worth noting that a higher guidance scale will make the generated results more aligned with the conditional information but may sacrifice some diversity; conversely, a lower guidance scale will increase the diversity of the generated results but may reduce their consistency with the conditional information.
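A minimal sketch of classifier-free guidance is shown below: during training, class embeddings are randomly replaced by a learned null embedding, and at sampling time the two predictions are blended with the guidance scale. The drop probability and the exact blending parameterization are assumptions consistent with Eq. (17), not details confirmed by the paper.

```python
import torch

def maybe_drop_labels(class_emb: torch.Tensor, null_emb: torch.Tensor,
                      p_uncond: float = 0.1) -> torch.Tensor:
    """Training: randomly replace class embeddings with the null embedding so the
    same network learns both conditional and unconditional noise prediction."""
    drop = torch.rand(class_emb.shape[0], device=class_emb.device) < p_uncond
    return torch.where(drop[:, None], null_emb.expand_as(class_emb), class_emb)

def guided_noise(model, x_t, t, class_emb, null_emb, guidance_scale: float = 2.0):
    """Sampling: Eq. (17) blend of conditional and unconditional predictions."""
    eps_cond = model(x_t, t, class_emb)
    eps_uncond = model(x_t, t, null_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```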

Finally, based on the above equations, the loss function $L$ of the multi-attention denoising U-Net module in the reverse diffusion process is defined as:

$$L=\mathbb{E}_{x_0,\epsilon,t}\left[\frac{\lambda_t}{\left(k+\mathrm{SNR}(t)\right)^{\gamma}}\left\|\epsilon-\epsilon_\theta(x_t,t,c)\right\|^2\right] \quad (18)$$

Where $\lambda_t$ is the DDPM weighting scheme, γ is a hyperparameter that controls the degree to which the weight is reduced during the cleanup stage, and k is another hyperparameter that prevents weight explosion at extremely low signal-to-noise ratios and determines the sharpness of the weighting scheme.

Multi-attention denoising U-Net module

The multi-attention denoising U-Net module is shown in Fig. 2. Due to its simplicity and superior performance, the U-Net structure has been widely used as the backbone network in various diffusion models. The U-Net takes an image as input, with the time step t and class condition c concatenated before being input into the U-Net to predict noise as output. Initially, the noisy data passes through a ConBlock module, which adjusts the number of channels, and is then processed by the encoder and decoder. The encoder is made up of Multi-attention Block and ResBlock modules, designed to improve the model’s focus on critical information during feature extraction. The decoder consists of ResBlock modules, which gradually restore the resolution of the data through successive upsampling operations. Finally, the output is processed through a Conv1D layer to return the data to its original number of channels. In the reverse diffusion process, the shared denoising U-Net is tasked with predicting the injected noise at any step. The time-varying nature introduced by the injected noise increases the training difficulty and instability of the denoising U-Net during the process. Additionally, unstable hidden features can destabilize the input of subsequent layers, significantly heightening the learning difficulty of these layers. Researchers have found in practice that multiplying the skip connections by a constant scaling factor can mitigate instability and somewhat ease the training difficulty [42], as sketched after Fig. 2.

Fig. 2.

Fig. 2

Multi-attention denoising U-Net
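As an illustration of this stabilisation trick, the sketch below scales an encoder skip connection before concatenation; the 1/√2 factor is a common choice in the literature and is assumed here rather than taken from the paper.

```python
import math
import torch

def scaled_skip_concat(decoder_feat: torch.Tensor, skip_feat: torch.Tensor) -> torch.Tensor:
    """Concatenate an encoder skip connection after scaling it by a constant
    (1/sqrt(2) assumed here) to keep activation magnitudes stable."""
    return torch.cat([decoder_feat, skip_feat / math.sqrt(2.0)], dim=1)
```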

In medical image processing, extracting subtle features presents a significant challenge. To address this, we propose a multi-attention module, as depicted in Fig. 3. The process begins with the data passing through a Group Normalization layer, which organises and normalizes the data in separate groups. This approach minimizes the adverse effects of small batch sizes during model training and enhances training efficiency. By reducing the scale differences in input features, this step ensures that subsequent layers can process the data more effectively. Following this, the data is fed into the Convolutional Block Attention Module (CBAM) [35], which integrates both channel attention and spatial attention mechanisms. The channel attention module selectively emphasizes feature channels with higher contributions by learning the importance weights of each channel. It achieves this by performing global average pooling and global max pooling operations, which are then followed by a shared multilayer perceptron (MLP) to generate the channel weights. The spatial attention module, on the other hand, focuses on emphasizing important spatial regions in the image by learning the significance of each spatial position. It does so by applying global average pooling and global max pooling across the channel dimension, followed by the use of a convolutional layer to generate spatial weights. A compact sketch of this module follows Fig. 3.

Fig. 3.

Fig. 3

Multi-attention module
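A minimal CBAM sketch consistent with the description above is given below; the reduction ratio of 16 and the 7×7 spatial kernel are the usual defaults from [35], assumed here rather than reported by the paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (shared MLP over avg/max-pooled descriptors) followed by
    spatial attention (conv over channel-wise avg/max maps), as in [35]."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                  # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))                   # global max pooling
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)    # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))           # spatial attention
```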

We incorporate Rotary Position Embedding (RoPE) into the self-attention module to improve its capacity to capture dependencies between different positions within an image. RoPE further enhances the self-attention mechanism by combining both absolute and relative position encodings. Absolute position encoding directly adds positional information to the input embeddings, while relative position encoding focuses on the relationships between the relative positions of elements within the image. RoPE merges the strengths of both approaches by using a rotation matrix to encode positional information.

For image data, RoPE uses a rotation matrix to transform absolute position encoding into relative position encoding. The process involves the following steps:

First, absolute position encoding is generated using sine and cosine functions. Given a pixel position $(p_x,p_y)$ and an embedding vector dimension index i, the formula for absolute position encoding is as follows:

$$PE(p_x,2i)=\sin\!\left(p_x/10000^{2i/d}\right) \quad (19)$$
$$PE(p_x,2i+1)=\cos\!\left(p_x/10000^{2i/d}\right) \quad (20)$$
$$PE(p_y,2i)=\sin\!\left(p_y/10000^{2i/d}\right) \quad (21)$$
$$PE(p_y,2i+1)=\cos\!\left(p_y/10000^{2i/d}\right) \quad (22)$$

Where d is the dimension of the model.

Next, the rotation matrices for the 2D absolute position encoding are calculated. For the pixel position $(p_x,p_y)$, the rotation matrices of its position encoding are:

$$R_{p_x,i}=\begin{pmatrix}\cos(p_x\theta_i)&-\sin(p_x\theta_i)\\ \sin(p_x\theta_i)&\cos(p_x\theta_i)\end{pmatrix} \quad (23)$$
$$R_{p_y,i}=\begin{pmatrix}\cos(p_y\theta_i)&-\sin(p_y\theta_i)\\ \sin(p_y\theta_i)&\cos(p_y\theta_i)\end{pmatrix} \quad (24)$$

Where $\theta_i=10000^{-2i/d}$ and $i=0,1,\ldots,d/2-1$.

Finally, for each pixel position $(p_x,p_y)$ and its embedding q, apply the rotation matrices to obtain the relative position encoding:

$$\tilde{q}^{(x)}=R_{p_x}\,q^{(x)} \quad (25)$$
$$\tilde{q}^{(y)}=R_{p_y}\,q^{(y)} \quad (26)$$

RoPE retains the intuitiveness of absolute position encoding while introducing the flexibility of relative position encoding, which boosts the model’s ability to process image data more effectively.
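The sketch below applies a 2D rotary embedding to attention queries or keys by splitting the head dimension in half and rotating one half by the row index and the other by the column index; this split-by-axis scheme and the base of 10000 are common conventions assumed for illustration, not implementation details confirmed by the paper.

```python
import torch

def rope_2d(q: torch.Tensor, h: int, w: int, base: float = 10000.0) -> torch.Tensor:
    """Apply 2D rotary position embedding to queries/keys of shape
    (N, heads, h*w, d). The head dimension d (assumed divisible by 4) is split in
    half: one half is rotated by the row index, the other by the column index."""
    n, heads, seq, d = q.shape
    half = d // 2
    # frequencies theta_i = base^{-2i/half} for each rotated pair of channels
    freqs = base ** (-torch.arange(0, half, 2, device=q.device).float() / half)
    ys, xs = torch.meshgrid(torch.arange(h, device=q.device),
                            torch.arange(w, device=q.device), indexing="ij")

    def rotate(t: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        ang = pos.flatten().float()[:, None] * freqs        # (seq, half/2)
        cos, sin = ang.cos(), ang.sin()
        t1, t2 = t[..., 0::2], t[..., 1::2]                 # even/odd channels
        out = torch.empty_like(t)
        out[..., 0::2] = t1 * cos - t2 * sin
        out[..., 1::2] = t1 * sin + t2 * cos
        return out

    q_row, q_col = q[..., :half], q[..., half:]
    return torch.cat([rotate(q_row, ys), rotate(q_col, xs)], dim=-1)
```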

At the end of the multi-attention module, we enhance the denoising U-Net through a cross-attention module. The feature vectors processed by the self-attention module, along with the class encoding and time step, are input into the model. This approach enables the model to learn class-conditional information more effectively.
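As an illustration of this conditioning step, the sketch below lets image tokens attend to a combined class-and-time embedding via cross-attention; the head count, dimensions, and residual wiring are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ClassConditionedCrossAttention(nn.Module):
    """Image tokens form the queries; the class/time embedding forms keys and values."""
    def __init__(self, dim: int, cond_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # tokens: (N, h*w, dim) image features; cond: (N, 1, cond_dim) class+time embedding
        out, _ = self.attn(query=tokens, key=cond, value=cond)
        return tokens + out  # residual connection
```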

Experiments

Given that the original experimental subjects and environments differ across various models, this paper re-evaluates the results using the same experimental subjects and environment. In this section, we conduct experiments on two public datasets and one private dataset.

Datasets

This subsection introduces the three datasets used in the experiments.

Chest CT-Scan images1 This public dataset is a cancer classification dataset from Kaggle, containing 1,000 chest CT slice images. It includes 338 images of adenocarcinoma, 187 images of large cell carcinoma, 260 images of squamous cell carcinoma, and 215 images of normal cells. The images are single-channel grayscale images.

Chest X-Ray Images (Pneumonia) [43] This public dataset is for pneumonia detection and contains 5,863 validated chest X-ray images, including 1,583 images labeled as normal and 4,273 images labeled as pneumonia. These images are divided into training, validation, and test sets. The chest X-rays are selected from a retrospective cohort of 1- to 5-year-old patients at the Guangzhou Women and Children’s Medical Center. All chest X-ray images were taken as part of routine clinical care. The images are single-channel grayscale images.

Tongue Dataset This is an in-house clinical tongue diagnosis image dataset. Informed consent was obtained from the subjects, and the images were captured using professional photography equipment, with annotations provided by experienced traditional Chinese medicine practitioners. The dataset consists of 1,283 images, categorized into five classes based on tongue coating color: 659 images of thick white coating, 121 images of thin white coating, 20 images of thin yellow coating, 413 images of thick yellow coating, and 70 images of gray-black coating.

In the data preprocessing of the above three datasets, all images were resized to 64 × 64 pixels and normalized to a distribution with a mean of 0 and a variance of 1.
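A torchvision-style preprocessing pipeline matching this description might look as follows; the 0.5/0.5 normalisation constants are placeholders, and in practice per-dataset channel statistics would be used to reach zero mean and unit variance.

```python
from torchvision import transforms

# Resize to 64x64, convert to a tensor, and normalize single-channel images.
preprocess = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),  # placeholder statistics
])
```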

Evaluation criteria

To objectively evaluate the performance of the network model, FID (Fréchet Inception Distance), KID (Kernel Inception Distance) [44], Precision, and Recall are used for quantitative assessment. According to the findings of [45], the implementations of FID and KID differ across various libraries, leading to significant discrepancies in the results. Therefore, we uniformly use the torchmetrics [46] library to calculate FID and KID.

FID This metric evaluates the quality of generated images by comparing their distribution with that of real images in the latent space of the Inception-V3 model. When processed through Inception-V3, the images produce 2,048-dimensional feature vectors, which can be approximated as following a normal distribution. The difference between these two multivariate distributions is then measured using the Fréchet distance. The Fréchet distance quantifies the distance between the distributions by comparing their means and covariances. FID is calculated using the following formula. A smaller FID value indicates that the distributions of the generated and real images are more similar, meaning that the generated images appear more realistic.

$$\mathrm{FID}=\left\|\mu_x-\mu_g\right\|^2+\mathrm{Tr}\!\left(\Sigma_x+\Sigma_g-2\left(\Sigma_x\Sigma_g\right)^{1/2}\right) \quad (27)$$

where Tr represents the trace of the matrix, the subscripts x and g refer to the real and generated images, $\mu_x$ and $\mu_g$ represent the means, and $\Sigma_x$ and $\Sigma_g$ represent the covariance matrices.

KID Unlike FID, KID does not require the assumption of a normal distribution. KID computes the kernel function’s squared Maximum Mean Discrepancy (MMD) between the generated and real images in the Inception feature representation space, as shown in the following:

$$\mathrm{KID}=\mathrm{MMD}^2(X,Y)=\frac{1}{m(m-1)}\sum_{i\neq j}k(x_i,x_j)+\frac{1}{n(n-1)}\sum_{i\neq j}k(y_i,y_j)-\frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}k(x_i,y_j) \quad (28)$$

Where $X=\{x_1,\ldots,x_m\}$ represents the real image sample set, containing m samples, and $Y=\{y_1,\ldots,y_n\}$ represents the generated image sample set, containing n samples. Each x and y is a 2,048-dimensional vector derived from the Inception network, and $k(\cdot,\cdot)$ represents the kernel function (a degree-3 polynomial kernel in the standard KID formulation), where d = 2048 is the dimension of the feature vector.
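Since the paper computes both metrics with torchmetrics [46], a minimal usage sketch is shown below; the 2048-feature Inception layer and the KID subset size are illustrative settings rather than the paper's exact configuration.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)  # subset_size must not exceed the sample count

def update_metrics(real: torch.Tensor, fake: torch.Tensor) -> None:
    """Both metrics expect uint8 images of shape (N, 3, H, W) by default."""
    fid.update(real, real=True); fid.update(fake, real=False)
    kid.update(real, real=True); kid.update(fake, real=False)

# After feeding all batches:
# fid_value = fid.compute()
# kid_mean, kid_std = kid.compute()
```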

Precision and Recall Precision and Recall [47] metrics are used to evaluate the fidelity and diversity of the generated images, respectively.

First, samples are drawn from both the real image dataset and the generated image dataset, and their feature representations are extracted using a pre-trained Inception-V3 model. Let the feature representations of the real images be denoted as $\phi_r$, and the feature representations of the generated images be denoted as $\phi_g$. The corresponding sets of feature vectors are $\Phi_r$ and $\Phi_g$.

For each set of feature vectors $\Phi\in\{\Phi_r,\Phi_g\}$, the corresponding manifold in the feature space is estimated as follows:

  1. The pairwise Euclidean distances between all feature vectors in the set are calculated.

  2. For each feature vector, a hypersphere is formed with a radius equal to its distance to the k-th nearest neighbor.

  3. These hyperspheres collectively define a volume in the feature space, which serves as an estimate of the true manifold.

A binary function $f(\phi,\Phi)$ is defined to determine whether a given sample $\phi$ lies within the estimated manifold:

$$f(\phi,\Phi)=\begin{cases}1,&\text{if }\exists\,\phi'\in\Phi\ \text{such that}\ \left\|\phi-\phi'\right\|_2\le\left\|\phi'-\mathrm{NN}_k(\phi',\Phi)\right\|_2\\0,&\text{otherwise}\end{cases} \quad (29)$$

Where $\mathrm{NN}_k(\phi',\Phi)$ returns the k-th nearest neighbor feature vector of $\phi'$ in the set $\Phi$.

Precision quantifies whether each generated image lies within the estimated manifold of the real images by checking:

$$\mathrm{Precision}(\Phi_r,\Phi_g)=\frac{1}{|\Phi_g|}\sum_{\phi_g\in\Phi_g}f(\phi_g,\Phi_r) \quad (30)$$

Recall quantifies whether each real image lies within the estimated manifold of the generated images by checking:

$$\mathrm{Recall}(\Phi_r,\Phi_g)=\frac{1}{|\Phi_r|}\sum_{\phi_r\in\Phi_r}f(\phi_r,\Phi_g) \quad (31)$$

In summary, the feature representations of both generated and real images are extracted using a pre-trained classification network. The manifolds of these feature sets are estimated by computing the nearest neighbors for each feature vector to determine if a particular feature lies within the manifold. Precision measures the fidelity of generated samples by calculating whether each generated feature falls within the estimated manifold of the real features, i.e., the fraction of generated samples that lie within the real data distribution. Conversely, Recall measures the diversity of generated samples by assessing whether each real feature falls within the estimated manifold of the generated features, i.e., the fraction of the real data distribution that is covered by the generated samples.
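A compact sketch of this k-NN manifold estimate is given below; k = 3 follows the common choice in [47] and is an assumption here rather than a value reported in the paper.

```python
import torch

def knn_radii(feats: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Radius of each feature's hypersphere: distance to its k-th nearest neighbour."""
    d = torch.cdist(feats, feats)           # pairwise Euclidean distances
    return d.kthvalue(k + 1, dim=1).values  # +1 skips the zero self-distance

def coverage(samples: torch.Tensor, manifold: torch.Tensor, k: int = 3) -> float:
    """Fraction of `samples` falling inside the manifold estimated from `manifold`
    (Eqs. 29-31). Precision = coverage(generated, real); Recall = coverage(real, generated)."""
    radii = knn_radii(manifold, k)          # (M,)
    d = torch.cdist(samples, manifold)      # (N, M)
    inside = (d <= radii[None, :]).any(dim=1)  # within at least one hypersphere
    return inside.float().mean().item()
```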

Implementation details

The experiments were carried out on an NVIDIA Quadro RTX 6000 GPU, running an Ubuntu 16.04 system, with training conducted using PyTorch 1.11.0. We utilized the AdamW optimizer [48], with a quadratic schedule and a learning rate of 1e-4. The models were trained from scratch for a total of 3000 epochs. To generate images, we used the DPM-Solver sampling algorithm [49] with 50 sampling steps.

For comparison, we implemented several baseline models based on their official repositories and referenced publications. DDPM (+CFG) [9, 23] was sampled using 1000 steps. medfusion [12], a diffusion model designed for medical image synthesis, was sampled using 200 steps. For DiT, we adopted the DiT-B/4 [37], which consists of 12 transformer blocks and was sampled using 200 steps. To ensure a fair comparison, all models were trained from scratch for 3000 epochs, using a classifier-free guidance scale of 2 during image generation.

Quantitative comparison on generated images

Performance comparisons of generated images by MedIENet and other competing networks were conducted across three datasets: Chest CT-Scan images, Chest X-Ray Images (Pneumonia), and Tongue.

The visual results of the experiments are shown below. Specifically, Fig. 4 presents images from the Chest CT-Scan images dataset. Figure 5 shows images from the Chest X-Ray Images (Pneumonia) dataset. Figure 6 illustrates images from the Tongue dataset.

Fig. 4.

Fig. 4

Visualization of images from the chest CT-Scan image dataset, categorized into adenocarcinoma, large cell carcinoma, squamous cell carcinoma, and normal CT

Fig. 5.

Fig. 5

Visualization of images from the chest X-Ray images (pneumonia) dataset, categorized into normal X-rays and pneumonia

Fig. 6.

Fig. 6

Visualization of images from the Tongue dataset, classified based on tongue coating colors into white greasy, thin white, thin yellow, yellow greasy, and dark gray

In the Chest CT-Scan images dataset, MedIENet-generated images distinctly highlight the pathological areas, while DDPM-generated images of squamous cell carcinoma appear to lack clarity. For the Chest X-Ray Images (Pneumonia) dataset, the normal images are clean and clear, whereas the pneumonia images are somewhat blurry and murky. Several models achieved good results on this dataset; however, the images generated by DIT are less clear in the normal category, and the letter “R” in the generated images is not distinct.

In the Tongue dataset, where subtle tongue coating colors require precise feature extraction, MedIENet-generated images closely resemble real images in terms of color features and appear more realistic. In contrast, DIT-generated images of thin white and gray-black coatings, and DDPM-generated gray-black coatings, seem less convincing.

In summary, the images generated by MedIENet are more realistic and detailed in terms of feature structure and fine textures across all datasets.

As shown in Table 1, the performance metrics for MedIENet are significantly superior to those of other networks. Specifically, MedIENet achieved an FID score of 23.27 on the Chest CT-Scan images dataset, 8.08 on the Chest X-Ray Images (Pneumonia) dataset, and 17.13 on the Tongue dataset. These results surpass those of other networks, indicating that the images generated by MedIENet have higher fidelity and are more realistic. Notably, the FID score on the Chest X-Ray Images (Pneumonia) dataset is much lower than those on the other two datasets, suggesting that the generated X-ray images closely resemble real images. This could be attributed to the larger amount of training data in the Chest X-Ray Images (Pneumonia) dataset, as well as the fact that these images have only one channel, making them easier to model and train. Additionally, the distinct differences between pneumonia X-ray images and normal images may enable MedIENet to better capture and learn the detailed features of image categories.

Table 1.

Quantitative comparison of image generation

Dataset Model FID↓ KID↓ Precision↑ Recall↑
Chest CT Scan images DDPM(+CFG) [9, 23] 40.19 0.0167 0.473 0.299
medfusion [12] Inline graphic Inline graphic 0.860 0.739
DIT [37] 33.60 0.0103 0.270 0.666
MedIENet(our) 23.27 0.0026 Inline graphic Inline graphic
Chest X-Ray Images (Pneumonia) DDPM(+CFG) 12.54 0.0094 0.659 0.609
medfusion Inline graphic Inline graphic Inline graphic Inline graphic
DIT 12.80 0.0118 0.812 0.391
MedIENet(our) 8.08 0.0043 0.977 0.757
Tongue Dataset DDPM(+CFG) 24.63 Inline graphic 0.734 0.495
medfusion Inline graphic 0.0129 Inline graphic Inline graphic
DIT 25.81 0.0129 0.610 0.603
MedIENet(our) 17.13 0.0078 0.947 0.697

Bold indicates the best performance value and underlined indicates the second-best value

On the Tongue dataset, MedIENet’s FID score was superior to that of other networks, demonstrating its strong performance in handling tongue images with complex textures and color variations. This suggests that MedIENet effectively captures and generates complex medical features.

MedIENet achieved the highest Recall on Chest X-Ray (Pneumonia) and Tongue datasets, and ranked second on Chest CT-Scan. This shows its generated images have both high fidelity and diversity, effectively covering real data distribution. Figure 7 intuitively illustrates the performance of each model on different datasets.

Fig. 7.

Fig. 7

Radar chart of the synthetic images

Ablation study

To better understand the contributions of each component within the proposed Multi-attention module, we conducted ablation experiments using both the publicly available Chest X-Ray Images (Pneumonia) dataset and the in-house Tongue dataset. The following models were compared to evaluate their performance:

  • Baseline: A model without the Multi-Attention module, which uses a standard U-Net to predict noise.

  • Model 1: Baseline + self-attention.

  • Model 2: Baseline + self-attention (RoPE).

  • Model 3: Baseline + CBAM.

  • Model 4: Baseline + cross-attention.

  • Model 5: Baseline + self-attention + cross-attention.

  • Model 6: Baseline + self-attention (RoPE) + cross-attention.

  • Model 7: Baseline + CBAM + self-attention + cross-attention.

  • Multi-attention: Baseline + CBAM + self-attention (RoPE) + cross-attention.

The experimental results, as shown in Table 2, demonstrate that all models performed better than the Baseline across the datasets, indicating that each component contributes positively to the Baseline model’s effectiveness. Specifically, Model 1’s performance improvement suggests that the self-attention mechanism enhances the network’s ability to capture global structures and fine details, making the generated images more realistic. Model 2, with the addition of RoPE, further strengthens the self-attention mechanism, allowing for better capture of the relative positional relationships within the image data, leading to improved image quality. Model 3’s significant performance gain demonstrates that CBAM enhances the representation of feature maps by emphasizing critical features and positions. Model 4’s introduction of the cross-attention mechanism enables the model to better integrate conditional information during image generation, resulting in images that more accurately match the input conditions. Model 5 and Model 6 combine these two attention mechanisms, allowing the network to generate high-quality images that are highly consistent with the input conditions. This consistency is crucial for image generation, where both quality and global coherence are essential. Overall, these results demonstrate that the integration of various attention mechanisms significantly enhances the model’s ability to generate images with improved fidelity and consistency.

Table 2.

Ablation experiment results

Dataset Model FID KID Precision Recall
Chest X-Ray Images (Pneumonia) Baseline 36.45 0.0349 0.657 0.497
Model 1 27.96 0.0264 0.742 0.658
Model 2 24.64 Inline graphic 0.788 0.674
Model 3 32.35 0.0322 0.665 0.499
Model 4 33.59 0.0327 0.700 0.578
Model 5 27.04 0.0246 0.763 0.626
Model 6 26.81 0.0246 0.741 0.628
Model 7 Inline graphic 0.0246 0.793 0.649
Multi-attention(Ours) 25.16 0.0213 Inline graphic Inline graphic
Tongue Dataset Baseline 31.79 0.0254 0.750 0.396
Model 1 23.19 0.0139 0.867 0.515
Model 2 21.76 0.0124 0.853 0.580
Model 3 22.38 0.0128 0.881 0.480
Model 4 26.21 0.0186 0.869 0.487
Model 5 21.74 0.0126 0.867 0.552
Model 6 21.64 0.0127 0.875 0.549
Model 7 Inline graphic Inline graphic Inline graphic Inline graphic
Multi-attention(Ours) 19.19 0.0093 0.910 0.628

Bold indicates the best performance value and underlined indicates the second-best value

We found that the Multi-attention module, which combines all components, exhibited the most significant advantage, demonstrating that the joint contributions of these components positively impact the network’s performance. This integration leads to better fidelity, consistency, and an ability to produce realistic images that maintain both quality and structural integrity.

Downstream classification experiment

In this section, we quantitatively compare the quality of samples generated by different models to evaluate their applicability in downstream classification tasks. To do this, the training data is divided into two groups: real data and a mixture of real and generated data (in a 1:1 ratio). These datasets are then used to train a ResNet50 classification network. It is important to note that only real data were used in the test set to ensure a fair evaluation of classification performance. The performance is evaluated using the same test set across all experiments. The classification results, as presented in Table 3, show how well each model’s generated data contributes to the overall classification accuracy. In the table, the best results are highlighted in bold, and the second-best results are underlined, indicating which models perform most effectively in augmenting the training data for classification tasks.
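A minimal sketch of this mixed-data setup is shown below; the folder paths, batch size, and optimizer settings are hypothetical and only illustrate combining real and generated images in a 1:1 ratio before training ResNet50.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

# Hypothetical folder layout: one sub-folder per class, with generated images
# saved under the same class names as the real ones.
tf = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])
real_ds = datasets.ImageFolder("data/real/train", transform=tf)
gen_ds = datasets.ImageFolder("data/generated/train", transform=tf)  # 1:1 with real data

loader = DataLoader(ConcatDataset([real_ds, gen_ds]), batch_size=64, shuffle=True)

model = models.resnet50(num_classes=len(real_ds.classes))  # trained from scratch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

for images, labels in loader:  # one illustrative epoch
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```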

Table 3.

Classification results

Dataset Model AUROC Precision Accuracy F1 score
Chest CT Scan images Real data 0.759 0.580 0.540 0.576
DDPM(+CFG) 0.804 0.635 0.568 0.602
DIT 0.785 0.626 0.530 0.571
medfusion Inline graphic Inline graphic 0.597 Inline graphic
MedIENet(our) 0.818 0.673 0.597 0.633
Chest X-Ray Images (Pneumonia) Real data 0.869 0.864 0.750 0.674
DDPM(+CFG) Inline graphic Inline graphic 0.786 0.730
DIT 0.891 0.896 0.740 0.658
medfusion 0.922 0.925 0.766 0.695
MedIENet(our) 0.935 0.939 Inline graphic Inline graphic
Tongue Dataset Real data 0.784 0.659 0.677 0.342
DDPM(+CFG) Inline graphic 0.661 0.661 0.397
DIT 0.780 0.668 0.606 0.329
medfusion 0.798 Inline graphic 0.649 Inline graphic
MedIENet(our) 0.823 0.690 0.677 0.340

Bold indicates the best performance value and underlined indicates the second-best value

From the results, it is evident that classification models trained with mixed real and generated data generally outperform the Baseline model, particularly in terms of AUROC and Precision. Among the models, MedIENet significantly stands out. In the Chest CT-Scan images dataset, MedIENet surpasses all other models in all four metrics, with an AUROC of 0.818 and a Precision of 0.673, indicating that its generated samples are of the highest quality, greatly benefiting the classification task. For the Chest X-Ray Images (Pneumonia) dataset, although DDPM achieves the highest F1 score, MedIENet demonstrates the best performance in AUROC (0.935) and Precision (0.939), positioning it as the overall superior model. Similarly, in the Tongue dataset, while DDPM achieves the highest F1 score, MedIENet performs best in AUROC (0.823) and Precision (0.690), indicating its significant advantage in handling complex textures and color variations.

In summary, classification models trained with mixed data consistently outperform the Baseline model in AUROC and Precision, demonstrating that the generative models effectively learned the feature distribution of the original datasets. This significantly improves classification performance. Notably, the data generated by MedIENet consistently enhances classification tasks across all three datasets, showing remarkable utility.

Discussion

In this work, we propose a conditional latent diffusion model based on multi-attention mechanisms. The model has demonstrated its capability to generate high-quality images, which can supplement training datasets for specific tasks, reducing the need for extensive data collection while safeguarding patient privacy. Additionally, the experimental results confirm that mixed data can enhance the performance of CNN-based classification tasks.

To further examine whether MedIENet effectively learns the features of the training data, we compared the feature distributions of generated images, real images, and images produced by other generative models using T-SNE (t-Distributed Stochastic Neighbor Embedding). Specifically, we first trained a ResNet network using only real data, then extracted features from the three types of images (real, MedIENet-generated, and other model-generated) with this trained network, and finally visualized these features using T-SNE.
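A sketch of this visualisation procedure is given below; the feature extractor (a ResNet with its classification head replaced by an identity layer) and the t-SNE settings are illustrative assumptions.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def extract_features(model: torch.nn.Module, loader) -> np.ndarray:
    """Penultimate-layer features from a ResNet trained on real data only
    (assumes model.fc has been replaced with torch.nn.Identity())."""
    model.eval()
    return np.concatenate([model(x).cpu().numpy() for x, _ in loader])

# real_feats and gen_feats are (N, D) arrays produced by extract_features(...)
# embedded = TSNE(n_components=2, perplexity=30).fit_transform(
#     np.concatenate([real_feats, gen_feats]))
# The 2D embedding can then be scatter-plotted, coloured by class and by
# real-versus-generated origin, as in Fig. 8.
```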

Figure 8 presents the results for three datasets: (1) Chest CT-Scan images—red: adenocarcinoma, green: large cell carcinoma, cyan: squamous cell carcinoma, purple: normal; (2) Chest X-Ray Images (Pneumonia)—cyan: pneumonia, red: normal; and (3) Tongue dataset—red: greasy white tongue coating, green: thin white tongue coating, cyan: thin yellow tongue coating, blue: greasy yellow tongue coating, purple: gray-black tongue coating. T-SNE, as a nonlinear dimensionality reduction method, maps high-dimensional features into a two-dimensional space, enabling intuitive visualization of feature separability.

Fig. 8.

Fig. 8

t-SNE visualizations of feature distributions across three datasets

From these visualizations, it can be observed that MedIENet-generated samples of each category form distinct clusters in the feature space, comparable to the clustering patterns of real data. This indicates that MedIENet not only captures the characteristic features of the training data but also preserves inter-class differences during generation. It should be noted that t-SNE was used solely for qualitative visualization of feature distribution, and the primary model evaluation was based on quantitative performance metrics.

However, our research has some limitations. First, the image data were trained and generated at a low resolution, and the images were square-shaped due to computational resource constraints. Future work should investigate the model’s performance at higher resolutions. Second, the medical imaging domain often grapples with imbalanced class samples, small inter-class differences, and large intra-class variations. Under these conditions, our model struggles, as observed in Fig. 8, where some points from certain categories overlap with clusters from other classes. Learning features of underrepresented classes remains a challenge. Finally, the reliability of generated images used in medical classification tasks is of paramount importance. While incorporating class information during training to fit the conditional distribution of specific class samples offers limited improvements in reliability, generated samples may sometimes not match their labels. For instance, a diffusion model might generate an X-ray labeled as pneumonia, but the actual image resembles a healthy lung. Such errors in generated samples pose significant risks in clinical settings, greatly impacting clinicians’ trust in generative models. Therefore, improving the reliability of generated samples in classification tasks presents a substantial challenge that needs to be addressed in future work.

Conclusion

In this paper, we proposed MedIENet, a medical image enhancement network based on a conditional latent diffusion model, designed to generate high-quality medical images and alleviate the challenge of limited medical image datasets. To emphasize important features and positions and to enhance the expression capabilities of feature maps, we designed a multi-attention module within the encoder, allowing the model to more accurately capture critical features. Finally, we conducted experiments on three datasets to validate the generative performance of MedIENet. The experimental results demonstrate that MedIENet is capable of generating high-quality medical images, and the generated data also contribute to improved performance in downstream classification tasks.

Acknowledgements

Not applicable.

Author contributions

WY: Conceptualization, Writing – original draft & review & editing, Methodology, Formal analysis, Visualization, Validation. YF: Conceptualization, Investigation, Writing – original draft & review & editing, Supervision, Project administration, Funding acquisition. TW: Writing – review & editing. GL: Writing – review & editing. JL: Writing – review & editing. QS: Writing – review & editing, Validation. SL: Writing – review & editing.

Funding

This work was supported by the Basic Research and Applied Basic Research Key Project in General Colleges and Universities of Guang-dong Province (2021ZDZX1032); the International and Hong Kong-Macao-Taiwan High-end Talent Exchange Special Program of Guangdong Province (2020A1313030021); the Scientific Research Project of Wuyi University (2018GR003); the Jiangmen City Science and Technology Plan Project in the Social Development Field (2024).

Data availability

The tongue data used in this study is not publicly available but can be obtained from the corresponding author upon reasonable request.

Declarations

Ethics approval and consent to participate

This study was conducted in accordance with the principles of the Declaration of Helsinki. The Ethics Committee of Jiangmen Central Hospital reviewed and approved the study protocol and granted a waiver of informed consent (Approval No. [2019]18).

Consent for publication

Not applicable.

Conflict of interest

The authors declare that they have no competing interests.

Footnotes

1

kaggle.com/datasets/mohamedhanyyy/chest-ctscan-images/data.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Yue Feng, Email: J002443@wyu.edu.cn.

Shufen Liang, Email: J001790@wyu.edu.cn.

References

  • 1.Allaouzi I, Ahmed MB. A novel approach for multi-label chest x-ray classification of common thorax diseases. IEEE Access. 2019;7:64279–88. [Google Scholar]
  • 2.Kazerouni A, Aghdam EK, Heidari M, Azad R, Fayyaz M, Hacihaliloglu I, Merhof D. Diffusion models in medical imaging: a comprehensive survey. Med Image Anal. 2023;88:102846. [DOI] [PubMed]
  • 3.Pan S, Wang T, Qiu RL, Axente M, Chang C-W, Peng J, Patel AB, Shelton J, Patel SA, Roper J, et al. 2d medical image synthesis using transformer-based denoising diffusion probabilistic model. Phys Med & Biol. 2023;68(10):105004. [DOI] [PMC free article] [PubMed]
  • 4.Wang T, Lei Y, Fu Y, Wynne JF, Curran WJ, Liu T, Yang X. A review on medical imaging synthesis using deep learning and its clinical applications. J Appl Clin Med Phys. 2021;22(1):11–36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Salehinejad H, Colak E, Dowdell T, Barfett J, Valaee S. Synthesizing chest x-ray pathology for training deep convolutional neural networks. IEEE Trans Med Imag. 2018;38(5):1197–206. [DOI] [PubMed] [Google Scholar]
  • 6.Armanious K, Jiang C, Fischer M, Küstner T, Hepp T, Nikolaou K, Gatidis S, Yang B. Medgan: medical image translation using gans. Computerized Med Imag Graphics. 2020;79:101684. [DOI] [PubMed]
  • 7.Amirrajab S, Al Khalil Y, Lorenz C, Weese J, Pluim J, Breeuwer M. Label-informed cardiac magnetic resonance image synthesis through conditional generative adversarial networks. Computerized Med Imag Graphics. 2022;101:102123. [DOI] [PubMed]
  • 8.Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. International conference on machine learning. PMLR; 2017. pp. 214–23.
  • 9.Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Adv Neural Inf Process Syst. 2020;33:6840–51. [Google Scholar]
  • 10.Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. pp. 10684–95.
  • 11.Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis. Adv Neural Inf Process Syst. 2021;34:8780–94. [Google Scholar]
  • 12.Müller-Franzes G, Niehues JM, Khader F, Arasteh ST, Haarburger C, Kuhl C, Wang T, Han T, Nolte T, Nebelung S, et al. A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis. Sci Rep. 2023;13(1):12098. [DOI] [PMC free article] [PubMed]
  • 13.Moghadam PA, Van Dalen S, Martin KC, Lennerz J, Yip S, Farahani H, Bashashati A. A morphology focused diffusion probabilistic model for synthesis of histopathology images. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023, pp. 2000–09.
  • 14.Javaid U, Lee JA. Capturing variabilities from computed tomography images with generative adversarial networks. arXiv preprint arXiv:1805.11504. 2018.
  • 15.Woodland M, Wood J, Anderson BM, Kundu S, Lin E, Koay E, Odisio B, Chung C, Kang HC, Venkatesan AM, et al. Evaluating the performance of stylegan2-ada on medical images. In: International workshop on simulation and synthesis in medical imaging. Springer; 2022. pp. 142–53. [Google Scholar]
  • 16.Pang T, Wong JHD, Ng WL, Chan CS. Semi-supervised gan-based radiomics model for data augmentation in breast ultrasound mass classification. Comput Methods Programs Biomed. 2021;203:106018. [DOI] [PubMed]
  • 17.Ding H, Huang N, Wu Y, Cui X. Legan: addressing intra-class imbalance in gan-based medical image augmentation for improved imbalanced data classification. IEEE Transactions on Instrumentation and Measurement. 2024.
  • 18.Shin H-C, Tenenholtz NA, Rogers JK, Schwarz CG, Senjem ML, Gunter JL, Andriole KP, Michalski M. Medical image synthesis for data augmentation and anonymization using generative adversarial networks. Simulation and Synthesis in Medical Imaging: Third International Workshop, SASHIMI 2018, Held in Conjunction with MICCAI 2018. Granada, Spain: Springer; 2018, pp. 1–11, September 16, 2018, Proceedings 3.
  • 19.Isola P, Zhu J-Y, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 1125–34.
  • 20.Zhu J-Y, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2223–32.
  • 21.Muramatsu C, Nishio M, Goto T, Oiwa M, Morita T, Yakami M, Kubo T, Togashi K, Fujita H. Improving breast mass classification by shared data with domain transformation using a generative adversarial network. Comput Biol Med. 2020;119:103698. [DOI] [PubMed]
  • 22.Lopes L, Jiao F, Xue S, Pyka T, Krieger K, Ge J, Xu Q, Fahmi R, Spottiswoode B, Soliman A, et al. Dopaminergic pet to spect domain adaptation: a cycle gan translation approach. Eur J Nucl Med and Mol Imag. 2025;52(3):851–63. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Ho J, Salimans T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598. 2022.
  • 24.Pinaya WH, Tudosiu P-D, Dafflon J, Da Costa PF, Fernandez V, Nachev P, Ourselin S, Cardoso MJ. Brain imaging generation with latent diffusion models. In: MICCAI workshop on deep generative models. Springer; 2022. pp. 117–26. [Google Scholar]
  • 25.Dorjsembe Z, Odonchimed S, Xiao F. Three-dimensional medical image synthesis with denoising diffusion probabilistic models. Med Imag With Deep Learn. 2022.
  • 26.Packhäuser K, Folle L, Thamm F, Maier A. Generation of anonymous chest radiographs using latent diffusion models for training thoracic abnormality classification systems. 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). 2023, pp. 1–5. IEEE.
  • 27.Akrout M, Gyepesi B, Holló P, Poór A, Kincső B, Solis S, Cirone K, Kawahara J, Slade D, Abid L, et al. Diffusion-based data augmentation for skin disease classification: impact across original medical datasets to fully synthetic images. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2023, pp. 99–109.
  • 28.Wang J, Wang K, Yu Y, Lu Y, Xiao W, Sun Z, et al. Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat Med. 2025;31(2):609–17. [DOI] [PubMed] [Google Scholar]
  • 29.Lyu Q, Wang G. Conversion between ct and mri images using diffusion and score-matching models. arXiv preprint arXiv:2209.12104. 2022.
  • 30.Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456. 2020.
  • 31.Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved training of wasserstein gans. Advances in neural information processing systems. 2017;30.
  • 32.Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-assisted intervention–MICCAI 2015: 18th International Conference. Munich, Germany: Springer; 2015, pp. 234–41, October 5–9, 2015, Proceedings, Part III 18.
  • 33.Meng X, Gu Y, Pan Y, Wang N, Xue P, Lu M, He X, Zhan Y, Shen D. A novel unified conditional score-based generative framework for multi-modal medical image completion. arXiv preprint arXiv:2207.03430. 2022.
  • 34.Friedrich P, Wolleb J, Bieder F, Durrer A, Cattin PC. Wdm: 3d wavelet diffusion models for high-resolution medical image synthesis. In: MICCAI workshop on deep generative models. Springer; 2024. pp. 11–21. [Google Scholar]
  • 35.Woo S, Park J, Lee J-Y, Kweon IS. Cbam: convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.
  • 36.Hussain T, Shouno H, Mohammed MA, Marhoon HA, Alam T. Dcssga-unet: biomedical image segmentation with densenet channel spatial and semantic guidance attention. Knowl-Based Syst. 2025;314:113233.
  • 37.Peebles W, Xie S. Scalable diffusion models with transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE; 2023. pp. 4172–82.
  • 38.Bao F, Nie S, Xue K, Cao Y, Li C, Su H, Zhu J. All are worth words: a vit backbone for diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC, Canada: IEEE; 2023. pp. 22669–79.
  • 39.Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image process. 2004;13(4):600–12. [DOI] [PubMed] [Google Scholar]
  • 40.Zhang R, Isola P, Efros AA, Shechtman E, Wang O. The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, UT, USA: IEEE; 2018. pp. 586–95.
  • 41.Choi J, Lee J, Shin C, Kim S, Kim H, Yoon S. Perception prioritized training of diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE; 2022. pp. 11462–71.
  • 42.Huang Z, Zhou P, Yan S, Lin L. Scalelong: towards more stable training of diffusion model via scaling network long skip connection. Advances in Neural Information Processing Systems 36 (NeurIPS 2023). New Orleans, LA, USA; 2023.
  • 43.Kermany D, Zhang K, Goldbaum M, et al. Labeled optical coherence tomography (oct) and chest x-ray images for classification. Mendeley Data. 2018;2(2):651. [Google Scholar]
  • 44.Bińkowski M, Sutherland DJ, Arbel M, Gretton A. Demystifying mmd gans. arXiv preprint arXiv:1801.01401. 2018.
  • 45.Parmar G, Zhang R, Zhu J-Y. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222. 2021.
  • 46.Detlefsen NS, Borovec J, Schock J, Jha AH, Koker T, Di Liello L, Stancl D, Quan C, Grechkin M, Falcon W. Torchmetrics-measuring reproducibility in pytorch. J Open Source Softw. 2022;7(70):4101. [Google Scholar]
  • 47.Kynkäänniemi T, Karras T, Laine S, Lehtinen J, Aila T. Improved precision and recall metric for assessing generative models. Adv Neural Inf Process Syst. 2019.
  • 48.Kingma DP, Ba J. Adam: a method for stochastic optimization. International Conference on Learning Representations (ICLR). San Diego, California; 2015.
  • 49.Lu C, Zhou Y, Bao F, Chen J, Li C, Zhu J. Dpm-solver: a fast ode solver for diffusion probabilistic model sampling in around 10 steps. Adv Neural Inf Process Syst. 2022;35:5775–87. [Google Scholar]
