Computer Methods and Programs in Biomedicine. 2022 May 14;221:106883. doi: 10.1016/j.cmpb.2022.106883

WVALE: Weak variational autoencoder for localisation and enhancement of COVID-19 lung infections

Qinghua Zhou a, Shuihua Wang a, Xin Zhang b, Yu-Dong Zhang a
PMCID: PMC9107178  PMID: 35597203

Abstract

Background and objective: The COVID-19 pandemic is a major global health crisis of this century. The use of neural networks with CT imaging can potentially improve clinicians’ efficiency in diagnosis. Previous studies in this field have primarily focused on classifying the disease on CT images, while few studies targeted the localisation of disease regions. Developing neural networks for automating the latter task is impeded by limited CT images with pixel-level annotations available to the research community.

Methods: This paper proposes a weakly-supervised framework named “Weak Variational Autoencoder for Localisation and Enhancement” (WVALE) to address this challenge for COVID-19 CT images. This framework includes two components: anomaly localisation with a novel WVAE model and enhancement of supervised segmentation models with WVALE.

Results: The WVAE model has been shown to produce high-quality post-hoc attention maps with fine borders around infection regions, while its weakly supervised segmentation shows results comparable to conventional supervised segmentation models. The WVALE framework can enhance the performance of a range of supervised segmentation models, including state-of-the-art models for the segmentation of COVID-19 lung infection.

Conclusions: Our study provides a proof-of-concept for weakly supervised segmentation and an alternative approach to alleviate the lack of annotation, while its independence from classification & segmentation frameworks makes it easily integrable with existing systems.

Keywords: Anomaly localisation, Weak supervision, Pseudo data, Segmentation

1. Introduction

Since December 2019, COVID-19, the disease caused by the coronavirus SARS-CoV-2, has infected more than 250 million people worldwide. It is one of the most significant health crises of this century. Efficient testing and quarantine are essential to control transmission rates. Computed tomography (CT) is an essential diagnostic tool and a viable alternative to RT-PCR in regions with limited resources. CT can be used as an additional test to screen for PCR false negatives, as multiple studies have shown CT imaging to have higher sensitivity than PCR tests [1]. However, CT scans require visual diagnosis by radiologists, and the CT diagnosis workload would likely increase in regional hot spots, adding strain to current medical systems. A possible solution to this problem is deep learning, a branch of AI that utilises neural networks as universal approximators. The advantage of deep learning algorithms is their efficiency, as they only require seconds to minutes to provide an output. These algorithms can aid radiologists in the screening of COVID-19 and reduce the required level of expertise.

Recent deep learning studies have shown significant advances in COVID-19 classification performance. For example, the deep neural network COVID-Net designed by Wang, et al. [2] classifies COVID-19 from chest X-ray images with up to 91% sensitivity. CT scans, composed of multiple X-rays, provide more spatial information than a single chest X-ray; Harmon, et al. [3] applied a 3D CNN that achieved up to 90.8% accuracy and 84% sensitivity on multinational datasets. Most of these classification-oriented studies have either directly applied Grad-CAM [3] or its variants to generate post-hoc attention maps for interpretation [4], while few focused on segmentation and localisation of disease regions in COVID-19 positive images. Fan, et al. [5] proposed Inf-Net, which utilises reverse attention and explicit edge-attention modules to segment diseased regions of the lung. A weakly supervised model based on U-Net was proposed with an anomaly localisation component, which combines the U-Net’s CAM activation with lung regions extracted with a 3D connected component (3DCC) method to form localisation results [6]. These studies are often limited by the available number of samples and annotations. This challenge is especially severe for segmentation, where only a few datasets with detailed annotations are accessible.

One possible solution is the use of synthetic data. A recent study proposed a label-free segmentation solution (NormNet) that uses synthetic COVID-19 positive CT images, generated by introducing synthetic lesions into healthy control CT images, as training data for the nnU-Net model [7]. Another possible solution is weakly supervised learning with autoencoders, which only requires training data from the healthy control class. The underlying principle is learning the latent representation or distribution of the healthy control data through transformation and recovery of the original data. Anomaly localisation is then usually performed with one of three categories of methods: reconstruction, restoration and gradient-based [8]. Reconstruction identifies anomalies as regions of the image that the model cannot accurately reconstruct; restoration manipulates the latent space to create a healthy variant of the diseased image and identifies anomalies on the residual map; gradient-based methods rely on gradients obtained through back-propagation, mapped to the input as saliency maps [8].
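As an illustration of the first of these categories, the following is a minimal reconstruction-based sketch, assuming an autoencoder reconstruction is already available and that slices are scaled to [0, 1]; the anomaly map is simply the per-pixel residual, which can then be thresholded into a binary mask (the 0.1 threshold below is an arbitrary example value).

```python
import numpy as np

def reconstruction_anomaly_map(x, x_rec):
    """Toy reconstruction-based anomaly map: the per-pixel absolute residual
    between an input slice and its autoencoder reconstruction. Regions the
    model cannot reconstruct accurately receive high values."""
    return np.abs(x - x_rec)

# Hypothetical usage with placeholder arrays standing in for a CT slice
# and its reconstruction.
x = np.random.rand(224, 224).astype(np.float32)
x_rec = np.clip(x + 0.01 * np.random.randn(224, 224).astype(np.float32), 0, 1)
anomaly_mask = reconstruction_anomaly_map(x, x_rec) > 0.1
```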

This study proposes a weakly supervised framework (WVALE) for anomaly localisation of COVID-19 CT scans and its extension to address the lack of pixel-annotated data. This framework consists of two components: (1) localisation and (2) enhancement. The localisation component consists of a variant of the context-encoding variational autoencoder combined with a recently introduced gradient-based technique for improving visual explainability [9]. The gradient-based technique exploits the range of latent distributions learned from the data to generate multiple post-hoc attention maps. In the enhancement component, the combination of these attention maps provides a means to localise infected regions in lung CT and generate pseudo-samples. These pseudo-samples can be utilised to enhance the supervised segmentation of lung infection from COVID-19 positive CT images. Our contributions include:

  • (1) A weakly-supervised model (WVAE) for anomaly localisation with performance comparable to multiple supervised segmentation models.

  • (2) The application of a new gradient-based technique for variational autoencoders to the localisation of COVID-19 lung infection regions on CT images.

  • (3) Use of post-hoc attention maps to generate pseudo segmentation datasets for images without pixel-level annotations.

  • (4) A framework (WVALE) that utilises the above contributions for anomaly localisation and enhancement of supervised segmentation models.

The remainder of this paper is organised as follows: Section 2 introduces the methodology employed in this study, Section 2.4 details the datasets and the preprocessing pipeline, and Section 3 provides experimental results for anomaly localisation and segmentation performance enhancement.

2. Methods

Under our framework WVALE, we propose a neural network model named WVAE for weakly supervised anomaly localisation of COVID-19 CT images and performance enhancement of supervised segmentation models. The proposed model’s structure is based upon a variant of the context-encoding VAE [10] and the gradient-based interpretation method for the one-class explainable VAE [9]. The proposed model shows promising anomaly localisation performance with a small model size and limited training data. The following subsections introduce the structure (Section 2.1) and anomaly localisation technique (Section 2.2) of WVAE, and the WVALE framework for performance enhancement (Section 2.3).

2.1. Variational autoencoder structure

Autoencoders (AEs) are a classical neural network architecture composed of two main components: an encoder and a decoder. The encoder fe maps the original input xi into a latent representation zi, from which the decoder fd attempts to reconstruct the original input. A comparison between the reconstructed sample x′i and the original input xi provides a performance measure known as the reconstruction loss L, and back-propagation can be performed from this loss to update the model weights. A probabilistic variant of the AE is the variational autoencoder (VAE), which attempts to learn the latent distributions of the training data instead of latent representations z. For a VAE, a single sample of the available data x can be interpreted as a random sample from the true data distribution p(x), while the encoder can be represented as the conditional distribution q(z|x), an approximation to the true posterior distribution. The loss function is therefore

L = L1(x, x′) + LKL(q(z|x), p(z))    (1)

where L1 is the reconstruction loss between the original data x and the reconstructed data x′; LKL is the Kullback-Leibler divergence, which regularises the VAE and enforces the Gaussian prior p(z) = N(0, 1). Through this adjustment, the VAE learns latent variable distributions instead of representations. A more intuitive formulation is as follows:

μ = f1(fe(x))  and  σ = f2(fe(x))    (2)

where f1 and f2 represent two independent fully connected layers that map the features from the last convolutional layer to two vectors, μ and σ, the mean and standard deviation of the latent distributions. The latent representation can then be sampled through reparameterisation,

z = μ + σ · ε,  where ε ~ N(0, 1)    (3)

and then decoded to reconstruct the input, x′ = fd(z). Variational autoencoders have been applied in deep learning studies on medical images [11]. The VAE variant known as the context-encoding VAE includes an additional task in the training procedure: reconstructing the missing regions of a secondary input x̂, which has a patch of the image removed. Our weakly supervised model (WVAE) is constructed upon the context-encoding VAE and can therefore be considered one of its variants. This choice of base model was made via exploratory experiments with weakly-supervised methods, in which the VAE, context encoders (CE) and the context-encoding VAE all showed potential for anomaly localisation. For more rigorous experimentation, see Section 3.2.
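A minimal sketch of the reparameterisation step (Eq. (3)) and the loss of Eq. (1), written in PyTorch; parameterising σ through the log-variance is a common convention assumed here rather than stated in the text.

```python
import torch
import torch.nn.functional as F

def reparameterise(mu, log_var):
    """Eq. (3): z = mu + sigma * eps with eps ~ N(0, 1); sigma is recovered
    from the log-variance for numerical stability."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_rec, mu, log_var):
    """Eq. (1): L1 reconstruction term plus the KL divergence that enforces
    the N(0, 1) Gaussian prior on the latent distributions."""
    rec = F.l1_loss(x_rec, x)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return rec + kl
```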

The introduction of a perturbed secondary input x̂, with the task of restoring the original image, is the core idea of the context-encoding VAE. The perturbation need not be constrained to simple patch removal or to a single perturbation. For WVAE, we utilise a combination of the following perturbations:

  • Translation of up to 20 pixels along both the x- and y-axes

  • Patch-removal constrained within the lung regions of each specific CT slice

  • Mix-up, where a random image xr from the healthy controls is selected and overlaid upon the target image xt to generate a perturbed image xp:
    xp = β · xt + (1 − β) · xr
    where β ~ B(1, 1) is drawn from the Beta distribution (a minimal sketch of the full perturbation combination follows this list).
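The sketch below illustrates one plausible implementation of the three perturbations; the 32-pixel patch size and the exact patch-placement rule are assumptions, as they are not specified in the text.

```python
import numpy as np

def perturb(x_t, x_r, lung_mask, patch=32, rng=None):
    """Build the perturbed secondary input x_hat from a target slice x_t,
    a random healthy-control slice x_r and the binary lung mask of x_t."""
    rng = rng or np.random.default_rng()

    # 1. Translation of up to 20 pixels along both axes.
    dy, dx = rng.integers(-20, 21, size=2)
    x_p = np.roll(np.roll(x_t, dy, axis=0), dx, axis=1)

    # 2. Patch removal constrained to the lung region of this slice.
    ys, xs = np.nonzero(lung_mask)
    if len(ys):
        i = rng.integers(len(ys))
        cy, cx = ys[i], xs[i]
        x_p[max(cy - patch // 2, 0):cy + patch // 2,
            max(cx - patch // 2, 0):cx + patch // 2] = 0.0

    # 3. Mix-up with the healthy-control slice, beta drawn from Beta(1, 1).
    beta = rng.beta(1.0, 1.0)
    return beta * x_p + (1.0 - beta) * x_r
```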

The perturbation combination was selected based on the proxy task of reconstruction under various image augmentation methods, including pixel-mixing, non-linear intensity transformations, and various blurring or noising methods. As in the context-encoding VAE, the additional task of restoring the perturbed input x̂ can be represented as an additional loss component L1(x, x̂′), which measures the restoration of context, in the loss function:

L = L1(x, x′) + LKL(q(z|x), p(z)) + L1(x, x̂′)    (4)

where x̂′ is the reconstructed output of the secondary input, generated using ẑ = μ̂ = f1(fe(x̂)). The combined perturbation used by WVAE builds upon an approach found to improve performance in unsupervised anomaly detection [10]. This study trains WVAE with a shallow convolutional neural network backbone. For weakly supervised learning, the neural network is trained with CT slices of healthy control subjects only.
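A minimal sketch of the full WVAE loss of Eq. (4), assuming the reconstruction of the clean input (x_rec), the restoration of the perturbed secondary input (x_hat_rec) and the latent statistics are already computed; equal weighting of the three terms is an assumption.

```python
import torch
import torch.nn.functional as F

def wvae_loss(x, x_rec, x_hat_rec, mu, log_var):
    """Eq. (4): L1 reconstruction of the clean input, KL regularisation of the
    latent distributions, and L1 restoration of the perturbed secondary input
    back towards the original image x."""
    rec = F.l1_loss(x_rec, x)            # L1(x, x')
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    restore = F.l1_loss(x_hat_rec, x)    # L1(x, x_hat')
    return rec + kl + restore
```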

2.2. Gradient-based anomaly localisation

Another component of WVAE is anomaly localisation based on a variation of gradient-weighted class activation mapping (Grad-CAM), one of the most commonly used techniques for post-hoc visualisation of neural network attention [12]. Its reliance on gradient information removes CAM’s dependency on the neural network structure and allows the visualisation of feature maps from various layers throughout the network. It can be formulated as

Mc = ReLU( Σk (1/Z) Σm=1…h Σn=1…w (∂yc / ∂Ak(m,n)) · Ak )    (5)

where Mc is the attention map for class c with predicted score yc, Ak is the k-th feature map of a chosen convolutional layer, Ak(m,n) is its element at position (m, n), and (h, w) are the dimensions of the feature map; the summation over these dimensions, normalised by Z = h·w, corresponds to global average pooling. The one-class explainable VAE [9] replaces the predicted score yc in the partial derivative with the reparameterised latent representations zi, for i = 1, …, n:

Mi = ReLU( Σk (1/(h·w)) Σm=1…h Σn=1…w (∂zi / ∂Ak(m,n)) · Ak )    (6)

This technique exploits the range of the n latent distributions to produce n redundant attention maps Mi, which can be aggregated into a single post-hoc attention map M = (1/n) Σi Mi. We integrate this technique into WVAE for anomaly localisation. The schematics of this technique with the variational autoencoder structure are shown in Fig. 1. Given a COVID-19 positive slice, the infection regions cause a large deviation from the normal distribution learned from the control slices, which presents as concentrated regions on the post-hoc attention maps, as shown in Fig. 2(a).
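The sketch below shows one way to realise Eq. (6) in PyTorch: the feature maps of a chosen convolutional layer are captured with a forward hook, the gradient of each latent mean is taken with respect to them, and the n resulting maps are aggregated by averaging. The interface (encoder, fc_mu, target_layer) is an assumed decomposition of the model, not the authors' code.

```python
import torch
import torch.nn.functional as F

def latent_attention_maps(encoder, fc_mu, x, target_layer):
    """Eq. (6) sketch: one Grad-CAM-style map per latent mean z_i = mu_i,
    weighted by the global-average-pooled gradients of z_i w.r.t. the feature
    maps A of `target_layer`, then aggregated into a single map M."""
    feats = {}

    def _hook(module, inputs, output):
        feats["A"] = output                       # capture the feature maps A^k

    handle = target_layer.register_forward_hook(_hook)
    mu = fc_mu(torch.flatten(encoder(x), 1))      # latent means, shape (1, n)
    handle.remove()

    A = feats["A"]                                # shape (1, K, h, w)
    maps = []
    for i in range(mu.shape[1]):
        grads = torch.autograd.grad(mu[0, i], A, retain_graph=True)[0]
        weights = grads.mean(dim=(2, 3), keepdim=True)   # global average pooling
        maps.append(F.relu((weights * A).sum(dim=1, keepdim=True)))

    M = torch.stack(maps).mean(dim=0)             # aggregate the n redundant maps
    return F.interpolate(M, size=x.shape[-2:], mode="bilinear", align_corners=False)
```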

Fig. 1.

Schematics of WVAE which combines a context-encoding variational autoencoder variant and a gradient-based technique that utilises the range of latent distributions learned by the VAE for anomaly localisation.

Fig. 2.

Examples of anomaly localisation results for 6 CT images, including images with COVID-19 infection (rows 1–3) and images from healthy controls (rows 4–6). From left to right: preprocessed CT image, element-wise post-hoc attention maps for four example latent distributions zi (i = 1, …, 4), aggregated attention map, segmentation output and ground truth infection mask. For the attention maps, warmer colours indicate regions of anomalies. The slices of healthy controls in Fig. 2 are presented for qualitative evaluation of errors in anomaly localisation and are not included in the training or evaluation of the subsequent supervised segmentation methods.

2.3. Localisation and segmentation enhancement

The anomaly localisation with the VAE can easily be extended to weakly supervised segmentation by applying a threshold to the aggregated post-hoc attention maps to generate binary outputs (i.e., segmentation outputs). This study separated a subset of images from a segmentation dataset with pixel-level annotations to calculate this threshold. The threshold can then be used to generate binary segmentation outputs for the remaining images, which are in turn used to calculate segmentation metrics against the corresponding pixel-level annotation maps. This provides a quantitative means to measure the performance of anomaly localisation, while clinicians perform the qualitative evaluation.
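A minimal sketch of this threshold selection, assuming the aggregated attention maps and ground-truth masks of the annotated subset are available as NumPy arrays; the candidate threshold grid is an assumption.

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def find_threshold(attn_maps, gt_masks, candidates=np.linspace(0.05, 0.95, 19)):
    """Pick the threshold on the aggregated attention maps that maximises the
    mean Dice score over the annotated subset."""
    scores = [np.mean([dice(a > t, g > 0) for a, g in zip(attn_maps, gt_masks)])
              for t in candidates]
    return candidates[int(np.argmax(scores))]
```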

Another intuitive extension of WVAE is the generation of pseudo segmentation samples from images without pixel-level annotations. Segmentation models can integrate these pseudo samples into the training procedure to enhance the performance of supervised segmentation models. Under the WVALE framework, we experiment with two approaches to utilising the pseudo samples produced by WVAE: pre-training and re-training. In the pre-training approach, the pseudo samples are first mixed with the original segmentation samples to pre-train the segmentation model; the pre-trained model is then fine-tuned on the original segmentation samples. The same procedure applies to the re-training method, except that instead of starting from a randomly initialised model, we use base models already trained on the original segmentation samples.

An important note for these approaches is that, instead of calculating the threshold based on the segmentation dataset, we apply a hyper-intensity prior to the aggregated attention maps to generate the pseudo segmentation masks. This prevents even the slightest contamination from the threshold parameter. We performed both the pre-training and re-training procedures on a range of supervised segmentation models, including state-of-the-art models for COVID-19 lung infection region segmentation.
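The two enhancement variants share the same two-stage procedure and differ only in how the model is initialised; a minimal sketch is given below, where train_fn(model, data) stands in for any supervised segmentation training loop and the list-based data interface is an assumption.

```python
def wvale_enhance(model, pseudo_data, real_data, train_fn):
    """WVALE enhancement sketch. For pre-training, `model` is randomly
    initialised; for re-training, it is a baseline already trained on the
    real pixel-annotated samples. `pseudo_data` and `real_data` are lists of
    (image, mask) pairs."""
    train_fn(model, pseudo_data + real_data)   # stage 1: train on the mixture
    train_fn(model, real_data)                 # stage 2: fine-tune on originals
    return model
```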

2.4. Data and preprocessing

2.4.1. Main dataset

This study’s main dataset was acquired from patient admissions and healthy medical examinees at the Fourth People’s Hospital of Huaian City. The set of CT scans includes 66 patients diagnosed with COVID-19 (mean age, 49.48 ± 14.71 years; range, 23–91 years; 44 male, 22 female) and 66 healthy medical examinees as controls (mean age, 38.44 ± 10.58 years; range, 25–72 years; 38 male, 28 female). Consent was acquired from all participants. Positive and negative nucleic acid (NA) tests were used as inclusion criteria for patients and controls, respectively. Demographics of the subjects are listed in Table 1. The CT imaging of all participants was performed using a Philips Ingenuity 64-row spiral CT machine, with imaging parameters specified in the Appendix. Patients were placed in a supine position and held their breath after a deep inhalation. The scan covers the region from the lung apex to the costophrenic angle.

Table 1.

Demographics of main dataset.

Healthy controls COVID-19 patients
Male Subjects 38 44
Female Subjects 28 22
Mean Age (years) 38.44 ± 10.58 49.48 ± 14.71
Age Range (years) 25–72 23–91
Slices 148 148

For each scan of a COVID-19 patient, four slices were chosen by radiologists to encompass regions of lesions; random slices were chosen for controls. Annotations were performed at slice level by two junior radiologists and one senior radiologist, with majority voting used to determine final labels. The main dataset is split into three sets: training, validation and test. The slices from the 51 healthy controls in the main dataset are used only for training the weakly-supervised models. The validation and test sets were split prior to training for hyper-parameter optimisation and independent evaluation, respectively. A subject-level split is used for the test set to prevent model bias towards the anatomies of individual subjects. This test set from the main dataset allows for qualitative evaluation of anomaly segmentation.

2.4.2. Segmentation dataset

Segmentation in this study was performed on an open-access COVID-19 CT segmentation dataset.1 The dataset consists of 100 axial CT images from more than 40 COVID-19 patients. These are open-access images collected by the Italian Society of Medical and Interventional Radiology. Radiologists segmented the images to identify anomaly regions indicative of lung infection: ground-glass opacifications, consolidations and pleural effusions. It is one of the few publicly available datasets with pixel-level annotations.

For the two components of the WVALE framework, localisation and enhancement, the dataset is split in two different ways for two different methods of evaluation. (1) For the quantitative evaluation of WVAE for anomaly localisation, we split the dataset into two subsets to mimic the hold-out split of segmentation studies: 50% of the dataset is used as the training set for the supervised comparative studies and to determine the threshold for the weakly supervised methods, and the other 50% is used as an independent test set for quantitative analysis. (2) For evaluating the performance enhancement of supervised segmentation models, 5-fold cross-validation is used to provide a more stable performance comparison. The segmentation dataset is therefore split into 5 folds: in each round, one fold is used as the test set and the other four comprise the training set; 5 images from the training set are separated for validation and optimisation.
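A minimal sketch of these two splits using scikit-learn, with the 100 annotated slices indexed 0–99; the random seeds and the choice of which 5 training images become the validation set are assumptions.

```python
from sklearn.model_selection import KFold, train_test_split

image_ids = list(range(100))   # the 100 pixel-annotated CT slices

# (1) 50% / 50% hold-out split for the anomaly-localisation evaluation.
loc_train, loc_test = train_test_split(image_ids, test_size=0.5, random_state=0)

# (2) 5-fold cross-validation for the segmentation-enhancement experiments,
#     holding back 5 training images per fold for validation.
for fold, (train_idx, test_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(image_ids)):
    train_ids, val_ids = train_idx[:-5], train_idx[-5:]
    print(f"fold {fold}: {len(train_ids)} train / {len(val_ids)} val / {len(test_idx)} test")
```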

2.4.3. Preprocessing

We performed lung segmentation to isolate the lung region and provide high-quality input to WVAE and the comparative methods. For the main dataset, images were first converted to grey-scale and then cropped to the body region. Lung segmentation was performed with a 2D U-Net [13] trained on the NSCLC dataset [14] and tested on the MSD dataset [15]. Our lung segmentation model achieved 0.936 Dice, 95.0% sensitivity, 99.3% specificity and 0.012 MAE on the independent MSD dataset. The lung segmentation masks generated by the neural network are refined with an automated algorithm that combines extraction of the left and right lungs with the removal of small objects and of objects close to the image border. For the segmentation dataset, the lung segmentation masks provided by Hofmanninger, et al. [16] are included with the dataset; these masks are overlaid upon the corresponding images to isolate the lung regions. The resolution of slices in both datasets is 512 by 512 pixels; we resized all slices to 224 by 224 pixels to reduce computational cost. Histogram stretching was then performed through min-max scaling on each image to enhance contrast. The entire preprocessing procedure is independent for each image.
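A minimal sketch of the per-slice preprocessing for the segmentation dataset (mask overlay, resizing to 224 × 224, min-max scaling); the use of OpenCV and the interpolation mode are assumptions, and the grey-scale conversion and body cropping applied to the main dataset are omitted.

```python
import numpy as np
import cv2

def preprocess(slice_img, lung_mask):
    """Isolate the lung region with the provided mask, resize 512x512 -> 224x224,
    then apply histogram stretching via min-max scaling. Each slice is
    processed independently."""
    img = slice_img.astype(np.float32) * (lung_mask > 0)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_AREA)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-7)
```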

3. Results and discussion

3.1. Qualitative analysis of anomaly localisation

We performed anomaly localisation on the available datasets with the variational autoencoder structure and attention-map generation technique introduced in Sections 2.1 and 2.2. The neural network architecture for WVAE was chosen based on the optimal validation loss during hyper-parameter optimisation. The model has a shallow architecture, with two Conv2D layers in the encoder, each with 4-by-4 pixel kernels and a stride of 2. The decoder is built from transposed Conv2D layers mirroring the structure of the encoder, followed by a sigmoid activation. A set of 16 latent distributions provides 16 element-wise attention maps in the application of the gradient-based technique. The model is trained with the Adam optimiser with a learning rate of 1 × 10⁻³.
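A minimal sketch of a backbone matching this description (two stride-2, 4 × 4 Conv2D encoder layers, a mirrored transposed-convolution decoder with a sigmoid output, and 16 latent distributions); the channel widths, ReLU activations and fully connected bottleneck layout are assumptions.

```python
import torch
import torch.nn as nn

class WVAEBackbone(nn.Module):
    """Shallow VAE backbone for 224x224 single-channel CT slices."""

    def __init__(self, latent_dim=16, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, ch, 4, stride=2, padding=1), nn.ReLU(),        # 224 -> 112
            nn.Conv2d(ch, 2 * ch, 4, stride=2, padding=1), nn.ReLU())   # 112 -> 56
        self.feat_shape = (2 * ch, 56, 56)
        feat_dim = 2 * ch * 56 * 56
        self.fc_mu = nn.Linear(feat_dim, latent_dim)
        self.fc_logvar = nn.Linear(feat_dim, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, feat_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        h = torch.flatten(self.encoder(x), 1)
        mu, log_var = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterise
        h_dec = self.fc_dec(z).view(-1, *self.feat_shape)
        return self.decoder(h_dec), mu, log_var
```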

Example anomaly localisation results for a few latent distributions are shown in Fig. 2. The latent distributions show varying degrees of focus on the same anomaly regions. For images with COVID-19 infections, the regions of attention on the aggregated attention map are consistent with the pathology of COVID-19 and encompass areas of ground-glass opacities and consolidations. We attribute this to the aggregation process, which combines information and reduces noise from the attention maps of the individual latent distributions to produce better anomaly localisation. For images from healthy controls, our model mainly localised the pulmonary arteries and veins. Although the model cannot differentiate these anatomical structures from lung infection regions, clinicians can easily recognise them.

For a qualitative comparison between reconstruction (rc), restoration (rs) and gradient-based (g) techniques [8], we construct VAEs and implement each of these techniques. The segmentation maps generated for four random image samples with ground truth segmentation masks are shown in Fig. 3. Visual inspection indicates that the gradient-based technique provides better overall segmentation results. For images with COVID-19 infections, the gradient-based technique provides more definitive boundaries than reconstruction, while the restoration technique fails to localise infection regions. For images from healthy controls, the gradient-based technique produces more focused localisations on the blood vessels, whereas the reconstruction and restoration techniques produce more scattered, tiny regions of incorrect localisation. The incorrect localisations produced by the gradient-based technique are significantly easier to distinguish in practice because of their link to anatomy. This property is essential in preventing additional workload for clinicians, a common flaw amongst neural-network-based diagnostic aids.

Fig. 3.

Qualitative comparison of anomaly localisation results for reconstruction, restoration and gradient-based methods for VAEs.

3.2. Quantitative evaluation of anomaly localisation

For a quantitative analysis of the proposed model’s anomaly localisation performance, we extend the application to the segmentation of infected lung regions, as described in Section 2.3. We applied a hyper-intensity prior of the 0.9 quantile and performed median filtering on the residual or post-hoc attention maps. For weakly supervised methods, we transformed the attention maps into binary segmentation outputs using the threshold corresponding to the highest Dice score on the segmentation training set. For quantitative evaluation, we use the Dice coefficient (Dice), sensitivity (SEN), specificity (SPE) and mean absolute error (MAE) to compare the weakly supervised segmentation to established and state-of-the-art segmentation methods. We also include the best Dice coefficient [Dice], calculated through greedy-search optimisation of the threshold on the test set. All metrics are calculated within the segmented lung volume, and all results are calculated on the segmentation test set of COVID-19 positive images to provide metrics comparable to segmentation studies.
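A minimal sketch of these metrics computed within the segmented lung volume, assuming binary prediction, ground-truth and lung masks as NumPy arrays.

```python
import numpy as np

def segmentation_metrics(pred, gt, lung_mask):
    """Dice, sensitivity, specificity and MAE restricted to the lung region."""
    p = pred[lung_mask > 0].astype(bool)
    g = gt[lung_mask > 0].astype(bool)
    tp = np.logical_and(p, g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()
    tn = np.logical_and(~p, ~g).sum()
    eps = 1e-7
    return {"Dice": 2 * tp / (2 * tp + fp + fn + eps),
            "SEN": tp / (tp + fn + eps),
            "SPE": tn / (tn + fp + eps),
            "MAE": np.abs(p.astype(float) - g.astype(float)).mean()}
```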

The quantitative comparison between the reconstruction, restoration and gradient-based techniques for variational autoencoders can be found in Table 2. Our model WVAE achieved 0.559 Dice, 0.596 [Dice], 66.7% SEN, 96.2% SPE and 0.071 MAE. Both WVAE and the gradient-based technique on the standard VAE significantly outperform the reconstruction and restoration techniques. Integrating the context-encoding architecture showed no qualitative difference in localisation compared to the standard variational autoencoder counterpart; however, the context-encoding architecture showed higher quantitative performance than the standard variational autoencoder.

Table 2.

Quantitative comparison of weakly-supervised anomaly localisation techniques for variational autoencoders.

Methods DICE [DICE] SEN SPE MAE
VAE (rc) 0.305 0.349 0.286 0.769 0.358
VAE (rs) 0.130 0.162 0.124 0.632 0.498
VAE (g) 0.541 0.581 0.638 0.962 0.073
WVAE (g) 0.559 0.596 0.667 0.962 0.071

Using a similar approach to Yao, et al. [7], we compare WVAE with a range of weakly-supervised anomaly localisation methods based on different models, including the basic autoencoder, the context encoder [17], the constrained VAE, the context-encoding VAE [10], f-AnoGAN [18], AnoVAEGAN [19], the Gaussian mixture VAE [20] and a variant of the Gaussian mixture VAE with a spatial bottleneck [21]. The quantitative results for these anomaly localisation methods can be found in the bottom section of Table 3, and example images can be found in Fig. 4. The gradient-based method with the standard variational autoencoder, VAE (g), outperforms all comparative weakly supervised anomaly localisation methods. This finding is further validated by the higher performance of the two gradient-based methods in all quantitative metrics. Our model, WVAE, outperforms VAE (g) in all metrics.

Table 3.

Quantitative results - supervised and weakly-supervised.

Models DICE [DICE] SEN SPE MAE
U-Net [13] 0.534 0.624 0.561 0.952 0.072
U-Net++ [22] 0.554 0.644 0.607 0.950 0.071
FPN [23] 0.466 0.573 0.544 0.939 0.086
LinkNet [24] 0.500 0.591 0.565 0.938 0.083
PSPNet [25] 0.462 0.579 0.538 0.942 0.084
Inf-Net [5] 0.674 0.729 0.655 0.975 0.049
NormNet [7] 0.698 0.633
Mini-Seg [26] 0.730 0.732 0.753 0.972 0.041
AE (rc) 0.428 0.446 0.449 0.817 0.291
CE (rc) 0.384 0.403 0.379 0.822 0.304
Constrained VAE (rc) 0.418 0.460 0.407 0.840 0.282
Context VAE (rc) 0.205 0.243 0.176 0.806 0.362
GMVAE (rc) 0.097 0.136 0.079 0.889 0.345
GMVAE spatial (rc) 0.096 0.124 0.094 0.626 0.512
GMVAE spatial (rs) 0.067 0.109 0.060 0.889 0.361
f-AnoGAN 0.227 0.262 0.253 0.655 0.458
AnoVAEGAN 0.299 0.332 0.290 0.749 0.379
VAE (rc) 0.305 0.349 0.286 0.769 0.358
VAE (rs) 0.130 0.162 0.124 0.632 0.498
VAE (g) 0.541 0.581 0.638 0.962 0.073
WVAE (g) 0.559 0.596 0.667 0.962 0.071

Fig. 4.

Qualitative comparison of anomaly localisation for the models with best Dice [Dice] > 0.2. The same example images (three with COVID-19 infection and three from healthy controls) are used in this figure. The reconstruction (rc), restoration (rs) or gradient-based (g) method used for each model is indicated. The results of our model (WVAE) and the ground truth are included in the final two columns.

We also compared the performance of the weakly supervised WVAE to supervised segmentation models. These models are constrained to the task of segmenting lung infection regions in COVID-19 positive CT slices. Therefore, no slices of healthy controls are involved in the training or evaluation of these models.

These supervised segmentation models include both established models and COVID-19 specific state-of-the-art segmentation models. For the established models, we trained and evaluated U-Net [13], U-Net++ [22], Feature Pyramid Network (FPN) [23], LinkNet [24] and Pyramid Scene Parsing Network (PSPNet) [25].2 Apart from these, there are also models designed specifically for COVID-19 segmentation tasks, including Inf-Net [5], NormNet [7] and MiniSeg [26].3

All models apart from NormNet are trained and evaluated on our 5-fold cross-validation split of the same CT segmentation dataset. The segmentation results of these models can be found in the top section of Table 3. Although WVAE does not match the state-of-the-art supervised segmentation methods (Inf-Net, NormNet and MiniSeg), it still outperforms FPN, LinkNet and PSPNet, and achieves performance comparable to U-Net and U-Net++. For the supervised segmentation models, exposure to the target information contained within the COVID-19 positive slices and ground truth segmentation masks provides an unfair advantage over the weakly-supervised models, which are trained only with healthy control slices. Therefore, the ability of WVAE to outperform or match some of these supervised models further validates the efficacy of the proposed model for anomaly localisation. This efficacy provides the basis for the enhancement of supervised segmentation models.

3.3. WVALE to enhance supervised segmentation

For the enhancement of supervised segmentation, we generate a pseudo segmentation dataset from the main dataset using the aggregated attention maps generated by WVAE. To ensure the independence of the pseudo dataset, the threshold applied to generate segmentation outputs is based only on a 0.9 hyper-intensity prior. Selecting only the samples from COVID-19 patients, this process provides an additional 148 images with pseudo pixel-level annotations.
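A minimal sketch of generating one pseudo mask from an aggregated attention map with the 0.9 hyper-intensity prior; interpreting the prior as the 0.9 quantile of attention values within the lung region is an assumption about the exact implementation.

```python
import numpy as np

def pseudo_mask(attention_map, lung_mask, q=0.9):
    """Threshold an aggregated attention map at the q-quantile of its values
    inside the lung region to obtain a binary pseudo segmentation mask."""
    thr = np.quantile(attention_map[lung_mask > 0], q)
    return np.logical_and(attention_map > thr, lung_mask > 0).astype(np.uint8)
```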

Prior to the re-training procedure, we trained and evaluated a broader range of supervised segmentation models via 5-fold cross-validation. This range includes the previously mentioned established and COVID-19 specific models; NormNet [7] is not included due to the lack of source code. Transformer-based models that incorporate the attention mechanism are the current state-of-the-art technique for high-performing segmentation models.4 Therefore, for a broader picture of supervised segmentation models for the enhancement task, we also included TransUNet [32], SwinUNet [31], UTNet and ResNet-UTNet [30].

The established segmentation models are all trained for 50 epochs with the Dice loss and the Adam optimiser with a learning rate of 1 × 10⁻³. For the COVID-19 specific models Inf-Net and MiniSeg, we use the same hyperparameters, training and inference procedures as the original studies and their corresponding source code. For the transformer-based models, the default hyper-parameters and training settings did not lead to adequate learning for all models; the hyper-parameters for each transformer-based model are therefore manually tuned on a validation set split from the training set of each fold. These hyper-parameters are then used as the base settings for the later enhancement.

The performance of these supervised segmentation models is shown in Table 4, providing a baseline for WVALE enhancement with either the pre-training or the re-training method. These models are also used as the basis for re-training, following the procedures described in Section 2.3. The segmentation results for the pre-trained (P) and re-trained (R) models are shown in Tables 5 and 6, respectively.

Table 4.

Performance of supervised segmentation models in 5-fold cross-validation.

Models DICE [DICE] SEN SPE MAE
FPN [23] 0.434 0.577 0.516 0.950 0.079
PAN [27] 0.496 0.611 0.567 0.949 0.074
DeepLabV3 [28] 0.479 0.599 0.520 0.954 0.074
DeepLabV3+ [29] 0.507 0.611 0.610 0.941 0.080
PSPNet [25] 0.553 0.646 0.573 0.961 0.065
LinkNet [24] 0.556 0.654 0.628 0.958 0.065
ResNet-UTNet [30] 0.592 0.671 0.650 0.966 0.058
U-Net [13] 0.633 0.718 0.675 0.969 0.052
U-Net++ [22] 0.643 0.720 0.667 0.972 0.051
SwinUNet [31] 0.653 0.730 0.684 0.969 0.049
UTNet [30] 0.678 0.747 0.665 0.982 0.042
TransUNet [32] 0.690 0.744 0.703 0.980 0.044
Inf-Net [5] 0.688 0.754 0.708 0.976 0.043
Mini-Seg [26] 0.735 0.784 0.815 0.967 0.041

Table 5.

Performance of WVALE pre-trained supervised segmentation models in 5-fold cross-validation

Models DICE [DICE] SEN SPE MAE
FPN (P) 0.452 0.591 0.525 0.969 0.958
PAN (P) 0.426 0.589 0.492 0.971 0.958
DeepLabV3 (P) 0.430 0.583 0.494 0.969 0.960
DeepLabV3+ (P) 0.469 0.609 0.538 0.967 0.957
PSPNet (P) 0.551 0.681 0.612 0.974 0.968
LinkNet (P) 0.542 0.647 0.634 0.975 0.958
ResNet-UTNet (P) 0.607 0.676 0.626 0.978 0.055
U-Net (P) 0.635 0.717 0.690 0.976 0.971
U-Net++ (P) 0.666 0.730 0.722 0.983 0.971
SwinUNet (P) 0.656 0.730 0.724 0.970 0.049
UTNet (P) 0.701 0.753 0.693 0.981 0.043
TransUNet (P) 0.659 0.715 0.649 0.982 0.048
Inf-Net (P) 0.703 0.762 0.727 0.979 0.041
MiniSeg (P) 0.740 0.809 0.868 0.967 0.038

Blue indicates an increase and red a decrease in performance via pre-training.

Table 6.

Performance of WVALE re-trained supervised segmentation models in 5-fold cross-validation

Models DICE [DICE] SEN SPE MAE
FPN (R) 0.553 0.693 0.612 0.969 0.049
PAN (R) 0.567 0.702 0.611 0.971 0.047
DeepLabV3 (R) 0.563 0.708 0.608 0.969 0.047
DeepLabV3+ (R) 0.597 0.718 0.665 0.967 0.047
PSPNet (R) 0.632 0.753 0.681 0.974 0.040
LinkNet (R) 0.685 0.777 0.739 0.975 0.036
ResNet-UTNet (R) 0.630 0.696 0.684 0.973 0.053
U-Net (R) 0.735 0.805 0.794 0.976 0.032
U-Net++ (R) 0.791 0.838 0.834 0.983 0.025
SwinUNet (R) 0.654 0.733 0.697 0.974 0.046
UTNet (R) 0.697 0.754 0.678 0.983 0.042
TransUNet (R) 0.659 0.720 0.656 0.983 0.047
Inf-Net (R) 0.693 0.755 0.728 0.979 0.042
MiniSeg (R) 0.750 0.797 0.857 0.966 0.041

Blue indicates an increase and red a decrease in performance via re-training.

From Table 5 we can observe clear performance improvements in all metrics for several segmentation models, including FPN, SwinUNet, and especially the COVID-19 specific SOTA models Inf-Net and MiniSeg. For models such as U-Net++, UTNet and ResNet-UTNet, an increase in DICE and [DICE] is evident, with a clear trade-off between sensitivity and specificity. This trade-off is also present for models such as PAN, DeepLabV3, DeepLabV3+ and TransUNet, where it is accompanied by a decrease in DICE and [DICE]. These observations validate WVALE pre-training as a potential enhancement method. While the approach may not provide an overall performance increase for every method, it can also be used to shift performance towards specific metrics, e.g. higher specificity for better isolation of local infection regions for further analysis.

For the re-training approach, Table 6 shows an overall improvement in all metrics for most models apart from TransUNet and MiniSeg, and the decrease in specificity for MiniSeg is below 0.0001. This improvement of the re-trained models over the original models in 5-fold cross-validation confirms the benefit of the additional pseudo segmentation samples generated from the main dataset. The results demonstrate the enhancement capability of our WVALE framework for established models like U-Net++, transformer-based segmentation models like UTNet, and COVID-19 specific SOTA segmentation models like Inf-Net. The enhancement can push established models to SOTA performance: the highest performing model overall after enhancement is U-Net++, with 0.791 Dice, 0.838 [Dice], 83.4% SEN, 98.3% SPE and 0.025 MAE. This re-trained U-Net++ achieved the highest DICE, [DICE] and SPE and the lowest MAE of all baseline, pre-trained and re-trained segmentation models, further validating the effectiveness of this approach. The WVALE framework, with the weakly-supervised WVAE model and the pre-training & re-training approaches, can easily be extended to other datasets.

4. Conclusions and future work

This study proposes a deep learning framework, WVALE, for the weakly supervised localisation of COVID-19 lung infection regions and for the performance enhancement of supervised segmentation models. The proposed WVAE achieved high qualitative and quantitative performance in the weakly supervised localisation of lung infection regions, outperforming or matching a range of supervised segmentation methods. The proposed WVALE framework has shown the capability to enhance the performance of both conventional and state-of-the-art COVID-19 segmentation models through the generation and utilisation of pseudo annotations. Our approaches can potentially be integrated into most existing supervised segmentation or classification frameworks. In the future, we will attempt to build ad hoc explainability into the training process to generate inherently explainable neural networks and utilise them to further improve the performance of supervised segmentation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This paper is partially supported by the Royal Society International Exchanges Cost Share Award, UK (RP202G0230); Medical Research Council Confidence in Concept Award, UK (MC_PC_17171); Hope Foundation for Cancer Research, UK (RM60G0680); British Heart Foundation Accelerator Award, UK (AA/18/3/34220); Sino-UK Industrial Fund (RP202G0289); Global Challenges Research Fund (GCRF), UK (P202PF11); LIAS Pioneering Partnerships Award, UK (P202ED10); and the Data Science Enhancement Fund, UK (P202RE237).

Footnotes

1

The source segmentation dataset can be found via http://medicalsegmentation.com/covid19/.

2

For segmentation models, we implemented the open-source package found in https://github.com/qubvel/segmentation_models.pytorch

3

For COVID-19 specific segmentation models, we implemented the open-source packages of Inf-Net (https://github.com/DengPingFan/Inf-Net) and MiniSeg (https://github.com/yun-liu/MiniSeg).

4

For transformer-based segmentation models, we implemented the open-source package found in https://github.com/yhygao/UTNet

Appendix A

CT imaging for the main dataset was performed with a Philips Ingenuity 64-row spiral CT machine: kV 120, mAs 240, layer thickness 3 mm, layer spacing 3 mm, pitch 1.5. Lung window (W: 1500, L: 500), mediastinum window (W: 350, L: 60); thin-layer reconstruction was performed according to the lesion display, with a layer thickness and layer spacing of 1 mm. The patients were placed supine, held their breath after a deep inhalation, and were scanned conventionally from the lung apex to the costophrenic angle.

References

  • 1. Fang Y., Zhang H., Xie J., Lin M., Ying L., Pang P., Ji W. Sensitivity of chest CT for COVID-19: comparison to RT-PCR. Radiology. 2020;296(2):E115–E117. doi: 10.1148/radiol.2020200432.
  • 2. Wang L., Lin Z.Q., Wong A. COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci. Rep. 2020;10(1):1–12. doi: 10.1038/s41598-020-76550-z.
  • 3. Harmon S.A., Sanford T.H., Xu S., Turkbey E.B., Roth H., Xu Z., Yang D., Myronenko A., Anderson V., Amalou A. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets. Nat. Commun. 2020;11(1):1–7. doi: 10.1038/s41467-020-17971-2.
  • 4. Oh Y., Park S., Ye J.C. Deep learning COVID-19 features on CXR using limited training data sets. IEEE Trans. Med. Imaging. 2020;39(8):2688–2700. doi: 10.1109/TMI.2020.2993291.
  • 5. Fan D.-P., Zhou T., Ji G.-P., Zhou Y., Chen G., Fu H., Shen J., Shao L. Inf-Net: automatic COVID-19 lung infection segmentation from CT images. IEEE Trans. Med. Imaging. 2020;39(8):2626–2637. doi: 10.1109/TMI.2020.2996645.
  • 6. Wang X., Deng X., Fu Q., Zhou Q., Feng J., Ma H., Liu W., Zheng C. A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT. IEEE Trans. Med. Imaging. 2020;39(8):2615–2625. doi: 10.1109/TMI.2020.2995965.
  • 7. Yao Q., Xiao L., Liu P., Zhou S.K. Label-free segmentation of COVID-19 lesions in lung CT. IEEE Trans. Med. Imaging. 2021;40(10):2808–2819. doi: 10.1109/TMI.2021.3066161.
  • 8. Baur C., Denner S., Wiestler B., Navab N., Albarqouni S. Autoencoders for unsupervised anomaly segmentation in brain MR images: a comparative study. Med. Image Anal. 2021;69:101952. doi: 10.1016/j.media.2020.101952.
  • 9. W. Liu, R. Li, M. Zheng, S. Karanam, Z. Wu, B. Bhanu, R.J. Radke, O. Camps, Towards visually explaining variational autoencoders, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8642–8651.
  • 10. D. Zimmerer, S.A. Kohl, J. Petersen, F. Isensee, K.H. Maier-Hein, Context-encoding variational autoencoder for unsupervised anomaly detection, arXiv preprint arXiv:1812.05941, 2018.
  • 11. Biffi C., Cerrolaza J.J., Tarroni G., Bai W., De Marvao A., Oktay O., Ledig C., Le Folgoc L., Kamnitsas K., Doumou G. Explainable anatomical shape analysis through deep hierarchical generative models. IEEE Trans. Med. Imaging. 2020;39(6):2088–2099. doi: 10.1109/TMI.2020.2964499.
  • 12. R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
  • 13. O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 234–241.
  • 14. Aerts H., Velazquez E.R., Leijenaar R.T., Parmar C., Grossmann P., Cavalho S., Bussink J., Monshouwer R., Haibe-Kains B., Rietveld D. Data from NSCLC-Radiomics. Cancer Imaging Arch. 2015. https://wiki.cancerimagingarchive.net/display/Public/NSCLC-Radiomics
  • 15. M. Antonelli, A. Reinke, S. Bakas, K. Farahani, B.A. Landman, G. Litjens, B. Menze, O. Ronneberger, R.M. Summers, B. van Ginneken, et al., The medical segmentation decathlon, arXiv preprint arXiv:2106.05735, 2021.
  • 16. Hofmanninger J., Prayer F., Pan J., Röhrich S., Prosch H., Langs G. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. Eur. Radiol. Exp. 2020;4(1):1–13. doi: 10.1186/s41747-020-00173-2.
  • 17. D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A.A. Efros, Context encoders: feature learning by inpainting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544.
  • 18. Schlegl T., Seeböck P., Waldstein S.M., Langs G., Schmidt-Erfurth U. f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 2019;54:30–44. doi: 10.1016/j.media.2019.01.010.
  • 19. C. Baur, B. Wiestler, S. Albarqouni, N. Navab, Deep autoencoding models for unsupervised anomaly segmentation in brain MR images, in: International MICCAI Brainlesion Workshop, Springer, pp. 161–169.
  • 20. B. Zong, Q. Song, M.R. Min, W. Cheng, C. Lumezanu, D. Cho, H. Chen, Deep autoencoding Gaussian mixture model for unsupervised anomaly detection, in: International Conference on Learning Representations.
  • 21. S. You, K.C. Tezcan, X. Chen, E. Konukoglu, Unsupervised lesion detection via image restoration with a normative prior, in: International Conference on Medical Imaging with Deep Learning, PMLR, pp. 540–556.
  • 22. Zhou Z., Siddiquee M.M.R., Tajbakhsh N., Liang J. UNet++: a nested U-Net architecture for medical image segmentation. Springer; 2018. pp. 3–11.
  • 23. T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
  • 24. A. Chaurasia, E. Culurciello, LinkNet: exploiting encoder representations for efficient semantic segmentation, in: 2017 IEEE Visual Communications and Image Processing (VCIP), IEEE, pp. 1–4.
  • 25. Zhao H., Shi J., Qi X., Wang X., Jia J. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 2881–2890.
  • 26. Y. Qiu, Y. Liu, S. Li, J. Xu, MiniSeg: an extremely minimum network for efficient COVID-19 segmentation, arXiv preprint arXiv:2004.09750, 2020.
  • 27. H. Li, P. Xiong, J. An, L. Wang, Pyramid attention network for semantic segmentation, arXiv preprint arXiv:1805.10180, 2018.
  • 28. L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587, 2017.
  • 29. Chen L.-C., Zhu Y., Papandreou G., Schroff F., Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. pp. 801–818.
  • 30. Gao Y., Zhou M., Metaxas D.N. UTNet: a hybrid transformer architecture for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2021. pp. 61–71.
  • 31. H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, Swin-Unet: Unet-like pure transformer for medical image segmentation, arXiv preprint arXiv:2105.05537, 2021.
  • 32. J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A.L. Yuille, Y. Zhou, TransUNet: transformers make strong encoders for medical image segmentation, arXiv preprint arXiv:2102.04306, 2021.
