Cell Reports Methods. 2024 Jan 25;4(2):100695. doi: 10.1016/j.crmeth.2024.100695

A 3D lung lesion variational autoencoder

Yiheng Li 1,4, Christoph Y Sadée 1, Francisco Carrillo-Perez 1,2, Heather M Selby 1, Alexander H Thieme 1, Olivier Gevaert 1,3
PMCID: PMC10921017  PMID: 38278157

Summary

In this study, we develop a 3D beta variational autoencoder (beta-VAE) to advance lung cancer imaging analysis, countering the constraints of conventional radiomics methods. The autoencoder extracts information from public lung computed tomography (CT) datasets without additional labels. It reconstructs 3D lung nodule images with high quality (structural similarity: 0.774, peak signal-to-noise ratio: 26.1, and mean-squared error: 0.0008). The model effectively encodes lesion sizes in its latent embeddings, with a significant correlation with lesion size found after applying uniform manifold approximation and projection (UMAP) for dimensionality reduction. Additionally, the beta-VAE can synthesize new lesions of varying sizes by manipulating the latent features. The model can predict multiple clinical endpoints, including pathological N stage or KRAS mutation status, on the Stanford radiogenomics lung cancer dataset. Comparisons with other methods show that the beta-VAE performs equally well in these tasks, suggesting its potential as a pretrained model for predicting patient outcomes in medical imaging.

Keywords: Computed tomography (CT), lung cancer imaging, beta variational-autoencoder (beta-VAE), 3D lung nodule synthesis, radiomics, self-supervised learning

Graphical abstract


Motivation

Conventional radiomics approaches are hindered by their predefined nature and a lack of adaptability to different applications. Furthermore, the supervised convolutional neural network model (CNN) often used relies heavily on labeled data. Here, we developed a 3D beta variational autoencoder (beta-VAE) to advance lung cancer imaging analysis that efficiently reconstructs and encodes lung nodule images from unlabeled CT datasets. This unsupervised model can capture features relevant for multiple downstream tasks and has potential as a pretrained model for predicting patient outcomes in medical imaging.

Highlights

  • A 3D beta-VAE reconstructs lung lesion volumes with high accuracy

  • The model encodes lesion size in its latent embeddings

  • The beta-VAE effectively synthesizes lesions of varying sizes

  • It predicts clinical endpoints like N stage and KRAS mutations


In this study, Li et al. present a 3D beta variational autoencoder (beta-VAE) that reconstructs lung nodule images from CT scans and predicts key clinical outcomes in lung cancer. The model paves the way for enhanced patient outcome prediction.

Introduction

Computed tomography (CT) of the chest is the cornerstone of lung cancer imaging both for diagnosis and to determine further treatment.1 A chest CT scan with intravenous contrast enhancement is the gold standard in the diagnosis of pulmonary lesions. Moreover, CT imaging, with increased availability as well as clinical and economic effectiveness, has become a standard in the diagnosis and staging of lung cancer.2 In addition, CT imaging has also played an important role in the field of building statistical models for predicting disease-related outcomes like overall survival (OS) and progression-free survival (PFS).3,4

More recently, machine learning applications in medical imaging analysis have shown the potential of extracting information from radiographic images and making accurate predictions of patient outcomes. In particular, the field of quantitative imaging has seen rapid growth in the past decade, ranging from predefined image feature extraction, also known as radiomics, to more recent deep learning modeling.5,6 Several studies in recent years have used radiomics features of CT in the lung cancer domain, including phenotype discovery7 and malignancy prediction.8 More recently, deep learning has proven successful in imaging tasks and has been applied to medical imaging, with a growing number of studies training end-to-end convolutional neural networks (CNNs). Successful examples include LungNet, which predicts the prognosis of patients with lung cancer,6 predictions of epidermal growth factor receptor (EGFR) or KRAS mutations of patients with lung cancer using CT,9 and the detection of COVID-19 using an ensemble model of CNNs.10

However, these methods have their limitations. Predefined radiomics features are easier to implement and do not require huge datasets,7 but their performance can be limited by their predefined nature and lack of adaptability to different downstream applications. Although radiomics pipelines include automatic feature selection for relevant tasks, the scope for improvement is restricted. End-to-end CNN models, in contrast, rely heavily on labeled data and are typically trained for one particular problem or downstream task and, as such, are hard to generalize to other endpoints.

Here, we propose a different approach to represent CT imaging data for patients with lung cancer using an unsupervised deep learning model. We developed a variational autoencoder (VAE)11 to generate CT imaging lung lesion embeddings in an unsupervised manner, trained on multi-cohort, multi-vendor, and multi-institutional CT lung image data. This beta-VAE model, a variant of the traditional VAE model, can be trained without any labels, but at the same time, we show that this beta-VAE can capture relevant features for several downstream tasks.12,13 More specifically, we show that by leveraging large public datasets and maximizing dataset heterogeneity with combined training, we can predict several downstream tasks. In addition, we show that this beta-VAE can be used to generate synthetic data in which we can control lesion size. This experiment shows the potential of synthesizing lung lesion data with the VAE without being hindered by constraints such as the privacy issues involved in collecting real patient data, which is very beneficial for data-hungry fields like deep learning research.

Results

Multi-institutional data and workflow

The beta-VAE model consists of two major parts, the encoder and the decoder. Between the two, there is a bottleneck layer, where the condensed information is extracted (Figures 1A–1C). The model is trained to reconstruct the input 3D volume using three different datasets: LIDC,14 LNDb,15 and a Stanford dataset16 (Table 1). LIDC and LNDb are public datasets with larger numbers of images and lung lesions, while the Stanford dataset is smaller than the other two and contains purely malignant lung cancer lesions. We used three metrics to track model performance: the structural similarity index measure (SSIM), the mean squared error (MSE), and the peak signal-to-noise ratio (PSNR). Additionally, the middle slices were plotted to qualitatively illustrate the models' reconstruction performance.
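These three reconstruction metrics can be sketched in NumPy. This is an illustrative implementation, not the paper's code: `global_ssim` uses global image statistics for brevity, whereas the full SSIM (as computed by libraries such as scikit-image) averages a locally windowed version over the volume.

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two volumes of equal shape."""
    return float(np.mean((x - y) ** 2))

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the intensity span."""
    return float(10 * np.log10(data_range ** 2 / mse(x, y)))

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM using global statistics (the standard SSIM
    averages a windowed local version of this formula)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2)) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

A perfect reconstruction yields MSE of 0 and SSIM of 1, matching the direction of the scores reported above (higher SSIM/PSNR, lower MSE is better).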

Figure 1.

Figure 1

3D lung lesion beta-VAE training pipeline

(A) The pipeline of the reconstruction for 3D lung lesions. It goes through an encoder for information condensation. After the reparameterization trick, it is decoded back to a reconstructed 3D image.

(B) The encoder consists of four 3D convolutional blocks, and in each block, the spatial size of the image is halved, while the number of channels is increased (in the illustration, the width of the blocks denotes the number of channels, which is also shown by the digit in the block; the height of the block denotes the spatial size of the image, which decreases at each block). Each 3D convolutional block consists of 3 layers: first, a 3D convolutional layer, then a batch normalization layer, and finally, a leaky rectified linear unit (ReLU) activation layer.

(C) After the training, the latent random variable was extracted and concatenated as the deep features used for downstream task prediction. The model used for the downstream tasks is an XGBoost model.

(D) Visualization of the latent embeddings extracted from the beta-VAE, with lesion volume encoded by dot size (in mm3). The dimensionality reduction was performed by the UMAP (uniform manifold approximation and projection) algorithm.

(E) Tumor volumes of the 36 randomly sampled patches from the Stanford dataset (“original”), as well as of the synthetic images generated from the same batch by either subtracting the difference vector (“shrunk”) or adding the difference vector (“enlarged”).

Table 1.

Summary of the datasets used in this study: the LIDC, LNDb, and Stanford cohorts

Dataset | LIDC | LNDb | Stanford
No. of participants | 1,010 | 294 | 143
No. of lesions | 2,088 | 1,519 | 143
Type of lung lesions | benign and malignant lung lesions | benign and malignant lung lesions | malignant lung cancer lesions

Model evaluation and image reconstruction

To obtain the optimal beta-VAE model, we performed several experiments to assess the models’ performance in reconstructing the 3D lesion CT images. When selecting the best model, our primary performance metric was the reconstruction SSIM, assessed on the test portion, comprising 20% of the Stanford cohort. We compared models trained on each of the cohorts separately and combined. Notably, the model trained on all datasets combined emerged as superior (Figure 1D).

To further validate our findings, we employed a 5-fold cross-validation on the Stanford dataset by retraining the latter two models and validating on different splits of the Stanford dataset. The cross-validation results (Table S1) underscore the robustness of the beta-VAE model trained on the combined datasets (LIDC, LNDb, and Stanford), achieving a mean SSIM of 0.774 with a standard deviation of 0.021, along with an MSE of 7.27e−4 (±1.05e−4) and a PSNR of 26.246 (±0.678). This model significantly outperforms those trained on individual datasets, as evidenced by higher mean SSIM values and lower standard deviations, indicating consistent performance across different data folds. As for fine-tuning, the model’s performance remained consistent and did not improve when fine-tuned on the training portion (80%) of the Stanford dataset (Figure S1).
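The fold assignment for such cross-validation can be sketched as follows; `patient_kfold` is a hypothetical helper, shown only to illustrate splitting at the patient level so that all lesions from one patient land in the same fold:

```python
import numpy as np

def patient_kfold(patient_ids, k=5, seed=0):
    """Assign unique patient IDs to k folds; lesions are then routed to
    the fold of their patient, preventing patient-level leakage."""
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(patient_ids)))  # dedupe multi-lesion patients
    rng.shuffle(unique)
    folds = np.array_split(unique, k)
    return [set(f.tolist()) for f in folds]
```

Each of the k models is then trained on k−1 folds and validated on the held-out fold, and the metrics are reported as mean ± standard deviation across folds, as in Table S1.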

The optimal performance of the beta-VAE models was seen on the model trained on the combined dataset of the LIDC and LNDb as well as Stanford datasets. The 3D CT patch images in the test data of the Stanford dataset were selected to evaluate the model’s image reconstruction performance. The images were encoded to get the latent embeddings and then reconstructed by the decoder (Figure 1E). This model achieved SSIM: 0.7743 ± 0.1302, PSNR: 26.0986 ± 2.9140, and MSE: 0.0008 ± 0.0005.

Visualizations of lesion latent variables

To understand the latent variable and potential trends in the numeric encodings of the different lesions, we selected volume size as an important feature of the lesions and visualized the latent variables to see whether they relate to lesion volume. Latent embeddings were extracted from the Stanford dataset’s lesions, and their volumes were computed from the physical sizes of the manual segmentations of the lesions.

For the dimensionality reduction of the latent features, we employed the uniform manifold approximation and projection (UMAP) visualization technique.17 A scatterplot of the UMAP is shown in Figure 2B, with lesion volume encoded by dot size. A Spearman’s rank correlation test was conducted between the UMAP features and the lesion sizes, revealing a significant correlation with the size of the lesions (p < 0.001).
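The rank-correlation step can be illustrated with a minimal NumPy sketch. In practice a library routine (e.g., SciPy's Spearman test, which also returns a p value and handles ties) would be used, and the low-dimensional coordinates would come from a fitted UMAP model; this sketch assumes tie-free inputs.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks
    (no tie correction, for illustration only)."""
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)  # ranks of b
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

Applied per UMAP component against the lesion volumes, a rho near ±1 indicates a monotone relationship between that embedding axis and lesion size.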

Figure 2.

Figure 2

Synthesis and analysis of lesion images using beta-VAE: visualization, size encoding, and volume comparison

(A) Visualizations of the synthesized lesion images, using the beta-VAE trained on the LIDC, LNDb, and Stanford datasets. (i) Images of small lesions (left 3 columns) and large lesions (right 3 columns) were selected from the Stanford dataset. These lesion embeddings were used to calculate the centroids, from which the difference vector was obtained by subtracting the centroid of the small lesions from the centroid of the large lesions. (ii) Visualizations of the 36 lesions with the smallest sizes in the Stanford dataset selected to be synthesized. (iii) Visualizations of the lesions synthesized using the trained beta-VAE by taking the latent embeddings of the lesions shown in (ii) and subtracting the difference vector calculated with the lesions whose examples are shown in (i) (“shrunk”). (iv) Visualizations of the lesions synthesized using the trained beta-VAE by taking the latent embeddings of the lesions shown in (ii) and adding the same difference vector (“enlarged”). In summary, the synthesized images (iii and iv) are generated by enlarging or shrinking the original lesion images (ii) with a difference vector, which is calculated by subtracting the centroid of a batch (36) of small lesions from that of a batch of large lesions.

(B) Visualization of the latent embeddings extracted from the beta-VAE, with lesion volume encoded by dot size (in mm3). The dimensionality reduction was performed by the UMAP algorithm.

(C) Tumor volumes of the 36 randomly sampled patches from the Stanford dataset (“original”), as well as of the synthetic images generated from the same batch by either subtracting the difference vector (“shrunk”) or adding the difference vector (“enlarged”).

To further validate our findings across different dimensionalities, we expanded our analysis to include a 3D UMAP visualization. This additional analysis can be found in Figure S2. In line with our efforts to ensure thoroughness, we also calculated the correlation coefficients for these 3D projections. Notably, the second component demonstrated a significant correlation with lesion size (p < 0.001), further reinforcing the validity of our original observations.

Synthesizing lesions of different sizes

To delve deeper into the inner workings of the VAE, we conducted an initial experiment concerning lesion size, generating new lesion images with varying sizes. Initially, we curated the smallest 5% and the largest 5% of lesions from the Stanford dataset (Figure 2Ai), generating their embeddings utilizing the beta-VAE model. The difference vector between the centroids of the embeddings of these two lesion groups was then computed.

Furthering this investigation, we carried out additional experiments employing varying scales of the difference vector (0.25, 0.5, 0.75, 1, 1.5, 2), aiming to discern the effect on the synthesized lesions. Selecting a random batch of lesions from the Stanford dataset (Figure 2Aii), whose embeddings were extracted similarly via the beta-VAE, we manipulated the size of lesions by adding or subtracting the scaled difference vectors, thereby either enlarging or shrinking the lesions (Figure 2Aiii and 2Aiv). This exploration is illustrated in Figures S3 and S4.
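The latent-space manipulation described above can be sketched as follows; the function names are hypothetical, and the embeddings would come from the trained encoder:

```python
import numpy as np

def size_difference_vector(small_emb, large_emb):
    """Difference vector pointing from the centroid of small-lesion
    embeddings to the centroid of large-lesion embeddings.
    Inputs are (n_lesions, latent_dim) arrays."""
    return large_emb.mean(axis=0) - small_emb.mean(axis=0)

def resize_in_latent_space(emb, diff, scale=1.0, enlarge=True):
    """Move embeddings along the size direction by a chosen scale
    (e.g., 0.25, 0.5, 0.75, 1, 1.5, 2); the decoder then reconstructs
    enlarged (add) or shrunken (subtract) lesions."""
    return emb + scale * diff if enlarge else emb - scale * diff
```

Decoding the shifted embeddings yields the synthetic images shown in Figures 2Aiii and 2Aiv; scales beyond 1 correspond to the extrapolation regime where the morphological anomalies appear.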

The results show a gradual alteration in lesion size during the interpolation process, contributing to a nuanced understanding of lesion size modulation. However, during extrapolation, the lesions encountered certain morphological anomalies. Enlarged lesions exhibited a “hollow” effect, where the centroids appeared less opaque, rendering them less realistic. Conversely, shrunken lesions, when extrapolated, tended to assume a more homogeneous, rounded shape, diverging from their original, more complex morphology.

We quantitatively compared the lesion volumes pre- and post-synthesis, as depicted in a boxplot (Figure 2C), which further substantiated our observations. This extended analysis not only underscores the capabilities of our approach in synthesizing lesions of different sizes but also sheds light on the morphological alterations and the limitations encountered during the process, enriching the discourse on the meticulous manipulation and analysis of synthetic lesion imagery.

Beta-VAE is capable of predicting several downstream tasks

To assess the models’ capability of encoding lesion information in their bottleneck layer, we utilized pathological and genomic phenotype data from the Stanford patients as prediction labels for the beta-VAE models. Using the previously trained models, predictions were performed on the Stanford dataset, where the latent embeddings of the Stanford lung cancer lesion images were recorded. These embeddings were standardized to a mean of 0 and a standard deviation of 1 and were then used as features to predict the labels.

The downstream tasks are mainly lesion properties including the AJCC score, EGFR mutation status, KRAS mutation status, lymphovascular invasion, pathological N stage, and pathological T stage (Table 2). We formulate the prediction of these labels as binary classification tasks. The beta-VAE is compared with two baseline models: an XGBoost classifier using PyRadiomics features (PyRadiomics model) and an end-to-end CNN model with the same architecture as the encoder of the beta-VAE (CNN model). Additionally, we explored the ability of the beta-VAE’s latent features to classify the same downstream tasks using an XGBoost classifier.
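The standardization and binary classification steps can be sketched as below. Note that the paper uses an XGBoost classifier on the latent features; this dependency-free sketch substitutes a tiny gradient-descent logistic regression purely as a stand-in.

```python
import numpy as np

def standardize(features, eps=1e-8):
    """Scale each latent dimension to zero mean, unit variance."""
    mu, sd = features.mean(axis=0), features.std(axis=0)
    return (features - mu) / (sd + eps)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Minimal logistic regression as a stand-in for the XGBoost
    classifier used in the paper (X: features, y: 0/1 labels)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability
        g = p - y                               # gradient of log loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def predict(X, w, b):
    """Threshold the predicted probability at 0.5."""
    return (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
```

In the actual pipeline, the standardized 1,024-dimensional embeddings replace the toy features, and per-task F1 scores are compared across models under cross-validation.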

Table 2.

Descriptions and clinical insights of the downstream task labels

Downstream task names Descriptions
AJCC score the AJCC Cancer Staging System is a standardized system for assessing the extent of cancer spread, with the intent of aiding in the planning of treatment and estimation of prognosis; it employs a TNM (tumor, node, metastasis) framework to provide detailed information on tumor size and invasion (T), lymph node involvement (N), and presence of distant metastasis (M)18
EGFR mutation state EGFR (epidermal growth factor receptor) mutations typically occur in exons 18–21 and are recognized as driver mutations in non-small cell lung cancer (NSCLC); grouping EGFR mutations by structure and function has been suggested as an effective framework for matching patients with NSCLC to appropriate targeted therapies19
KRAS mutation RAS (Kirsten rat sarcoma viral oncogene homolog) is a well-known oncogene with the highest mutation rate among all cancers, associated with several highly fatal cancers including pancreatic ductal adenocarcinoma, NSCLC, and colorectal cancer20
Lymphovascular invasion (LVI) LVI is defined as a condition where cancer cells have infiltrated or grown into small blood vessels or lymph vessels, indicating that the cancer cells have broken free of the primary tumor and have the potential to spread throughout the body; it is an independent risk factor for metastasis, recurrence, and mortality, where the presence of tumor cells within a definite endothelial-lined space (lymphatics or blood vessels) is observed, often in the breast surrounding invasive carcinoma21
Pathological N stage in the TNM system, the pathological N stage (pN) is derived from the resection and examination of at least the lower axillary lymph nodes (level I) and reflects the involvement of lymph nodes by cancer; the classification pN0 is used if the lymph nodes are negative but the ordinarily examined number is not met21,22
Pathological T stage the pathological T stage (pT) in the TNM staging system refers to the size of the tumor and how far it has grown into nearby tissues; it is part of the pathologic stage of cancer, which also considers the involvement of lymph nodes (N stage) and distant metastasis (M stage); the pathologic T stage is denoted with individual scores ranging from T0 to T423

The results show that for all comparisons, the beta-VAE performs equally well as a CNN model trained explicitly on each downstream task (Figure 3). The beta-VAE also performs equally well as radiomics features for all comparisons. The combined model, built by concatenating the PyRadiomics and beta-VAE feature vectors, likewise performs at a statistically similar level to the baseline models and the beta-VAE model, and it achieved the highest average F1 score among all models when predicting the pathological T stage. This emphasizes that the beta-VAE can capture relevant information about the lesions and further suggests that a beta-VAE trained in an unsupervised manner has potential as a general feature extractor for a wide range of downstream tasks.

Figure 3.

Figure 3

Boxplot comparison of F1 scores across four models on Stanford dataset: VAE, CNN, radiomics feature models, and combined radiomics-VAE features for multiple downstream labels

The beta-VAE models are built with the latent embeddings of the lesions extracted from the bottleneck layer as deep features, and XGBoost models are trained on those features to make predictions on each task. The CNNs are trained end to end using the Stanford dataset 3D images to predict the labels. The radiomics model is built by first extracting the radiomics features using the Python package PyRadiomics and then using XGBoost models to make predictions from the extracted features. The comparison is made by repeating each model using a 10-fold cross-validation strategy with stratification. t tests are performed to estimate the performance differences between the beta-VAE and the other two models. The “ns” annotation indicates that the two models have no significant difference in performance. (A) Boxplot comparison of F1 scores for KRAS mutation status. (B) EGFR mutation status. (C) Lymphovascular invasion. (D) Pathological N stage. (E) Pathological T stage. (F) AJCC staging.

Discussion

We built an unsupervised lung lesion 3D CT patch VAE that leverages multi-institutional data, a beta-VAE. This model can successfully reconstruct lung lesion 3D volumes by representing these volumes in lower-dimensional embeddings in an unsupervised manner (Figure 4). These embeddings were demonstrated to be not only useful in medical image synthesis and interpolation but also when predicting downstream tasks of patients with cancer.

Figure 4.

Figure 4

The flowchart of the synthetic experiment

The largest and smallest patches were sampled from the Stanford dataset, from which the difference vector was calculated. Using this vector, we “moved” the embeddings of a batch of randomly sampled lesions in the hyperspace. These “moved” patches were decoded and visualized.

We evaluated various beta-VAE models, training them on multi-institutional datasets to optimize model selection and leverage dataset heterogeneity for improved performance. The reconstruction image quality results showed that adding heterogeneity to the training data (i.e., including more datasets) was beneficial to the model’s performance on the Stanford dataset. A model trained on only one cohort performs worst compared to models trained with more than one dataset (Figure 1D). This is a promising result, as it suggests the power of pretraining a beta-VAE model and its potential to continue to improve when trained on more datasets. This is an important advantage of the beta-VAE compared to end-to-end CNN models, as the beta-VAE is trained in an unsupervised manner and does not need labels.

Next, a visualization of the embeddings using UMAP showed that the model encodes lesion size. To more explicitly understand how the model captures size, we attempted to manipulate lesion size by calculating a difference vector from the smallest and largest lesions and then adding or subtracting the difference to images’ embeddings and reconstructing them by moving samples in the learned latent space. This analysis showed that synthesized images’ lesion sizes can be manipulated by simply moving the latent embeddings in the selected direction. This suggests that the beta-VAE can capture size, which can open several potential applications including synthetic data generation of lesions with different sizes and tumor growth modeling. Since the sizes are encoded in the latent space, it reveals the potential that more critical information, like new phenotypes or malignancy level, might also be captured, and these can be explored in the future using clustering or classification methods.

Finally, we explored the use of the beta-VAE embedding in downstream tasks for a lung cancer cohort. We tested the beta-VAE on several clinically meaningful labels including mutation prediction and pathological characteristics. We compared the beta-VAE with two popular approaches in prediction tasks of medical imaging: a radiomic model, which uses predefined features, and an end-to-end supervised CNN model trained separately for each single task. In most of the cases (6 out of 7; on 1 task, none of the 3 models yielded predictions better than random guessing), the beta-VAE model shows non-inferior performance compared to either of the two baseline models. It is important to note that for downstream task predictions, the beta-VAE model has the potential to be further optimized and to gain higher performance when trained with additional data without much manual labeling. Radiomic models, by contrast, rely on predefined features that may inherently limit their potential compared to data-driven approaches like deep learning, although their performance can still improve with more samples. Additionally, CNN models need high-quality labels for all the data they are trained on, which is time and human-resource consuming, and they need to be retrained for each task. After training the beta-VAE, when a model is needed for a new task, the beta-VAE can serve as a powerful feature extractor, with a traditional machine learning approach used to predict the task.

There have been multiple attempts to strengthen the performance of VAEs. Van den Oord et al. developed the VQ-VAE,24 which selects latent variables from a discrete latent space library, and Pihlgren et al.25 explored the potential of adding perceptual loss to increase deep feature extraction performance. This study did not exploit perceptual loss because there are limited open-source 3D models pretrained on large medical imaging datasets. We did not choose the newer VQ-VAE architecture because the VQ-VAE chooses its embeddings from a discrete library, losing the ability to sample from a continuous latent space. Next, we emphasize that the beta-VAE can benefit from further training on additional public datasets of lung lesions or lung cancer, which may potentially boost the beta-VAE model’s performance, for example the recently released LOTUS dataset.26 Thanks to the nature of unsupervised training, we expect that the beta-VAE can continuously improve as more data are added.

In summary, we proposed a beta-VAE that can reconstruct 3D lung CT patches with lesions at high quality. We also demonstrated that our model can extract meaningful features in an unsupervised manner by visualizing the embeddings and testing their correlations with lesion size. In addition, we showed that the beta-VAE can artificially “control” lesion sizes, which opens possibilities for modeling lung tumor growth and for generating synthetic data. Finally, we showed that, using the embeddings, we can build predictive models for several clinical labels, with performance non-inferior to radiomics models and CNNs. For future work, we will continue to explore VAEs and their applications as information extractors in medical imaging. Potential future directions include the use of perceptual loss and discrete latent spaces, as well as extending the models to other image modalities such as MRI.

Limitations of the study

Our study has limitations that warrant further discussion. Our method relies on the centroid of the tumor to extract the 3D patch, which introduces a level of supervision. However, annotating the centroid is far less labor intensive than full pixel-level annotation. Future studies could explore more advanced unsupervised methods to eliminate the need for centroid annotation.

STAR★Methods

Key resources table

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Yiheng Li (yyhhli@stanford.edu).

Materials availability

This study did not generate new unique reagents.

Data and code availability

Method details

Datasets

The study utilized the LIDC-IDRI dataset (The Lung Image Database Consortium image collection, LIDC).14 The dataset consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with marked-up annotated lesions, with a total of 557 patients and 2129 different lesions. The annotations include lesion centroid coordinates, volume size, malignancy, subtlety, texture, and spiculation. Note that the annotations are subjective judgments by radiologists. The dataset is randomly split, at the patient level, into training, validation, and testing sets (445, 56, 56), with a ratio of 8:1:1.

The LNDb dataset,15 which focuses on the development and testing of pulmonary nodule computer-aided strategies, aiming to address late lung cancer detection challenges and radiologist variability, is also used in this study. We included 212 CT scans collected retrospectively at the Centro Hospitalar e Universitário de São João (CHUSJ) in Porto, Portugal between 2016 and 2018. In total, 1033 lesions were labeled by 3 radiologists, with their centroid coordinates, lesion volume size, and texture rating information. The dataset is randomly split, at the patient level, into training, validation, and testing sets (168, 22, 22), with a ratio of 8:1:1.

Next, the Stanford radiogenomics dataset16 is an NSCLC CT dataset collected at Stanford University. In contrast to the first two public datasets, which contain a mix of benign and malignant lung lesions, this dataset contains exclusively malignant lung cancer lesions, and various patient- and lesion-level downstream labels (patient outcomes and pathological classifications) are available as prediction labels. The dataset is randomly split, at the patient level, into training and testing sets with 100 and 43 patients, respectively. This dataset is mainly used for the evaluation tasks of the models.
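A patient-level ratio split like the ones described above can be sketched with a hypothetical helper (exact split sizes in the paper may differ by rounding):

```python
import numpy as np

def split_patients(patient_ids, ratios=(8, 1, 1), seed=0):
    """Random patient-level split into train/val/test with an 8:1:1
    ratio, so no patient appears in more than one set."""
    rng = np.random.default_rng(seed)
    ids = np.array(sorted(set(patient_ids)))
    rng.shuffle(ids)
    n = len(ids)
    n_train = round(n * ratios[0] / sum(ratios))
    n_val = round(n * ratios[1] / sum(ratios))
    return (ids[:n_train].tolist(),
            ids[n_train:n_train + n_val].tolist(),
            ids[n_train + n_val:].tolist())
```

Splitting by patient rather than by lesion prevents lesions from the same patient leaking across the training and testing sets.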

Image preprocessing

Given the variability in resolution and slice thickness across these three datasets, we resampled all datasets to a 1 × 1 × 1 mm isotropic resolution to ensure a consistent representation of the data for training our 3D variational autoencoder. This standardization allows for a more robust comparison of the features extracted by the beta-VAE across different datasets and facilitates the generalizability of the model. Furthermore, the 1 × 1 × 1 mm resolution preserves lesion/nodule size and shape across datasets, ensuring that the model captures the relevant characteristics of the lung nodules in a consistent manner. In addition, the voxel intensities were standardized to zero mean and unit variance within each dataset. 3D patches were then extracted around each lesion centroid coordinate to capture the lesion and its surrounding area. The lesion centroid coordinates were taken either as the weighted centroid of the lesion segmentations (LIDC and Stanford radiogenomics dataset) or directly from the annotated centroid in the dataset metadata (LNDb), where the segmentations or centroids were obtained from experienced radiologists.
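The resampling and patch extraction can be sketched as follows. This sketch uses nearest-neighbor resampling for brevity (a real pipeline would use trilinear interpolation, e.g., `scipy.ndimage.zoom` with `order=1`), and the 64-voxel patch size is an assumption, as the text does not state the patch dimensions.

```python
import numpy as np

def resample_to_1mm(volume, spacing_mm):
    """Nearest-neighbor resampling to 1 x 1 x 1 mm voxels: the scale
    factor per axis equals the original voxel spacing in mm."""
    new_shape = [int(round(s * sp)) for s, sp in zip(volume.shape, spacing_mm)]
    axes_idx = [np.minimum((np.arange(n) / sp).astype(int), s - 1)
                for n, sp, s in zip(new_shape, spacing_mm, volume.shape)]
    return volume[np.ix_(*axes_idx)]

def extract_patch(volume, centroid, size=64):
    """Crop a cubic patch centered on the lesion centroid, clamping the
    window to the volume bounds (assumes the volume is >= size per axis)."""
    half = size // 2
    starts = [int(np.clip(c - half, 0, s - size))
              for c, s in zip(centroid, volume.shape)]
    return volume[tuple(slice(st, st + size) for st in starts)]
```

For example, a scan with 2 mm spacing on every axis doubles in voxel count per axis after resampling, after which a fixed-size patch is cropped around each annotated centroid.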

We implemented a combined dataset that can load preprocessed 3D images from any of the three datasets. During training, data loading from the combined dataset was randomized, so batches can be drawn from any dataset, yielding a longer and more diverse training set.
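A minimal sketch of such a combined dataset (the class and method names are hypothetical; a real implementation would plug into the training framework's data-loading API):

```python
import random

class CombinedDataset:
    """Pools preprocessed 3D patches from several source datasets and
    serves them in a shuffled order, so one batch may mix all sources."""

    def __init__(self, *datasets, seed=0):
        self.items = [patch for ds in datasets for patch in ds]
        random.Random(seed).shuffle(self.items)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]
```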

Image augmentation was also used during training. For every lesion, five additional 3D patches were extracted alongside the original patch using augmentation methods including random rotation and shifting, with the rotation angle drawn from ±20° and the shift range set to ±15 pixels.
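The augmentation step could be sketched with `scipy.ndimage` as below. This is an illustrative sketch under assumptions: the rotation plane, interpolation order, and boundary handling are not specified in the text.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment_patch(patch, n_augmented=5, max_angle=20.0, max_shift=15.0, seed=0):
    """Return the original patch plus n_augmented randomly rotated/shifted copies."""
    rng = np.random.default_rng(seed)
    patches = [patch]
    for _ in range(n_augmented):
        angle = rng.uniform(-max_angle, max_angle)            # degrees, within +/-20
        offsets = rng.uniform(-max_shift, max_shift, size=3)  # voxels, within +/-15
        aug = rotate(patch, angle, axes=(1, 2), reshape=False, order=1, mode="nearest")
        aug = shift(aug, offsets, order=1, mode="nearest")
        patches.append(aug)
    return patches
```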

Model architecture

The variational auto-encoder is an unsupervised image reconstruction model whose architecture condenses information. It consists of an encoder, a reparameterization layer, and a decoder. The encoder is a series of convolutional blocks that downsamples the input image and outputs the mean and standard deviation vectors of the latent variable. The latent space has a dimensionality of 1,024 means and 1,024 standard deviations. The reparameterization layer applies the reparameterization trick, which enables sampling of the latent variable without stochastic nodes in the computation graph, making the model differentiable and allowing efficient backpropagation during training. This is achieved by sampling from a standard Gaussian distribution and then transforming the samples using the mean and standard deviation vectors produced by the encoder. Consequently, the reparameterization trick contributes to the smoothness of the latent space, enhancing the model's ability to preserve information. The latent variable is then fed to the decoder, which mirrors the encoder's structure and reconstructs an image of the original size. The encoder has four convolutional layers with dimensions 8, 32, 128, and 512; the decoder uses the same dimensions in reverse order.
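The reparameterization trick described above amounts to the following (numpy used purely for illustration; the actual model is a deep 3D network, and the log-variance parameterization is an assumption — the encoder could equally output a standard deviation directly):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients can flow
    through mu and log_var during backpropagation.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```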

To determine the optimal architecture and learning rate, we conducted experiments evaluating the reconstruction image quality on the test set. We considered several metrics, including Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE), with a particular focus on SSIM as the primary deciding factor. The chosen architecture and learning rate were those that yielded the best performance according to these metrics, striking a balance between model complexity and reconstruction quality.

Model training and validation

The training of the variational auto-encoder used two losses: the reconstruction loss, a pixel-wise mean squared error between the reconstructed image and the original input image (first term of Equation 1); and the KL divergence loss,27 a regularization loss that penalizes the divergence between the variational distribution and a standard multivariate Gaussian distribution (second term of Equation 1). The hyperparameter beta adjusts the weight of the KL loss:

L_β(θ, φ) = −E_{z∼q_φ(z|x)}[log p_θ(x|z)] + β · D_KL(q_φ(z|x) ‖ p_θ(z)) (Equation 1)

The models were trained with an initial learning rate of 2.0e-5 using a one-cycle learning rate scheduler, with a final dividing factor of 1e5 and a maximum learning rate of 5e-4. Beta was set to 1e-5.
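Numerically, this objective can be sketched with the closed-form Gaussian KL term (a minimal numpy illustration, assuming the posterior is a diagonal Gaussian parameterized by a mean and log-variance per latent dimension):

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=1e-5):
    """MSE reconstruction + beta-weighted KL(q(z|x) || N(0, I))."""
    reconstruction = np.mean((x - x_hat) ** 2)
    # closed form for a diagonal Gaussian posterior vs. a standard normal prior
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return reconstruction + beta * kl
```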

In this beta-VAE loss function, the first term is the MSE reconstruction loss and the second term is the KL divergence loss weighted by beta. One way to evaluate VAE models is to measure the similarity between reconstructed images and the original images; beyond visual inspection, this similarity can be quantified. Three metrics were used to assess reconstruction quality. Mean squared error (MSE), which is also the reconstruction loss, measures the pixel-wise difference between reconstructed and original images and indicates how well the model converges. The structural similarity index (SSIM) measures the structural similarity between images. Peak signal-to-noise ratio (PSNR) is another measure of image quality based on MSE:

MSE = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²
SSIM(x, y) = [(2μ_x μ_y + c₁)(2σ_xy + c₂)] / [(μ_x² + μ_y² + c₁)(σ_x² + σ_y² + c₂)]
PSNR = 20 log₁₀(MAX_I) − 10 log₁₀(MSE)

These are the formulas for the MSE, SSIM, and PSNR metrics. For MSE, Yᵢ is the original pixel value and Ŷᵢ is the predicted pixel value. SSIM is calculated on fixed-size windows of the two images; the formula shows the computation for two windows x and y, where μ denotes the means, σ² the variances, and σ_xy the covariance. c₁ and c₂ are constants that stabilize the division when the denominator is weak. PSNR is calculated from MAX_I, the maximum possible pixel value of the image, and the MSE.
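These metrics can be computed directly from the formulas above. The sketch below uses a single global window for SSIM for simplicity; library implementations such as scikit-image average SSIM over sliding windows, as the text describes.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def ssim_single_window(x, y, c1=0.01**2, c2=0.03**2):
    """SSIM over one window; c1/c2 are the conventional stabilizers for [0, 1] images."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))

def psnr(y, y_hat, max_i=1.0):
    return 20 * np.log10(max_i) - 10 * np.log10(mse(y, y_hat))
```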

Multiple models were trained using the three datasets with different dataset training strategies, and their reconstructed image quality as well as their performance on downstream prediction tasks in the Stanford dataset were compared. To ensure that all models converge on their own training set and that performance differences are fully attributable to the information provided by each training strategy, we fixed the total number of training steps and the batch size of each step. Based on the training loss curves, all training was stopped after 65,000 steps.

Synthesizing lesions of different sizes

Variational auto-encoders are known for their ability to interpolate and extrapolate images. Here, we designed an experiment to demonstrate the ability of our 3D beta-VAE to extrapolate 3D CT image patches. Although lung cancer lesions have many radiological properties, we chose size to experiment on, since it is relatively objective and can be obtained from the lesion segmentations without extra labeling. From the Stanford dataset, the lesion volumes were ranked, and the 36 largest (X_largest) and 36 smallest (X_smallest) lesions were sampled and passed through the beta-VAE model. We stored the output mean from the last layer of the encoder for each lesion image and calculated the difference vector (V_diff) between the centroids of the two groups.

V_diff = mean(Enc(X_largest)) − mean(Enc(X_smallest))

Then, to synthesize images, we randomly sampled a new batch of lesion patches from the Stanford dataset (X_sampled). After adding or subtracting V_diff from each embedding of the randomly selected batch, we fed the shifted embeddings to the decoder to obtain synthetic images (X_synthetic).

X_synthetic = Dec(Enc(X_sampled) ± V_diff)
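The embedding arithmetic itself is simple; a sketch with stand-in encoder/decoder functions (identity maps here, purely to illustrate the vector operations — the real Enc/Dec are the trained beta-VAE networks, and the mock embedding values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
enc = dec = lambda z: z  # stand-ins for the trained encoder/decoder

# mock encoder means for the 36 largest and 36 smallest lesions (1024-d latent)
emb_largest = rng.normal(5.0, 1.0, size=(36, 1024))
emb_smallest = rng.normal(-5.0, 1.0, size=(36, 1024))
v_diff = emb_largest.mean(axis=0) - emb_smallest.mean(axis=0)

emb_sampled = rng.normal(0.0, 1.0, size=(8, 1024))
x_grown = dec(enc(emb_sampled) + v_diff)   # shift toward "larger lesion"
x_shrunk = dec(enc(emb_sampled) - v_diff)  # shift toward "smaller lesion"
```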

Quantification and statistical analysis

Downstream analysis

To understand how well the variational auto-encoder captures lesion information, downstream analysis was performed by generating an embedding for each lesion by stacking the encoder outputs (the mean and variance vectors). Association analysis was performed between the embeddings and the annotation labels, using Spearman's correlation with false discovery rate (FDR) correction, to investigate the radiological meaningfulness of the embedding features. Then, multiple statistical learning models were trained and validated to assess how well the encoded embeddings classify the downstream tasks.
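The association analysis can be sketched with `scipy.stats.spearmanr` plus a hand-rolled FDR correction. The Benjamini-Hochberg procedure is an assumption (the text only says "FDR correction"); `statsmodels.stats.multitest.multipletests` provides the same functionality.

```python
import numpy as np
from scipy.stats import spearmanr

def fdr_bh(pvalues):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / (np.arange(n) + 1)
    # enforce monotonicity from the largest p-value downward
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

def associate(embeddings, label):
    """Correlate every embedding feature with a label, then FDR-correct."""
    pvals = [spearmanr(embeddings[:, j], label).pvalue
             for j in range(embeddings.shape[1])]
    return fdr_bh(pvals)
```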

In this study, we utilized several labels for downstream tasks, including AJCC score, pathological N stage, pathological T stage, lymphovascular invasion, EGFR mutation status, and KRAS mutation status. The AJCC score is a system for describing the amount and spread of cancer in a patient's body using TNM.28 We categorized all patients' AJCC scores into stages I, II, and III; for this study, we predict patients in stage I versus patients in more advanced stages. This stratification is clinically relevant, as it helps determine the appropriate treatment and prognosis for each patient, with earlier stages generally associated with better outcomes and less aggressive interventions. The N and T stages are essential descriptors of a cancer's status, including its location, size, extent of growth into nearby tissues, and spread to nearby lymph nodes or other body parts.29 T1-T4 denote different sizes and locations of the tumor, while N stages denote whether the cancer has affected the lymph nodes. We regrouped these two labels into N0 versus above and T1 versus above. This classification helps tailor treatment and prognosis to the cancer's stage, enabling more personalized patient care. Lymphovascular invasion (LVI) is the movement of cancer cells into either a blood or lymphatic vessel.30 The presence of LVI can signal a higher risk of metastasis and poorer prognosis, thus influencing treatment strategies. We also considered the mutation status of two genes closely related to lung cancer, EGFR and KRAS.31 The wild type was set to label 0 and the mutant type to label 1. Identifying the mutation status of these genes is essential for personalized treatment plans, as some therapies specifically target, or are more effective in the presence of, certain mutations. Additionally, these mutations can impact patient outcomes and responsiveness to certain treatments, making their assessment vital for effective patient management.

The XGBoost model32 was used for downstream task prediction, and 10-fold cross-validation was used to evaluate the model's performance on the downstream prediction tasks in the Stanford dataset.
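The cross-validation split can be sketched in pure numpy (XGBoost itself is omitted here; any classifier exposing fit/predict would slot in at the marked lines):

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# usage sketch:
#   for train_idx, test_idx in kfold_indices(len(X)):
#       model.fit(X[train_idx], y[train_idx])    # e.g., an XGBoost classifier
#       evaluate(model, X[test_idx], y[test_idx])
```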

PyRadiomics model

To assess the performance of the beta-VAE-based information extractor, the models were compared with two baselines: the PyRadiomics7 models and the CNN models. The PyRadiomics model extracts over 1,600 features, including first-order statistics, shape-based features, and texture features (such as Gray Level Co-occurrence Matrix, Gray Level Run Length Matrix, and Gray Level Size Zone Matrix). We used the default settings and feature extraction parameters provided by the library. An XGBoost classifier was trained on these extracted features to predict each task.

CNN model

The CNN model was trained on the extracted 32 × 32 × 32 augmented patches. It is based on a 3D convolutional neural network architecture consisting of four convolutional blocks, the same architecture as the encoder of the beta-VAE, with batch normalization, ReLU activations, and max-pooling layers, followed by fully connected layers and a softmax activation for multi-class classification tasks. The CNN was trained with the Adam optimizer at a learning rate of 2e-5 for 500 epochs on the Stanford dataset.

The PyRadiomics method yields human-defined radiomics features, while the beta-VAE yields features learned without supervision. To test whether these two feature sets can compensate for each other's limitations and provide complementary information, we added a combined model that concatenates the PyRadiomics and beta-VAE feature outputs for the downstream prediction task.

Evaluation and comparison were done using cross-validation. The Stanford dataset was randomly split into 10 folds; in each fold, the XGBoost models used with the beta-VAE and PyRadiomics features, as well as the CNN model, were retrained on the training split and evaluated on the test split.

Acknowledgments

F.C.-P. was supported by the Spanish Ministry of Sciences, Innovation and Universities under projects RTI-2018-101674-B-I00 and PID2021-128317OB-I00, the project from Junta de Andalucia P20-00163, and a predoctoral scholarship from the Fulbright Spanish Commission.

Author contributions

Y.L. is responsible for the experiment design, coding, deep learning model building and testing, making the majority of figures, and article drafting. C.Y.S. is responsible for the segmentation to the synthesized lesions and making the lesion size comparison figure. F.C.-P. is responsible for giving ideas and advice about the size synthesis experiment. H.M.S. is responsible for providing data access and article editing and proofreading. A.H.T. is responsible for giving advice and article editing and proofreading. O.G. is responsible for providing the resources, coming up with the project idea, giving advice, and article editing and proofreading.

Declaration of interests

The authors declare no competing interests.

Published: January 25, 2024

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2024.100695.

Supplemental information

Document S1. Figure S1–S4 and Table S1
mmc1.pdf (3.3MB, pdf)
Document S2. Article plus supplemental information
mmc2.pdf (7.6MB, pdf)

References

  • 1.Purandare N.C., Rangarajan V. Imaging of lung cancer: Implications on staging and management. Indian J. Radiol. Imaging. 2015;25:109–120. doi: 10.4103/0971-3026.155831. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Dziedzic R., Marjański T., Rzyman W. A narrative review of invasive diagnostics and treatment of early lung cancer. Transl. Lung Cancer Res. 2021;10:1110–1123. doi: 10.21037/tlcr-20-728. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Zerunian M., Caruso D., Zucchelli A., Polici M., Capalbo C., Filetti M., Mazzuca F., Marchetti P., Laghi A. CT based radiomic approach on first line pembrolizumab in lung cancer. Sci. Rep. 2021;11:6633. doi: 10.1038/s41598-021-86113-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Win T., Miles K.A., Janes S.M., Ganeshan B., Shastry M., Endozo R., Meagher M., Shortman R.I., Wan S., Kayani I., et al. Tumor heterogeneity and permeability as measured on the CT component of PET/CT predict survival in patients with non-small cell lung cancer. Clin. Cancer Res. 2013;19:3591–3599. doi: 10.1158/1078-0432.CCR-12-1307. [DOI] [PubMed] [Google Scholar]
  • 5.Kothari G., Korte J., Lehrer E.J., Zaorsky N.G., Lazarakis S., Kron T., Hardcastle N., Siva S. A systematic review and meta-analysis of the prognostic value of radiomics based models in non-small cell lung cancer treated with curative radiotherapy. Radiother. Oncol. 2021;155:188–203. doi: 10.1016/j.radonc.2020.10.023. [DOI] [PubMed] [Google Scholar]
  • 6.Mukherjee P., Zhou M., Lee E., Schicht A., Balagurunathan Y., Napel S., Gillies R., Wong S., Thieme A., Leung A., Gevaert O. A Shallow Convolutional Neural Network Predicts Prognosis of Lung Cancer Patients in Multi-Institutional CT-Image Data. Nat. Mach. Intell. 2020;2:274–282. doi: 10.1038/s42256-020-0173-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.van Griethuysen J.J.M., Fedorov A., Parmar C., Hosny A., Aucoin N., Narayan V., Beets-Tan R.G.H., Fillion-Robin J.-C., Pieper S., Aerts H.J.W.L. Computational Radiomics System to Decode the Radiographic Phenotype. Cancer Res. 2017;77:e104–e107. doi: 10.1158/0008-5472.CAN-17-0339. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Alahmari S.S., Cherezov D., Goldgof D., Hall L., Gillies R.J., Schabath M.B. Delta Radiomics Improves Pulmonary Nodule Malignancy Prediction in Lung Cancer Screening. IEEE Access. 2018;6:77796–77806. doi: 10.1109/ACCESS.2018.2884126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Le N.Q.K., Kha Q.H., Nguyen V.H., Chen Y.-C., Cheng S.-J., Chen C.-Y. Machine Learning-Based Radiomics Signatures for EGFR and KRAS Mutations Prediction in Non-Small-Cell Lung Cancer. Int. J. Mol. Sci. 2021;22 doi: 10.3390/ijms22179254. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Kundu R., Singh P.K., Mirjalili S., Sarkar R. COVID-19 detection from lung CT-Scans using a fuzzy integral-based CNN ensemble. Comput. Biol. Med. 2021;138 doi: 10.1016/j.compbiomed.2021.104895. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Higgins I., Matthey L., Pal A., Burgess C., Glorot X., Botvinick M., Mohamed S., Lerchner A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR; 2017. [Google Scholar]
  • 12.Wang S., Liu Z., Chen X., Zhu Y., Zhou H., Tang Z., Wei W., Dong D., Wang M., Tian J. Unsupervised Deep Learning Features for Lung Cancer Overall Survival Analysis. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2018;2018:2583–2586. doi: 10.1109/EMBC.2018.8512833. [DOI] [PubMed] [Google Scholar]
  • 13.Crespi L., Loiacono D., Chiti A. 2021 IEEE Symposium Series on Computational Intelligence (SSCI) IEEE; 2021. Chest X-Rays Image Classification from β-Variational Autoencoders Latent Features; pp. 1–8. [Google Scholar]
  • 14.Armato S.G., McLennan G., Bidaut L., McNitt-Gray M.F., Meyer C.R., Reeves A.P., Zhao B., Aberle D.R., Henschke C.I., Hoffman E.A., et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A Completed Reference Database of Lung Nodules on CT Scans. Med. Phys. 2011;38:915–931. doi: 10.1118/1.3528204. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Pedrosa J., Aresta G., Ferreira C., Rodrigues M., Leitão P., Carvalho A.S., Rebelo J., Negrão E., Ramos I., Cunha A., et al. 2019. LNDb: A Lung Nodule Database on Computed Tomography. [Google Scholar]
  • 16.Bakr S., Gevaert O., Echegaray S., Ayers K., Zhou M., Shafiq M., Zheng H., Benson J.A., Zhang W., Leung A.N.C., et al. A radiogenomic dataset of non-small cell lung cancer. Sci. Data. 2018;5 doi: 10.1038/sdata.2018.202. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.McInnes L., Healy J., Saul N., Großberger L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 2018;3:861. [Google Scholar]
  • 18.Torous V.F., Oliva E. On the new (version 9) American Joint Committee on Cancer tumor, node, metastasis staging for cervical cancer-A commentary. Cancer Cytopathol. 2021;129:581–582. doi: 10.1002/cncy.22486. [DOI] [PubMed] [Google Scholar]
  • 19.Robichaux J.P., Le X., Vijayan R.S.K., Hicks J.K., Heeke S., Elamin Y.Y., Lin H.Y., Udagawa H., Skoulidis F., Tran H., et al. Structure-based classification predicts drug response in EGFR-mutant NSCLC. Nature. 2021;597:732–737. doi: 10.1038/s41586-021-03898-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Huang L., Guo Z., Wang F., Fu L. KRAS mutation: from undruggable to druggable in cancer. Signal Transduct. Target. Ther. 2021;6:386. doi: 10.1038/s41392-021-00780-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Reig B., Moy L., Sigmund E.E., Heacock L. Diffusion MRI of the breast. Chapter 4 - Biomarkers, Prognosis, and Prediction Factors. 2023. pp. 49–70. [DOI] [Google Scholar]
  • 22.Pathologic N https://staging.seer.cancer.gov/tnm/input/1.3/breast/path_n/?(∼view_schema∼,∼breast∼)#:∼:text=Note%201%3A%20The%20pathologic%20classification,%281
  • 23.Wang H.-H., Li K., Xu H., Sun Z., Wang Z.-N., Xu H.-M. Improvement of T stage precision by integration of surgical and pathological staging in radically resected stage pT3-pT4b gastric cancer. Oncotarget. 2017;8:46506–46513. doi: 10.18632/oncotarget.14828. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.van den Oord A., Vinyals O., Kavukcuoglu K. NIPS; 2017. Neural Discrete Representation Learning. [Google Scholar]
  • 25.Pihlgren G.G., Sandin F., Liwicki M. 2020 International Joint Conference on Neural Networks (IJCNN) 2020. Improving Image Autoencoder Embeddings with Perceptual Loss. [Google Scholar]
  • 26.Afshar P., Mohammadi A., Plataniotis K.N., Farahani K., Kirby J., Oikonomou A., Asif A., Wee L., Dekker A., Wu X., et al. Lung-Originated Tumor Segmentation from Computed Tomography Scan (LOTUS) Benchmark. Preprint at arXiv. 10.48550/arXiv.2201.00458. [DOI]
  • 27.Kullback S., Leibler R.A. On Information and Sufficiency. Ann. Math. Statist. 1951;22:79–86. [Google Scholar]
  • 28.NCI Dictionary of Cancer Terms . 2011. National Cancer Institute.https://www.cancer.gov/publications/dictionaries/cancer-terms [Google Scholar]
  • 29.Stages of Cancer . 2010. Cancer.net.https://www.cancer.net/navigating-cancer-care/diagnosing-cancer/stages-cancer [Google Scholar]
  • 30.Lymphovascular invasion (LVI) 2020. MyPathologyReport.ca.https://www.mypathologyreport.ca/definition-lymphovascular-invasion/ [Google Scholar]
  • 31.Wang X., Ricciuti B., Nguyen T., Li X., Rabin M.S., Awad M.M., Lin X., Johnson B.E., Christiani D.C. Association between Smoking History and Tumor Mutation Burden in Advanced Non–Small Cell Lung Cancer. Cancer Res. 2021;81:2566–2573. doi: 10.1158/0008-5472.CAN-20-3991. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Chen T., Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. [Google Scholar]
