Abstract
Background:
Chest x-ray is widely utilized for the evaluation of pulmonary conditions due to its technical simplicity, cost-effectiveness, and portability. However, as a two-dimensional (2-D) imaging modality, chest x-ray images depict limited anatomical details and are challenging to interpret.
Purpose:
To validate the feasibility of reconstructing three-dimensional (3-D) lungs from a single 2-D chest x-ray image via Vision Transformer (ViT).
Methods:
We created a cohort of 2525 paired chest x-ray images (scout images) and computed tomography (CT) scans acquired on different subjects and randomly partitioned them into a training set (n = 1800), a validation set (n = 200), and a testing set (n = 525). The 3-D lung volumes segmented from the chest CT scans were used as the ground truth for supervised learning. We developed a novel model termed XRayWizard that employs ViT blocks to encode the 2-D chest x-ray image, with the aim of capturing global information and establishing long-range relationships, thereby improving the performance of 3-D reconstruction. Additionally, a pooling layer at the end of each transformer block was introduced to extract feature information. To produce smoother and more realistic 3-D models, a set of patch discriminators was incorporated. We also devised a novel method to incorporate subject demographics as an auxiliary input to further improve the accuracy of 3-D lung reconstruction. Dice coefficient and mean volume error were used as performance metrics to quantify the agreement between the computerized results and the ground truth.
Results:
In the absence of subject demographics, the mean Dice coefficient for the generated 3-D lung volumes was 0.738 ± 0.091. When subject demographics were included as an auxiliary input, the mean Dice coefficient significantly improved to 0.769 ± 0.089 (p < 0.001), and the volume prediction error was reduced from 23.5 ± 2.7% to 15.7 ± 2.9%.
Conclusion:
Our experiment demonstrated the feasibility of reconstructing 3-D lung volumes from 2-D chest x-ray images, and the inclusion of subject demographics as additional inputs can significantly improve the accuracy of 3-D lung volume reconstruction.
Keywords: 2-D chest x-ray, 3-D reconstruction, deep learning, lung reconstruction, vision transformer
1 |. INTRODUCTION
Chest x-ray is widely used to assess lung disease and monitor change over time. Compared to more advanced imaging techniques, such as computed tomography (CT) or magnetic resonance imaging (MRI), chest x-ray imaging has the advantage of being widely available owing to its lower technical requirements, portability, and lower cost. However, as a planar [two-dimensional (2-D)] imaging modality, chest x-ray imaging produces projection images with superimposed anatomical structures. Consequently, chest x-ray images depict limited anatomical details and are challenging to interpret, which can result in modest inter-reader agreement. In contrast, CT images depict significantly more normal and abnormal anatomical detail (e.g., boundaries, extent, heterogeneity, and solidness) and permit a detailed quantitative assessment of disease.1–5 Unfortunately, CT imaging has a higher cost and radiation exposure.
To address the challenges in interpreting 2-D x-ray images, a few studies have attempted to reconstruct 3-D images or anatomical structures from a single x-ray image using deep learning, primarily based on convolutional neural networks (CNNs). Shen et al.6 developed a CNN model that utilized an encoder-decoder structure to reconstruct volumetric images from single or multiple projection views. The encoders extracted pertinent feature information, such as the position and size of organs, from 2-D projections. To facilitate this process, a cross-dimensional transformation module was employed to convert the extracted information to feature-domain inputs. Subsequently, the decoders leveraged these inputs to generate 3-D images enriched with volumetric details. Similarly, Henzler et al.7 presented a CNN model with an encoder-decoder structure augmented with skip connections, designed specifically for volume prediction from high-resolution 2-D x-ray images. In contrast, Kasten et al.8 introduced an end-to-end CNN approach that focused on the 3-D reconstruction of knee bones using a pair of bi-planar x-ray images. Their study demonstrated the feasibility of generating a two-channel volumetric graph with dimensions of 128×128×128, employing two orthogonal 128×128 x-ray images from lateral and anterior-posterior (AP) views. Despite their promising performance, CNN-based 3-D reconstruction models encounter two major issues. First, the inclusion of an additional dimension significantly increases the data volume during the reconstruction process. To address this issue, previous approaches often employ a large number of convolution kernels in the initial layers, leading to extremely large models.
Second, CNN reconstruction models are limited in capturing spatially extensive information across the entire image, a constraint known as the limited "receptive field".9 While localized processing is beneficial for most computer vision tasks, it poses a challenge in the transition from 2-D to 3-D, where a given part of the 3-D model lacks clear correspondence with specific areas in the 2-D image. Consequently, during the process of 3-D reconstruction, all intermediate features need to extract information from the entire 2-D image.
We introduce a novel Vision Transformer (ViT) based framework termed XRayWizard to reconstruct the 3-D lung model from a single 2-D chest x-ray image. Our approach leverages self-attention layers to encode the 2-D image, enabling effective feature extraction. We also incorporate a pooling layer at the end of each ViT block to enhance the extraction of valuable information. To enhance the realism and smoothness of the generated 3-D lung surface model, we employed a set of patch discriminators to provide a Generative Adversarial Network (GAN) loss. To maximize the accuracy of 3-D lung generation, we incorporated subject demographic information as an auxiliary input, encoded as an additional patch within the ViT blocks during the generation process.
2 |. MATERIALS AND METHODS
2.1 |. Study cohort
We collected a cohort consisting of 2525 paired 2-D chest x-ray images (scout images) and 3-D low-dose chest CT scans acquired on different subjects from an ongoing lung cancer screening program10 (Table 1). Participants were initially enrolled between 2002 and 2005, 50–79 years old, and current or former cigarette smokers with at least 12.5 pack-years at the time of enrollment. Subject exclusion criteria included: 1) quit smoking > 10 years earlier, 2) history of lung cancer, or 3) chest CT within 1 year of enrollment. Subject information was collected at baseline and follow-up visits using structured interviews and questionnaires. Participants underwent low-dose computed tomography (LDCT) screening at baseline and follow-up. Only the CT scans with corresponding scout images were used in the study. This study was approved by the University of Pittsburgh Institutional Review Board (IRB # 21020128).
TABLE 1.
Subject demographics (n = 2525).
| Characteristic | Overall | Training | Validation | Testing |
|---|---|---|---|---|
| Count | 2525 | 1800 | 200 | 525 |
| Age (year), mean (SD) | 57.4 (10.8) | 57.4 (11.0) | 57.3 (11.7) | 57.3 (9.8) |
| Sex | | | | |
| Female, n (%) | 1190 (47.1) | 851 (47.3) | 80 (40) | 259 (49.3) |
| Male, n (%) | 1335 (52.9) | 949 (52.7) | 120 (60) | 266 (50.7) |
| Height (cm), mean (SD) | 169.4 (9.4) | 169.2 (9.4) | 170 (9.6) | 169.2 (9.4) |
| Weight (kg), mean (SD) | 83.0 (18.3) | 84.8 (18.8) | 84.8 (18.8) | 82.6 (19.4) |
| Smoking | | | | |
| Former, n (%) | 1031 (40.8) | 735 (40.8) | 86 (43) | 210 (40) |
| Current, n (%) | 1494 (59.2) | 1065 (59.2) | 114 (57) | 315 (60) |
All chest CT scans were performed on a General Electric (GE) scanner without radiopaque contrast with the participants in a supine position and holding their breath at end-inspiration. The CT data were acquired using the helical technique at a low radiation exposure (40 mAs) without tube current modulation. Contiguous CT images were reconstructed using GE's lung kernel at a 2.5 mm thickness. Planar scout images were acquired to set the frame of reference for the helical CT scan. In our cohort, the scout x-ray images had a consistent matrix of 888×733 and a pixel size of 0.5968×0.5455 mm. The cohort was randomly divided into a training set (n = 1800), a validation set (n = 200), and an independent test set (n = 525).
2.2 |. XRayWizard
2.2.1 |. Model architecture
XRayWizard utilizes an encoder-decoder structure (Figure 1). During the embedding process, the 2-D chest x-ray image is divided into multiple patches and embedded. Subject demographics are incorporated as an auxiliary input (which is optional). Subject demographics undergo embedding through a linear layer and are subsequently concatenated with other embedded image patches before the addition of position information. The lung volumes depicted on the chest CT scans were segmented using our previously developed automated algorithm,11 serving as the ground truth for the 3-D lungs.
FIGURE 1.
XRayWizard architecture. (1) The embedding block divides the input 2-D x-ray image into patches and embeds them. Optionally, demographics can be embedded through a linear layer and are subsequently concatenated with other embedded image patches. (2) The encoder applies three ViT blocks to extract global information from the 2-D images. (3) The 2-D feature maps are transposed to a 3-D form for subsequent reconstruction. (4) The decoder generates 3-D lung volume using a set of deconvolution blocks, which are depicted by colors and legends (the numbers indicate the kernel size). (5) Binary Cross Entropy (BCE) loss and Generative Adversarial Networks (GAN) loss are computed based on the generated lung volume and the ground truth lung volume obtained from CT scans.
The encoder stage consists of three consecutive ViT blocks. Each ViT block receives embedded patches as input. Additionally, a dropout layer is applied to the feature maps, which are then fed into multi-head attention layers. These attention layers extract relevant latent information from the feature maps. The feedforward layer can be considered an advanced fully connected layer that incorporates activation functions and dropout to integrate the feature information effectively.
To ensure consistent dimensions in the output 3-D model, the feature maps output by the encoder were transposed and then fed into the decoder. The decoder comprises a series of 3-D deconvolutions, progressively expanding the features from a size of 1024×4×4×4 to 64×64×64×64. Finally, the last layer employs a 3-D convolution operation with a kernel size of 1×1×1, integrating the features into a size of 1×64×64×64, which corresponds to the dimensions of the 3-D lung model.
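As a rough sketch of the shape arithmetic in the decoder described above (not the authors' code), the stride-2 deconvolution progression from 1024×4×4×4 to 64×64×64×64 can be traced in plain Python; the intermediate channel counts (halving per stage) are an assumption for illustration, as only the endpoints are stated:

```python
def deconv3d_out_size(size, kernel=4, stride=2, padding=1):
    """Output spatial size of a 3-D transposed convolution (standard formula)."""
    return (size - 1) * stride - 2 * padding + kernel

# Trace the assumed stage-by-stage expansion of the decoder.
spatial, channels = 4, 1024
stages = []
while spatial < 64:
    spatial = deconv3d_out_size(spatial)   # 4 -> 8 -> 16 -> 32 -> 64
    channels = max(channels // 2, 64)      # assumed channel halving per stage
    stages.append((channels, spatial))

print(stages)  # [(512, 8), (256, 16), (128, 32), (64, 64)]
# A final 1x1x1 convolution would then map the 64 channels to the
# single-channel 1x64x64x64 lung volume described in the text.
```

With kernel 4, stride 2, and padding 1, each stage exactly doubles the spatial size, so four stages reach 64³ from 4³.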
Both the generated lungs and the ground truth 3-D lungs are used as inputs to calculate the loss. The loss function enables the optimizer to compute the gradient of each parameter in the model and update them iteratively. This iterative process aims to enhance the realism and similarity of the generated 3-D model to the ground truth, leading to improved accuracy and fidelity.
2.2.2 |. Self-attention
Self-attention is a fundamental mechanism employed in ViT-based models to capture contextual relationships among various elements in a sequence.12 Specifically, given an input feature tensor X ∈ ℝ^(C×H×W), where C denotes the number of channels, H the height, and W the width, the output of the attention layer is Y, computed by the following formula:

    Y = Attention(Q, K, V) = SoftMax(QKᵀ / √d_k) V    (1)

where the queries Q = XW_Q, keys K = XW_K, and values V = XW_V are computed from X, and W_Q, W_K, and W_V are learnable parameters. d_k is equal to the second dimension of K and is used to ensure a smoother and more robust SoftMax function. During the computation process, each element in the output tensor Y gains access to global information from the input feature tensor X. In practice, we employed multi-head self-attention layers that consist of several sets of learnable weight matrices, denoted as (W_Q^i, W_K^i, W_V^i). These weight matrices enable the generation of multiple output tensors, denoted as Y_i, which are subsequently concatenated to form the final output tensor Y. Each head focuses on different aspects of information, allowing the model to capture diverse types of information and enhance its robustness. The multi-head self-attention layer serves as the central component within a ViT block, facilitating the model's ability to capture long-range dependencies and extract relevant features from the input tensor.12–14
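A minimal single-head sketch of the scaled dot-product attention in Equation (1), written in numpy for illustration (the token count and dimensions are arbitrary, not the model's actual sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Eq. (1): Y = softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # each row is one token's attention weights
    return A @ V

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 6, 16, 8       # toy sizes, for illustration only
X = rng.standard_normal((n_tokens, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
Y = self_attention(X, Wq, Wk, Wv)
print(Y.shape)  # (6, 8): every output token aggregates over all input tokens
```

A multi-head layer would run several such heads with independent weight matrices and concatenate the resulting Y_i, as described above.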
2.2.3 |. Vision Transformer (ViT)
In contrast to CNN models that utilize convolutional layers to process local image patches, the ViT model takes a different approach to image processing by treating images as a sequence of flattened patches.15 The input image is divided into a grid of patches of a fixed size (Figure 1). These patches are then linearly embedded into a sequence of patch embeddings. In our implementation, we used embedded patches with a dimension of 256×256. These patch embeddings, along with positional encodings, are subsequently fed into a ViT architecture. The self-attention mechanisms enable the model to capture global contextual relationships within the patch sequence, while the feedforward neural networks incorporate non-linear transformations to further process the encoded information.
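The patchify-and-embed step can be sketched as follows; the toy image size, patch size, and embedding dimension below are illustrative assumptions, not the sizes used by XRayWizard:

```python
import numpy as np

def patchify(image, patch):
    """Split an HxW image into non-overlapping patch x patch tiles, flattened."""
    H, W = image.shape
    assert H % patch == 0 and W % patch == 0
    tiles = image.reshape(H // patch, patch, W // patch, patch)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    return tiles

rng = np.random.default_rng(0)
img = rng.random((64, 64))               # toy image (real scout images are larger)
patches = patchify(img, 16)              # 16 flattened patches of 256 pixels each
E = rng.standard_normal((16 * 16, 32))   # learnable linear patch embedding
tokens = patches @ E                     # (16, 32) sequence of patch embeddings
pos = rng.standard_normal(tokens.shape)  # positional encodings (learned in practice)
tokens = tokens + pos
print(tokens.shape)  # (16, 32)
```

The resulting token sequence is what the self-attention layers of the previous section operate on.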
2.2.4 |. Auxiliary patch
Subject demographics were integrated into the model by embedding the information in the same dimension as the other image patches. The embedded demographic data are then combined with the image patches and fed into the self-attention layers. The purpose of this embedding process is to align the dimensions of the input demographics with the patch embeddings obtained from the image. A straightforward linear layer was found to be an effective implementation that achieved satisfactory results.
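A minimal sketch of the auxiliary-patch idea: a demographic vector is linearly projected to the token dimension and appended to the patch sequence as one extra token. The specific demographic values, their ordering, and the dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32                                       # toy token dimension
image_tokens = rng.standard_normal((16, d_model))  # embedded image patches

# Hypothetical demographic vector: [age, height_cm, weight_kg, sex, smoking].
demo = np.array([57.0, 169.0, 83.0, 1.0, 0.0])
W_demo = rng.standard_normal((demo.size, d_model)) # the linear embedding layer
demo_token = (demo @ W_demo)[None, :]              # one auxiliary "patch"

# The demographic token joins the sequence fed to self-attention.
sequence = np.concatenate([image_tokens, demo_token], axis=0)
print(sequence.shape)  # (17, 32): 16 image patches + 1 demographic patch
```

Because only `W_demo`'s input size depends on how many demographics are supplied, adding or removing a feature requires changing just that one layer, matching the flexibility discussed later in the paper.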
2.2.5 |. Loss function
Let x denote the input 2-D image, y the ground truth 3-D lung, and ŷ the output 3-D lung. The loss function employed in our approach is defined as follows:

    L = L_BCE + λ · L_G    (2)

This loss function comprises two components: the Binary Cross Entropy (BCE) loss and the generator loss of Generative Adversarial Networks (GAN).16 The hyperparameter λ is used to balance the influence of the two loss functions. The BCE loss contributes to making the generated 3-D lung model resemble the ground truth lungs:

    L_BCE = −(1/n) Σ_{i=1}^{n} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]    (3)

where y_i and ŷ_i represent the i-th element of y and ŷ, respectively, and n denotes the total number of elements in y. As for the generator loss, considering the large data volume of the 3-D lung images, we use 64 patch discriminators of size 16×16×16.17 Each patch discriminator functions similarly to the naïve conditional discriminator, distinguishing between real and fake patches:

    L_D = −(1/N) Σ_{j=1}^{N} [ log D_j(y_j) + log(1 − D_j(ŷ_j)) ]    (4)

    L_G = −(1/N) Σ_{j=1}^{N} log D_j(ŷ_j)    (5)

where D_j denotes the j-th patch discriminator, y_j and ŷ_j the corresponding real and generated patches, and N the total number of discriminators. We used the discriminator loss L_D to train the patch discriminators; however, only the generator loss L_G is included in the final loss function of our model.
During the training process, the hyperparameter λ plays a crucial role in balancing and coordinating the two loss functions, determining the relative weight assigned to each in the training of the model. The parameter λ was set to 0.1, 0.01, and 0.001 in separate training runs.
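The combined objective of Equations (2), (3), and (5) can be sketched in numpy; the volumes, discriminator scores, and shapes below are toy placeholders, not the model's actual tensors:

```python
import numpy as np

def bce(y, y_hat, eps=1e-7):
    """Eq. (3): voxel-wise binary cross-entropy over the 3-D volume."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def generator_loss(d_scores_fake, eps=1e-7):
    """Eq. (5): -(1/N) sum of log D_j(y_hat) over the N patch discriminators."""
    return -np.mean(np.log(np.clip(d_scores_fake, eps, 1 - eps)))

rng = np.random.default_rng(0)
y = (rng.random((64, 64, 64)) > 0.5).astype(float)  # toy binary ground truth
y_hat = rng.random((64, 64, 64))                    # toy generated volume
d_scores = rng.random(64)                           # toy scores from 64 patch discriminators

lam = 0.01                                          # best-performing value in Table 3
total = bce(y, y_hat) + lam * generator_loss(d_scores)  # Eq. (2)
print(total)
```

In training, `d_scores` would come from running the 16×16×16 patch discriminators on the generated volume; the discriminator loss of Equation (4) would be minimized separately to update the discriminators themselves.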
2.3 |. Model training
Random affine transforms and random perspective transforms were used for data augmentation to improve the generalization and mitigate overfitting. Specifically, the random affine transformation introduced variations in the input data through a combination of rotation (−5.0 to 5.0 degrees), translation (0.0 to 0.1 horizontal and vertical), and scaling (0.9 to 1.1). The random transformations made the model more robust to input image data at different orientations, positions, and sizes. The random perspective transformation introduced a controlled amount of perspective distortion to the data. This distortion, with a distortion scale set to 0.1, provided variability by altering the perspective of the objects within the image.
A batch size of 1 was utilized to minimize memory usage during training. The Adam optimizer was used to update the model parameters through gradient backpropagation. To ensure consistency and facilitate fair and meaningful comparisons, a constant learning rate of 0.0001 was used, and the models were trained for a fixed number of 100 epochs in each run. This approach guarantees that all models undergo the same learning schedule and have an equal opportunity to converge and optimize the objective function throughout the training.
The patch discriminators for XRayWizard were trained simultaneously by the Adam optimizer with a constant learning rate of 0.0001. In each epoch, one real lung and one fake lung were used for the training of the patch discriminators. All training processes were executed on an NVIDIA TITAN Xp GPU, utilizing the PyTorch 1.12.1 framework.
2.4 |. Performance evaluation
Considering that the 3-D lung volumes are represented as binary tensors, we utilized the Dice coefficient (Equation 6) as the primary metric to evaluate the performance of the algorithm in reconstructing 3-D lung surfaces. The Dice coefficient quantifies the similarity between the generated lung surface model and the ground truth, with higher values indicating a closer resemblance.
    Dice = 2|X ∩ Y| / (|X| + |Y|)    (6)

where X is the ground truth (i.e., the lung volume obtained from a chest CT scan) and Y is the reconstructed lung volume.
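For binary volumes, Equation (6) reduces to simple voxel counting, as in this small sketch (toy 4×4×4 volumes, for illustration):

```python
import numpy as np

def dice(y, y_hat):
    """Eq. (6): Dice = 2|X ∩ Y| / (|X| + |Y|) for binary 3-D volumes."""
    y, y_hat = y.astype(bool), y_hat.astype(bool)
    denom = y.sum() + y_hat.sum()
    return 2.0 * np.logical_and(y, y_hat).sum() / denom if denom else 1.0

a = np.zeros((4, 4, 4), dtype=bool); a[:2] = True   # 32 voxels "on"
b = np.zeros((4, 4, 4), dtype=bool); b[1:3] = True  # 32 voxels, 16 overlapping
print(dice(a, b))  # 0.5
print(dice(a, a))  # 1.0 for a perfect match
```

Identical volumes give a Dice of 1, disjoint volumes give 0, and the half-overlapping toy volumes above give 0.5.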
Ablation experiments were performed to evaluate the influence of ViT blocks, pooling layers, and patch discriminators on the performance of the algorithm. To validate the superiority of our encoder model, we maintained the same decoder structure across all experiments to ensure consistent output feature map size from the encoders. Specifically, we used the CNN model based on the source codes by Shen et al.,6 while the ViT model comprises three layers of ViT blocks without pooling layers and patch discriminators. In particular, we also studied the impact of demographic factors on the reconstruction results. Five demographics were considered, including height, smoking status, gender, age, and weight.
3 |. RESULTS
Table 2 presents a summary of the performance of each model on 525 different lung generation processes in the independent test set. As a baseline, the average Dice coefficient of the agreement between the ground truth lung models and the lung shapes generated by the CNN model was 0.688 ± 0.101. In comparison, ViT and our model achieved higher Dice coefficients, with values of 0.713 ± 0.095 and 0.738 ± 0.091, respectively (p < 0.0001). These results indicate that ViT-based models outperform the CNN-based model in terms of Dice coefficients, showcasing the impact of ViT blocks in generating 3-D shapes with improved quality and precision. Moreover, our model demonstrated superior performance compared to the vanilla ViT model but with significantly fewer parameters than the CNN model (Table 2), indicating its efficiency and computational advantages.
TABLE 2.
Quantitative results of different models.
| Model | ViT | Pooling layer | GAN | Dice coefficient↑ | Parameters |
|---|---|---|---|---|---|
| CNN | X | X | X | 0.688 ± 0.101 | 247.58M |
| ViT | √ | X | X | 0.713 ± 0.095 | 49.76k |
| XRayWizard | √ | √ | √ | 0.738 ± 0.091 | 84.71k |
Figure 2 presents two sets of results generated by three different models for a visual comparison. It is evident from these figures that both the ViT-based model and our proposed model outperform the CNN-based model in terms of generating lung shapes that closely resemble the ground truth. The lung shapes produced by the ViT-based and our model exhibit enhanced fidelity and capture the intricate details and structural characteristics of the lungs more accurately. Notably, the lung surfaces generated by the ViT-based model appear remarkably smooth, showcasing its ability to produce coherent and well-defined structures. In contrast, our model introduces more texture and intricate details, resulting in a more realistic appearance of the generated lungs. This observation emphasizes the capability of our model to capture fine-grained features and further enhance the realism of the generated lung structures.
FIGURE 2.
Examples of lungs generated by different models. (a) The input 2-D x-ray image. (b,c) The ground truth. (d,e) The 3-D lung generated by CNN, with a Dice coefficient of 0.672. (f,g) The 3-D lung generated by ViT, with a Dice coefficient of 0.733. (h,i) The 3-D lung generated by XRayWizard, with a Dice coefficient of 0.759.
We trained XRayWizard with different values of λ, and the results are shown in Table 3. The best result was reached when λ was set to 0.01.
TABLE 3.
The influence of hyperparameter λ on reconstruction performance.
| λ | 0.1 | 0.01 | 0.001 |
|---|---|---|---|
| Dice coefficient | 0.715 ± 0.150 | 0.738 ± 0.091 | 0.733 ± 0.089 |
Figure 3 depicts the distribution of Dice coefficients for the lungs generated by different models in the form of box plots, showing the variations in the performance of the models across different subjects. The Dice coefficients exhibited discrepancies for certain subjects, deviating significantly from the average performance of the models. A specific instance highlighting subpar performance was visualized in Figure 4.
FIGURE 3.
Box plots of the distribution of Dice coefficients of the lungs generated by different models.
FIGURE 4.
A specific case with poor performance featuring a female subject with a weight of 112.0 kg. (a) The input 2-D x-ray image. (b) The ground truth lung volumes obtained from the CT scan. (c) The 3-D lung volumes generated by XRayWizard with a dice coefficient of 0.357.
Table 4 shows the impact of incorporating different demographic factors on the results. By incorporating all demographic information, the Dice coefficient increased to 0.769 ± 0.089, indicating a significant enhancement in the accuracy and resemblance of the generated lung models to the ground truth (p < 0.001). Additionally, the volume percentage error was reduced to 15.7 ± 2.9%, demonstrating a notable decrease in the discrepancy between the generated lung volumes and actual volumes.
TABLE 4.
The performance of XRayWizard with the inclusion of different demographic information.
| Demographics | Dice coefficient↑ | Volume error (%) |
|---|---|---|
| None | 0.738 ± 0.091 | 23.5 ± 2.7 |
| Height | 0.752 ± 0.102 | 21.4 ± 2.9 |
| Smoke | 0.758 ± 0.149 | 20.3 ± 3.3 |
| Gender | 0.759 ± 0.092 | 21.1 ± 3.1 |
| Age | 0.766 ± 0.095 | 17.7 ± 3.4 |
| Weight | 0.767 ± 0.091 | 16.5 ± 2.6 |
| All | 0.769 ± 0.089* | 15.7 ± 2.9 |
* p < 0.0001
“None” indicates that no demographic information was included in the model, while “All” represents the inclusion of all the demographic features listed.
Figure 5 illustrates four extreme instances, including a female with a low BMI (18.1), a male with a low BMI (18.0), a female with a high BMI (42.7), and a male with a high BMI (42.0). In the figure, the first column represents the input x-ray images, while the second column displays the ground truths. The third column shows the 3-D lungs generated by XRayWizard without incorporating demographics, with corresponding Dice coefficients of [0.792, 0.788, 0.694, 0.702]. The fourth column presents the 3-D lungs generated by XRayWizard with all demographic inputs, yielding Dice coefficients of [0.805, 0.792, 0.701, 0.721].
FIGURE 5.
Examples of different genders and body mass index (BMI). (a) A female with a low BMI (18.1). (b) A male with a low BMI (18.0). (c) A female with a high BMI (42.7). (d) A male with a high BMI (42.4). Column 1: The input 2-D x-ray image. Column 2: The ground truths. Column 3: The 3-D lungs generated by XRayWizard with no demographics, with Dice coefficients of 0.792, 0.788, 0.694, and 0.702, respectively. Column 4: The 3-D lungs generated by XRayWizard with all demographics, with Dice coefficients of 0.805, 0.792, 0.701, and 0.721, respectively.
Figure 6 shows the performance of the models when the subjects were grouped by gender. According to this figure, our model performs better on male subjects and has relatively poorer performance on female subjects.
FIGURE 6.
The performance of XRayWizard in terms of Dice coefficients in reconstructing lung volumes with different demographics when grouped by gender.
When grouping the subjects based on their body mass index (BMI), the Dice coefficients of the lungs generated by each model are visualized in Figure 7. It demonstrated that our model performed better on subjects with low BMI and had poorer performance on subjects with high BMI. The inclusion of demographics improved the performance for each category of subjects.
FIGURE 7.
Visualization of Dice coefficients of lungs generated by XRayWizard with different demographics when grouped by body mass index.
4 |. DISCUSSION
We developed and validated a ViT-based model called XRayWizard to reconstruct 3-D lung volumes from a single 2-D chest x-ray image. Our motivation is to enrich 2-D chest X-ray images with 3D characteristics, enabling a more intuitive interpretation of these images and ultimately enhancing their diagnostic potential. To validate this concept, we focused on tackling a relatively manageable problem at this moment, namely the reconstruction of 3-D lung volumes from 2-D chest X-ray images. We acknowledge that a Dice coefficient of approximately 0.75 may not be exceptionally high, but it does demonstrate the feasibility of reconstructing 3D lung surfaces from 2D chest x-ray images. The novelty of this study is the utilization of ViT for this specific 2D-to-3D reconstruction task. Our experiments showed a significant improvement compared to CNN-based models (p < 0.001). Nevertheless, to demonstrate the potential clinical utility of the developed algorithm in practice, an observer study is needed; however, this is currently beyond the scope of this study.
Our approach leverages the self-attention mechanism in the encoder layers of the model to capture global information and long-range relationships, thereby improving the quality of 3-D reconstruction. Despite the promising performance, the underlying mechanism that enables the inference of 3-D information from 2-D images remains unclear. One potential explanation is that the relative image intensity or attenuation of a specific region (or pixel) in relation to its adjacent areas may carry implicit information about its depth, which could be effectively exploited by deep learning, such as the proposed ViT model. Results indicate that the ViT model exhibits better model convergence and requires a smaller model size compared to CNN models (Table 2). To enhance the feature extraction capability, we introduced pooling layers after each ViT block.18 These pooling layers effectively aggregate features from different parts of the 2-D images, contributing to improved performance. Additionally, we incorporated a set of patch discriminators in our model, leveraging the GAN loss to enhance the surface quality of the generated 3-D lung models. Experimental results demonstrated the significance of the introduced pooling layers and discriminators. The lung surfaces generated by ViT appear smoother, lacking the intricate details and texture observed in the ground truth (Figure 2f,g). In contrast, our model generated lung surfaces with more pronounced texture and fine details, resulting in a more realistic representation of actual lung structures (Figure 2h,i). During our experiments, we observed that the generator loss encouraged the model to generate lung shapes that closely resemble real lungs, capturing important details and ensuring higher fidelity.
Figure 3 reveals the presence of some cases with poor performance. Upon investigating these specific subjects (e.g., Figure 4), we discovered that their weights were considerably higher than average. This observation prompted us to incorporate demographics into our model. The auxiliary patch method provides a straightforward and effective approach to integrating demographic information with image features: aligning the demographic embedding with the image patch embeddings enables seamless integration within the model architecture. One notable advantage of this method is its flexibility in accommodating different numbers of input demographics. Adjusting the input size of the linear layer allows us to handle varying numbers of demographic features without extensive modifications to the overall model architecture or any compromise in performance, which is particularly valuable when adapting the approach to datasets or scenarios where the available demographic information differs. By including the demographic embeddings in the input sequence for self-attention, the model can effectively capture the interactions between the image patches and the demographic information, attending to both the visual features and the relevant demographic factors when generating the lung volumes. Through the self-attention mechanism, the model learns contextual relationships between the image patches and the demographic embeddings, facilitating a more informed and comprehensive generation process.
Experiments indicate that the inclusion of all demographics improves performance (Table 4). Among all demographics, height has the least impact, possibly because the model can extract height-related information directly from the chest x-ray image. Thus, including height as a demographic feature has a limited effect on performance improvement. Our model also demonstrated less effectiveness on female subjects and subjects with high BMI (Figure 5–7). For females, the inferior performance could be attributed to the chest obscuring the lungs in 2-D images (Figure 5a,c). Additionally, for subjects with a higher BMI, excessive fat makes the position of the lungs more random in the images, which may make it more difficult to accurately capture and represent the underlying lung structure (Figure 5c,d).
While our results are encouraging, it is important to acknowledge several potential limitations. First, to ensure that the 2-D and 3-D images were acquired largely at the same time, the scout x-ray images acquired along with the CT scans were used. It is worth noting that the dynamic nature of lung respiration may introduce inconsistencies between the respiratory state depicted in the 2-D scout x-ray image and the corresponding 3-D chest CT scan.19 Despite this challenge, our developed model demonstrated a promising ability to reconstruct 3-D volumes from 2-D chest x-rays, supported by the experiments on an independent test set (Table 4). Hence, we believe that our results demonstrate the feasibility of this approach, and we anticipate that applying our method to other 2-D x-ray images is likely to yield similar results, but experiments are needed to verify this. Second, the dataset used in this study was derived from a lung cancer screening program in which all subjects were at least 50 years old and had a history of smoking. This may affect the generalization of the developed model to other populations (e.g., young and healthy subjects). In the future, we plan to extend and optimize this approach for other similar applications, such as the reconstruction of 3-D pneumonic regions or lung tumors from 2-D chest x-ray images, to explore the potential clinical utilities of this 2D-to-3D reconstruction technology.
5 | CONCLUSION
We have introduced a ViT-based model that overcomes the limitations of CNN models in reconstructing 3-D lungs from a single 2-D x-ray image. Our model uses self-attention layers to capture global information and long-range relationships, resulting in improved 3-D reconstruction performance. By replacing convolution layers with fewer self-attention layers, our model achieves faster convergence and a smaller parameter count (model size). We also incorporated a pooling layer and patch discriminators to enhance the surface quality of the generated 3-D models, making them smoother and more realistic. Experiments on a real-world paired 2-D/3-D lung dataset demonstrated the superiority of our model over CNN-based models, confirming the effectiveness of our approach. Additionally, we introduced a novel concept, the auxiliary patch, to integrate demographics as auxiliary information for the reconstruction process. Overall, our model offers a promising solution for efficient and accurate 3-D lung reconstruction from 2-D x-ray images.
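The auxiliary-patch idea can be illustrated with a small sketch. Assuming, hypothetically (the exact design in our network is not reproduced here, and the names and dimensions below are illustrative only), that the demographics vector is linearly projected to the token dimension and appended to the ViT patch-token sequence as one extra token:

```python
import numpy as np

def append_auxiliary_patch(patch_tokens: np.ndarray,
                           demographics: np.ndarray,
                           w_proj: np.ndarray,
                           b_proj: np.ndarray) -> np.ndarray:
    """Project a demographics vector to the token dimension and append it
    to the ViT patch-token sequence as one extra 'auxiliary patch' token."""
    aux_token = demographics @ w_proj + b_proj             # (d_model,)
    return np.vstack([patch_tokens, aux_token[None, :]])   # (n_patches + 1, d_model)

# Toy dimensions (hypothetical): 196 image patches, 256-d tokens,
# 4 demographic fields (age, sex, height in cm, BMI).
rng = np.random.default_rng(0)
n_patches, d_model, n_demo = 196, 256, 4
tokens = rng.standard_normal((n_patches, d_model))
demo = np.array([62.0, 1.0, 170.0, 27.5])
w = rng.standard_normal((n_demo, d_model)) * 0.01
b = np.zeros(d_model)
augmented = append_auxiliary_patch(tokens, demo, w, b)     # shape (197, 256)
```

Because the extra token passes through the same self-attention layers as the image patches, every patch token can attend to the demographic information during encoding.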
ACKNOWLEDGMENTS
This work is supported in part by research grants from the National Institutes of Health (NIH) (R01CA237277, U01CA271888, R61AT012282, and P30CA047904) and UPMC Hillman Developmental Pilot Program.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
DATA AVAILABILITY STATEMENT
Data available upon request.
REFERENCES
- 1. Upchurch CP, Grijalva CG, Wunderink RG, et al. Community-acquired pneumonia visualized on CT scans but not chest radiographs: pathogens, severity, and clinical outcomes. Chest. 2018;153(3):601–610.
- 2. Garin N, Marti C, Scheffler M, Stirnemann J, Prendki V. Computed tomography scan contribution to the diagnosis of community-acquired pneumonia. Curr Opin Pulm Med. 2019;25(3):242–248.
- 3. Kitazawa T, Yoshihara H, Seo K, Yoshino Y, Ota Y. Characteristics of pneumonia with negative chest radiography in cases confirmed by computed tomography. J Community Hosp Intern Med Perspect. 2020;10(1):19–24.
- 4. Self WH, Courtney DM, McNaughton CD, Wunderink RG, Kline JA. High discordance of chest x-ray and computed tomography for detection of pulmonary opacities in ED patients: implications for diagnosing pneumonia. Am J Emerg Med. 2013;31(2):401–405.
- 5. Prendki V, Scheffler M, Huttner B, et al. Low-dose computed tomography for the diagnosis of pneumonia in elderly patients: a prospective, interventional cohort study. Eur Respir J. 2018;51(5).
- 6. Shen L, Zhao W, Xing L. Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning. Nat Biomed Eng. 2019;3(11):880–888.
- 7. Henzler P, Rasche V, Ropinski T, Ritschel T. Single-image tomography: 3D volumes from 2D cranial X-rays. Paper presented at: Computer Graphics Forum. 2018.
- 8. Kasten Y, Doktofsky D, Kovler I. End-to-end convolutional neural network for 3D reconstruction of knee bones from bi-planar X-ray images. Paper presented at: International Workshop on Machine Learning for Medical Image Reconstruction. 2020.
- 9. Luo W, Li Y, Urtasun R, Zemel R. Understanding the effective receptive field in deep convolutional neural networks. Adv Neural Inform Process Syst. 2016;29.
- 10. Wilson DO, Weissfeld JL, Fuhrman CR, et al. The Pittsburgh Lung Screening Study (PLuSS): outcomes within 3 years of a first computed tomography scan. Am J Respir Crit Care Med. 2008;178(9):956–961.
- 11. Pu J, Roos J, Yi CA, Napel S, Rubin GD, Paik DS. Adaptive border marching algorithm: automatic lung segmentation on chest CT images. Comput Med Imaging Graph. 2008;32(6):452–462.
- 12. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inform Process Syst. 2017;30.
- 13. Sheng L, Wang W, Shi Z, Zhan J, Kong Y. Brainnetformer: decoding brain cognitive states with spatial-temporal cross attention. Paper presented at: ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2023.
- 14. Voita E, Talbot D, Moiseev F, Sennrich R, Titov I. Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418. 2019.
- 15. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020.
- 16. Creswell A, White T, Dumoulin V, Arulkumaran K, Sengupta B, Bharath AA. Generative adversarial networks: an overview. IEEE Signal Process Mag. 2018;35(1):53–65.
- 17. Isola P, Zhu J-Y, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. Paper presented at: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
- 18. Heo B, Yun S, Han D, Chun S, Choe J, Oh SJ. Rethinking spatial dimensions of vision transformers. Paper presented at: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
- 19. Pu J, Sechrist J, Meng X, Leader JK, Sciurba FC. A pilot study: quantify lung volume and emphysema extent directly from two-dimensional scout images. Med Phys. 2021;48(8):4316–4325.