PLOS One. 2025 Aug 22;20(8):e0330463. doi: 10.1371/journal.pone.0330463

TomoGRAF: An X-ray physics-driven generative radiance field framework for extremely sparse view CT reconstruction

Di Xu 1, Yang Yang 2, Hengjie Liu 3, Qihui Lyu 1, Martina Descovich 1, Dan Ruan 3, Ke Sheng 1,*
Editor: Zhentian Wang 4
PMCID: PMC12373210  PMID: 40845061

Abstract

Objectives

Computed tomography (CT) provides high spatial-resolution visualization of 3D structures for various applications. Traditional analytical/iterative CT reconstruction algorithms require hundreds of angular samplings, a condition that may not be met in practice because of physical and mechanical limitations. Sparse view CT reconstruction has been proposed using constrained optimization and machine learning methods with varying success, less so for ultra-sparse view reconstruction. Neural radiance field (NeRF) is a powerful tool for reconstructing and rendering 3D natural scenes from sparse views, but its direct application to 3D medical image reconstruction has been minimally successful due to the differences in photon transportation and available prior information between optical and X-ray imaging.

Methods

We develop TomoGRAF to reconstruct high-quality 3D CT volumes using ultra-sparse projections. TomoGRAF has two main novelties pertinent to X-ray physics and CT imaging. First, TomoGRAF’s volume rendering module accumulates x-ray material attenuation passing through an object with CT geometry rather than visible light material color and opacity from surface interaction in NeRF. Second, TomoGRAF penalizes the difference between the simulated and ground truth volume during training besides the 2D views, thus significantly improving the prior fidelity.

Results

TomoGRAF is trained on LIDC-IDRI dataset (1011 scans) and evaluated on an unseen in-house dataset (100 scans) of distinct imaging characteristics from training and demonstrates a vast leap in performance compared with state-of-the-art deep learning and NeRF methods.

Conclusion

TomoGRAF provides the first generalizable solution for image-guided radiotherapy and interventional radiology applications, where only one/a few X-ray views are available, but 3D volumetric information is desired.

1. Introduction

Computed tomography (CT) acquires x-ray projections around the subject to generate 3D cross-sectional images. Compared to 2D radiographs, where the depth information along the ray direction is lost and structures are superimposed, CT enables the 3D representation of rich internal information for quantitative structure characterization. Analytically, Tuy’s data sufficiency condition, which requires a sufficient sampling trajectory, must be satisfied for mathematically rigorous reconstruction [1]. Violating Tuy’s condition leads to geometry and intensity distortion in the reconstructed images (referred to as limited-angle artifacts in the following). Besides the sampling trajectory, a minimal sampling density is required to avoid streak artifacts, which can severely corrupt images reconstructed from sparse sampling, e.g., < 100 views.

On the other hand, data-sufficient conditions may not always be met due to practical limitations, including imaging dose considerations, limited gantry freedom, and the need for continuous image guidance for radiotherapy and interventional radiology [2,3]. For instance, the total ionizing radiation exposure in mammograms is kept low to protect the sensitive tissue [4]. However, 2D mammograms without depth differentiation can be inadequate with dense breast tissues. Digital breast tomosynthesis (DBT), a limited-angle tomographic breast imaging technique, was introduced to overcome the problem of tissue superposition in 2D mammography while maintaining a low dose level. In DBT, limited projection views are acquired while the X-ray source traverses along a predefined trajectory, typically an arc spanning an angular range of 60° or less. The acquired limited angle samplings are then reconstructed as the volumetric representation with improved depth differentiation but still inferior quality to CT [5,6]. In different applications, the acquisition angles are not restricted, but the density of projection is reduced to lower the imaging dose [7,8] or to capture the dynamic information in retrospectively sorted 4DCT [9,10], resulting in sparse views in each sorting bin.

Image reconstruction from sparse-view and limited-angle samplings is an ill-posed inverse problem. Due to insufficient projection angles, the conventional filtered back-projection (FBP) [11] algorithm, algebraic reconstruction technique (ART) [12], and simultaneous algebraic reconstruction technique (SART) [13] suffer from limited-angle artifacts that worsen with sparser projections. Over the past few decades, substantial effort has been made to advance the development of sparse-view CT along two general avenues. One line of research lies in developing regularized iterative methods based on compressed sensing (CS) [14] theory. For instance, Sidky et al. proposed the adaptive steepest descent projection onto convex sets (ASD-POCS) method, which minimizes the total variation (TV) of the expected CT volume from sparsely sampled views [15]. Following that, the adaptive-weighted TV (awTV) model was introduced by Liu et al. for improved edge preservation with local information constraints [16], while an improved TV-based algorithm named TV-Stokes projection onto convex sets (TVS-POCS) was proposed soon after to eliminate the patchy artifacts and preserve more subtle structures [17]. Apart from the TV-based methods, the prior image-constrained CS (PICCS) [18], patch-based nonlocal means (NLM) [19], tight wavelet frames [20], and feature dictionary learning [21] algorithms were introduced to further improve the reconstruction performance in representing patch-wise structure features. More recently, deep learning (DL) techniques were explored for improved CT reconstruction quality in the image or sinogram domain. The image domain methods learn the mapping from the noisy sparse-view reconstructed CT to the corresponding high-quality CT using diverse network structures such as feed-forward networks [22,23], U-Net [24], and ResNet [25].
The sinogram domain methods work on improving/mapping the FBP algorithm [26–29] or interpolating the missing information in the sparse-sampled sinograms [30–33] with DL techniques. These and other deep learning-based sparse view CT reconstruction studies are comprehensively reviewed in Podgorsak et al. and Sun et al. [34,35].

Yet, with the exception where the same patient’s different CT was used as the prior [36], little progress was made in reconstructing high-quality tomographic imaging using less than ten projections, a practical problem in real-time radiotherapy or interventional procedures. For the former, an onboard X-ray imager orthogonal to the mega-voltage (MV) therapeutic X-ray provides the most common modality of image-guided radiotherapy (IGRT). However, a trade-off must be made between slow 3D cone beam CT (CBCT) and fast 2D X-rays [37,38]. A similar trade-off exists in interventional radiology [39]. When real-time interventional decisions need to be made in the time frame affording one to two 2D X-rays, yet 3D visualization of the anatomy is desired, a unique class of ultra-sparse view CT reconstruction problems combining extremely limited projection angles and sparsity is created.

With the advancement of deep learning and more powerful computation hardware, several recent studies proposed harnessing inversion priors through training data-driven networks for single/dual-view(s) image reconstruction. Specifically, Shen et al. designed a three-stage convolutional neural network (CNN) trained on patient-specific 4DCT to infer CT of a different respiratory phase using a single or two orthogonal view(s) [40]. Ying et al. built a generative adversarial network framework (X2CT-GAN) with a 2D to 3D CNN generator to predict tomographic volume from two orthogonal projections [41]. Though promising, their generalization and robustness to external datasets have not been demonstrated and may be fundamentally limited by two factors: First, they generate volumetric predictions purely from 2D manifold learning. As a result, these networks are incapable of comprehending the 3D world and the projection view formation process [42]. Second, a prerequisite for deep networks with complex enough parameters to implicitly represent 2D to 3D manifold mapping is large and diverse training data, a condition difficult to meet for medical imaging [43].

An effective perspective to mitigate those problems is to leverage intermediate voxel-based representation in combination with differentiable volume rendering for a 3D-aware model, which requires smaller data to generalize. The Neural Radiance Fields (NeRF) [44] model successfully implemented this principle for volumetric scene rendering. NeRF proposed synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. NeRF achieved this by representing a scene with a fully connected deep network with the input of 5D coordinates representing the spatial location, view direction, and the output of the volume density and view-dependent emitted radiance at the spatial location. The novel view was synthesized by querying 5D coordinates along the camera rays and using volume rendering techniques to project the object surface color and densities onto an image [44]. NeRF was designed to generate unseen views from the same object and typically required fixed camera positions as supervision. As an improvement, GRAF, a 3D-aware generative model 2D-supervised by unposed image patches, introduced a conditional radiance field generator trained within the Generative Adversarial Network (GAN) framework [45] that is capable of rendering views of novel objects from given sparse projection views [42].

The success of NeRF and GRAF motivated their application to 3D tomography problems. MedNeRF [46] was proposed by Corona-Figueroa et al. for novel view rendering from a few or single X-ray projections. MedNeRF inherits the general GRAF framework, remains 2D-supervised, and assumes a visible-light photon transportation configuration in the generator, with the addition of a self-supervised loss to the discriminator. However, there are distinct differences between CT volume reconstruction and “natural object” 3D representation rendering in terms of the available choices of training supervision, imaging setup, and properties of the rays. Specifically, 3D training supervision (object mesh with information on surface color) is often hard to acquire for “natural objects.” In contrast, existing CT scans are an ideal volumetric training ground truth (GT) for fitting a 3D tomographic representation learning model. Moreover, optical ray tracing works by computing a path from an imaginary camera (eye) through each pixel in a virtual screen and calculating the surface color and opacity of the object visible through the virtual screen by simulating ray reflection, shading, or refraction on the object surface (Fig 1(b)). The solving target of optical ray tracing is the object surface color (r,g,b) and density σ at a 3D location (x,y,z). Meanwhile, x-rays are transported from the focal spot and pass through an object to the detector plane, undergoing scattering and attenuation (Fig 1(c)). The goal of CT reconstruction is the voxel-wise material density δ at a 3D location (x,y,z). Because of these major differences between natural scenes and 3D medical images, the direct application of NeRF has not resulted in usable CT with ultra-sparse views.
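The difference between the two ray models can be illustrated with their standard textbook forms. The expressions below are not quoted from the present work; the symbols t_n, t_f, T, c, d, I_0 and μ are conventional notation introduced here for illustration, and the X-ray line assumes an idealized monoenergetic, scatter-free beam.

```latex
% Optical (NeRF) emission-absorption rendering along a camera ray r(t) = o + t d:
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)

% X-ray transmission (Beer-Lambert law) along a source-to-detector ray:
I = I_0 \exp\!\left(-\int \mu(\mathbf{r}(t))\,dt\right)
\quad\Longrightarrow\quad
p = -\ln\frac{I}{I_0} = \int \mu(\mathbf{r}(t))\,dt
```

In the optical case the pixel value is dominated by the first opaque surface the ray meets, whereas the X-ray projection value p is a plain line integral to which every voxel along the ray contributes.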

Fig 1. Architecture of the proposed TomoGRAF framework.

a, The illustration of the energy/wavelength difference between visible light and X-rays. b, The visualization of object imaging with visible light in 3D. c, The visualization of object imaging with X-rays in 3D. d, The diagram of the TomoGRAF pipeline in the training and testing stages. e, The visualization of the training and inference stages of TomoGRAF. TomoGRAF references multi-view projections and the corresponding CT volume during training using data collected from multiple institutes (inst.) and only requires sparse-view projections as reference at the inference stage to render the predicted CT volume from a new institute. f, The pipeline of the TomoGRAF network. TomoGRAF works on projection patches and CT sub-volumes in training. The input to TomoGRAF consists of the source setup matrix K, view direction (pose) ξ, and 2D sampling pattern v. g, The ray sampling mechanism in a patch of the projection input to TomoGRAF. u represents the center of the sampled patch, and s refers to the distance between two sampled patches. h, The design of the conditional radiance field in TomoGRAF with a fully connected coordinate network (gϑ), which consists of shape encoding hϑ and density head dϑ.

To overcome the challenges in ultra-sparse view CT reconstruction while maintaining the superior NeRF 3D structure representation efficiency, we introduced an x-ray-aware tomographic volume generator, termed TomoGRAF, to simulate CT imaging setup and use CT and its projections for 3D- and 2D-supervised training. TomoGRAF is further enhanced with a GAN framework and computationally scaled with sub-volume and image patch GTs training. To the best of our knowledge, this is the first pipeline that informs the NeRF simulator with X-ray physics to achieve generalizable high-performing CT volume reconstruction with ultra-sparse projection representation.

2. Materials and methods

2.1. Problem formulation

As illustrated in Fig 1(d), we formulated the problem of 3D image reconstruction from 2D projection(s) as a GAN-based DL framework comprising a generator $G$ and a discriminator $D$. We are given a series of 2D projections $X = \{X_1, X_2, \ldots, X_N\}$, where $X_i \in \mathbb{R}^{H \times W}$ for $i \in [1, N]$, $N$ is the number of available 2D projections, and $H$ and $W$ are the projection height and width. Our modeling target was to form a $G$ that can predict the 3D volume $\hat{Y} \in \mathbb{R}^{D \times H \times W}$ (where the exponent $D$ represents the volume depth), where $G$ is supervised by $X$ and the 3D volume GT $Y \in \mathbb{R}^{D \times H \times W}$, and is penalized by the discriminator $D$ to encourage optimal convergence.

2.2. TomoGRAF framework

TomoGRAF does not aim to optimize densely posed projections for rendering a single patient volume. Instead, it targets fitting a network for synthesizing new patient volumes by learning from various unposed projections. Note that the generator and discriminator work on image patches and sub-volumes during training for better efficiency, whereas a complete patient volume is rendered at inference time. The detailed components of TomoGRAF are shown in Fig 1(f–h). In what follows, we elaborate on the model architecture.

2.2.1. Generator.

Adapted from GRAF [42], our generator consists of three main components: ray sampling, a conditional generative radiance field, and projection rendering. The ray sampling module renders the x-ray paths in 3D that are associated with the truth patch/sub-volume, and the conditional generative radiance field module predicts the material density at each 3D location along the rendered x-ray paths. Lastly, the projection rendering module obtains the 2D composition from the predicted volumetric material densities. Overall, the generator $G$ takes the x-ray source setup matrix $K$, view direction (pose) $\xi = (\theta, \phi)$, 2D sampling pattern $v$, shape code $z_{sh} \in \mathbb{R}^{M_{sh}}$, and appearance code $z_a \in \mathbb{R}^{M_a}$ as input, and predicts an $M \times M$ CT projection patch $P' \in \mathbb{R}^{M \times M}$ ($M$ is a user-defined hyper-parameter; $M = 32$ is used in our experiments) and the associated CT sub-volume $V'$ corresponding to $P'$ at $\xi$ (in orthogonal viewing, $V' \in \mathbb{R}^{M \times M \times D}$; in non-orthogonal views, $V'$ has varying dimensions and is essentially the collection of points where the converging rays intersect the 3D grid). $K$ consists of $d = (d_1, d_2)$, where $d_1$ is the distance from the source to the patient (SAD) and $d_2$ is the distance from the source to the detector (SID). The detector resolution $S \in \mathbb{R}^2$ and $M$ are bounded by $(H, W)$.

Ray Sampling: The rendered rays $r \in \mathbb{R}^3$ are constrained by $\xi$, $K$, and $P'$. The x-ray pose $\xi$ is sampled from a predefined pose distribution $p_\xi$ collected from the projection angles in the training data, with the x-ray source always facing the origin of the coordinate system. As shown in Fig 1(g), $v = (u, s)$ determines the center $u = (x, y) \in \mathbb{R}^2$ and scale $s \in \mathbb{R}^+$ of the patch $P'$ that we aim to predict, and $V'$ is formed from all the 3D coordinates that project onto $P'$. At the training stage, $u$ and $s$ are drawn uniformly, $u \sim U(\Omega)$ and $s \sim U[1, S]$, where $\Omega$ defines the 2D projection domain and $S = \min(H, W)/M$. Notably, the coordinates in $P'$ and $V'$ are real numbers so that the radiance field can be evaluated continuously. Following the stratified sampling approach of Mildenhall et al. [44], $Q$ points are sampled along each ray $r$. The number of rays is $R = M \times M$ during training and $R = H \times W$ at inference.
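The two sampling steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the TomoGRAF implementation: the function names, the centred patch layout, and the restriction of u so that the whole patch stays inside the projection are hypothetical choices made here.

```python
import random

def sample_patch_coords(H, W, M, rng):
    """Return an M x M grid of continuous pixel coordinates for one patch."""
    S = min(H, W) / M                  # largest scale that still fits the image
    s = rng.uniform(1.0, S)            # scale s ~ U[1, S]
    half = s * (M - 1) / 2.0           # half-extent of the patch in pixels
    # Assumption: the centre is restricted so the patch stays inside the
    # projection; the text only states u ~ U(Omega).
    ux = rng.uniform(half, (W - 1) - half)
    uy = rng.uniform(half, (H - 1) - half)
    offsets = [s * (k - (M - 1) / 2.0) for k in range(M)]
    grid = [[(ux + dx, uy + dy) for dx in offsets] for dy in offsets]
    return grid, (ux, uy), s

def stratified_depths(near, far, Q, rng):
    """Split [near, far] into Q equal bins and draw one depth per bin."""
    width = (far - near) / Q
    return [near + (i + rng.random()) * width for i in range(Q)]

rng = random.Random(0)
grid, u, s = sample_patch_coords(128, 128, 32, rng)   # M*M = 1024 rays
depths = stratified_depths(0.0, 10.0, 8, rng)         # Q = 8 samples per ray
```

Because the coordinates are real-valued, the same routine covers both sparse, large-scale patches (s close to S) and dense, local ones (s close to 1).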

Conditional Generative Radiance Field: Adapted from GRAF [42], the radiance field is represented by a deep fully connected coordinate network $g_\vartheta$ whose input is the positional encoding of a 7D vector $(x, y, z, \theta, \phi, z_{sh}, z_a)$ consisting of the 3D location $\mathbf{x} = (x, y, z)$, the projection pose $\xi = (\theta, \phi)$, and the two codes. The output is the material density $\delta$ at the corresponding $\mathbf{x}$, where $\vartheta$ represents the network parameters, and $z_{sh} \sim p_{sh}$ and $z_a \sim p_a$, with $p_{sh}$ and $p_a$ being standard Gaussian distributions.

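$$g_\vartheta: \mathbb{R}^{L_x} \times \mathbb{R}^{L_\xi} \times \mathbb{R}^{M_{sh}} \times \mathbb{R}^{M_a} \to \mathbb{R}^{3} \times \mathbb{R}^{+} \tag{1}$$
$$(\gamma(\mathbf{x}), \gamma(\xi), z_{sh}, z_a) \mapsto \delta \tag{2}$$

where $L_x$ and $L_\xi$ are the dimensions of the latent codes of $\mathbf{x}$ and $\xi$, $M_{sh}$ and $M_a$ are the dimensions of the shape and appearance codes, with $z_{sh} \in \mathbb{R}^{M_{sh}}$ and $z_a \in \mathbb{R}^{M_a}$, and $\gamma(\cdot)$ represents positional encoding.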
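As an illustration of the positional encoding γ(·), the sketch below uses the sinusoidal frequency encoding popularized by NeRF. It is an assumption that TomoGRAF uses exactly this form, and the number of frequencies (`num_freqs`) is a hypothetical parameter that the text does not specify.

```python
import math

def positional_encoding(p, num_freqs):
    """Map each coordinate v to [sin(2^k*pi*v), cos(2^k*pi*v)] for k < num_freqs."""
    out = []
    for value in p:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * value
            out.extend((math.sin(angle), math.cos(angle)))
    return out

# A 3D location encoded with 10 frequencies gives 3 * 2 * 10 = 60 features,
# which would then be concatenated with the shape code z_sh.
gamma_x = positional_encoding([0.1, 0.2, 0.3], num_freqs=10)
```

The encoding lifts low-dimensional coordinates into a higher-dimensional space so that a fully connected network can represent high-frequency spatial variation.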

The architecture of gϑ is visualized in Fig 1(h) with Equations (3–6). First, shape encoding hϑ is conducted with the input of γ(x) and zsh. Second, hϑ is concatenated with γ(ξ) and za and then sent to the density head dϑ to predict δ.

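$$h_\vartheta: \mathbb{R}^{L_x} \times \mathbb{R}^{M_{sh}} \to \mathbb{R}^{H} \tag{3}$$
$$(\gamma(\mathbf{x}), z_{sh}) \mapsto h \tag{4}$$
$$d_\vartheta: \mathbb{R}^{H} \times \mathbb{R}^{L_\xi} \times \mathbb{R}^{M_a} \to \mathbb{R}^{1} \tag{5}$$
$$(h, \gamma(\xi), z_a) \mapsto \delta \tag{6}$$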

where the encoding was implemented with a fully connected network with ReLU activation.

X-ray Physics Informed Projection Rendering: Lastly, given the material densities $\{\delta_i^r\} = V'$, where $1 \le i \le Q$, of all points along the rays $\{r_j\}$, where $1 \le j \le M \times M$ in training, we used a CT projection algorithm, the Siddon ray tracing algorithm [47], to synthesize the 2D radiograph patch $P'$ given the preset $K$ and $\xi$. Specifically, the Siddon algorithm computes the path length of the ray $r$ through each intersected voxel $\mathbf{x}$ within the 3D grid, enabling an efficient ray path (line) integral over the grid. The pseudocode for the applied Siddon algorithm is listed in S1 Appendix.
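The quantity being rendered is a line integral of density along each ray. The sketch below approximates that integral with the midpoint rule on equally spaced samples; it is deliberately not the Siddon algorithm, which instead computes the exact intersection length of the ray with every voxel. The function names and the slab phantom are hypothetical and serve only as illustration.

```python
import math

def line_integral(density_fn, src, dst, num_samples):
    """Midpoint-rule estimate of the integral of density along src -> dst."""
    length = math.dist(src, dst)
    step = length / num_samples
    total = 0.0
    for i in range(num_samples):
        t = (i + 0.5) / num_samples
        point = tuple(a + t * (b - a) for a, b in zip(src, dst))
        total += density_fn(point) * step
    return total

def transmitted_fraction(integral):
    """Beer-Lambert law: fraction of photons that reach the detector."""
    return math.exp(-integral)

# Hypothetical phantom: a slab of density 0.2 occupying |x| <= 1.
slab = lambda p: 0.2 if abs(p[0]) <= 1.0 else 0.0
p_value = line_integral(slab, (-5.0, 0.0, 0.0), (5.0, 0.0, 0.0), 1000)
```

For this ray the path length inside the slab is 2, so the estimate is close to 0.2 × 2 = 0.4. Unlike optical alpha compositing, the result does not depend on the order in which the samples are visited.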

2.2.2. Discriminator.

Following the discriminator architecture defined in MedNeRF [46], two self-supervised auto-encoded CNN discriminators [48], $D_{1,\varphi}$ and $D_{2,\varphi}$, compare the predicted sub-volume $V'$ to the real sub-volume $V$ extracted from the real volume $Y$, and the predicted projection patch $P'$ to the real projection patch $P$ extracted from the real projection $I$, respectively. $D_{1,\varphi}$ convolves in 3D while $D_{2,\varphi}$ convolves in 2D, matching the dimensions of their discrimination targets. To extract $V$ and $P$, we first extracted $P$ from a real projection $I$ given $v = (u, s)$ randomly drawn from the corresponding distributions, and then located the coordinates of $V$ in $Y$ based on $\xi$ and $K$. $D_{1,\varphi}$ and $D_{2,\varphi}$ are backpropagated separately with their respective weights, while each shares its weights across different sub-volume/patch locations.

2.2.3. Supervision.

A distinctive supervision strategy is designed for the training and testing phases to better adapt to the imaging setup and available information in CT reconstructions.

During training, the model is guided using hybrid (2D and 3D) supervision. Specifically, multiple 2D sub-patches P from different view angles and their paired sub-volumes V are used as GTs to train the prior. This approach enables the model to efficiently establish a trained prior that represents the universal features shared across the training cohort. The use of 3D supervision is particularly advantageous in CT imaging, as it enables the model to learn not only the external surface but also the complex internal anatomy of the object. Unlike visible light-based reconstruction methods, which are typically limited to surface information from 2D views and cannot exploit the full 3D volume, TomoGRAF leverages the availability of 3D ground truth data to guide the reconstruction of deeper anatomical structures.

The inference stage can be further divided into two sub-steps, including patient-specific fine tuning and CT volume prediction. For patient-specific finetuning, the trained prior is further optimized to adapt to the incoming patient’s morphologies with supervision of the patient’s 2D sparse-view projection. At this stage, we use the complete 2D projections instead of 2D subpatches for supervision. This approach prioritizes strict maintenance of global structures and anatomical consistency during fine-tuning, albeit with a higher demand for GPU memory. Following the fine-tuning process, where the trained prior is tailored to the specific morphologies of the patient using their sparse-view 2D projections, the algorithm reconstructs the complete 3D CT volume. This reconstructed volume incorporates patient-specific anatomical details, ensuring that the final output accurately represents the unique structural characteristics of the individual.

2.2.4. Loss function.

Training stage: Our loss objective consists of discrimination of both patch and sub-volume predictions. First, the global structures in the intermediate decoded patches of $D_{1,\varphi}$ and $D_{2,\varphi}$ were separately assessed by Learned Perceptual Image Patch Similarity (LPIPS) [49] (Equations (7–8)).

$$L_{r,V} = \mathbb{E}_{f_v \sim D_{1,\varphi}(v),\, v \sim V}\left[\frac{1}{whd}\left\lVert \Phi_i(\mathcal{G}(f_v)) - \Phi_i(\mathcal{T}(v)) \right\rVert_2\right] \tag{7}$$
$$L_{r,P} = \mathbb{E}_{f_p \sim D_{2,\varphi}(p),\, p \sim P}\left[\frac{1}{whd}\left\lVert \Phi_i(\mathcal{G}(f_p)) - \Phi_i(\mathcal{T}(p)) \right\rVert_2\right] \tag{8}$$

where $\Phi_i(\cdot)$ denotes the output of the $i$th layer of the pretrained VGG16 [50] network, $f_v$ and $f_p$ represent the feature maps from $D_{1,\varphi}$ and $D_{2,\varphi}$, $w$, $h$, and $d$ stand for the width, height, and depth of the corresponding feature space, $\mathcal{G}$ is the pre-processing on $f$, and $\mathcal{T}$ is the processing on the truth sub-volumes/patches.

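Second, hinge loss was selected to classify $P'$ from $P$ and $V'$ from $V$, as formulated in Equations (9–10).

$$L_{h,V} = \mathbb{E}_{v' \sim V'}\left[f\left(D_{1,\varphi}(v')\right)\right] + \mathbb{E}_{v \sim V}\left[f\left(-D_{1,\varphi}(v)\right)\right] \tag{9}$$
$$L_{h,P} = \mathbb{E}_{p' \sim P'}\left[f\left(D_{2,\varphi}(p')\right)\right] + \mathbb{E}_{p \sim P}\left[f\left(-D_{2,\varphi}(p)\right)\right] \tag{10}$$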

Where f(t)=max(0,1+t).
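A minimal Python sketch of the hinge objective with f(t) = max(0, 1 + t) follows. It assumes the usual hinge convention in which the scores of real samples are negated before f is applied; the function names and the plain-list inputs are hypothetical.

```python
def f(t):
    """Hinge function f(t) = max(0, 1 + t)."""
    return max(0.0, 1.0 + t)

def hinge_discriminator_loss(scores_fake, scores_real):
    """Mean hinge loss over predicted (fake) and real discriminator scores."""
    # Fake samples are penalized unless their score is below -1.
    fake_term = sum(f(s) for s in scores_fake) / len(scores_fake)
    # Real samples are penalized unless their score is above +1
    # (assumed sign convention: f is applied to the negated score).
    real_term = sum(f(-s) for s in scores_real) / len(scores_real)
    return fake_term + real_term

# Well-separated scores incur no loss; an undecided discriminator
# (all scores 0) incurs a loss of 2.
separated = hinge_discriminator_loss([-2.0, -1.0], [1.0, 3.0])
undecided = hinge_discriminator_loss([0.0], [0.0])
```

The same function would be evaluated once on sub-volume scores and once on patch scores to obtain the two hinge terms.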

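Lastly, data augmentation, including random flipping and rotation, was applied to $V'$ and $P'$ before they were sent to $D_{1,\varphi}$ and $D_{2,\varphi}$, following the Data Augmentation Optimized for GAN (DAG) framework [51]. $D_{1,\varphi}$ and $D_{2,\varphi}$ each share weights while discriminating multiple augmented sub-volumes/patches. The overall loss objective is therefore formulated as Equations (11–12).

$$L(\vartheta, \{\varphi_{1,k}\}, \{\varphi_{2,k}\}) = L(\vartheta, \varphi_{1,0}, \varphi_{2,0}) + \frac{\lambda_1}{n-1}\sum_{k=1}^{n} L(\vartheta, \varphi_{1,k}, \varphi_{2,k}) \tag{11}$$
$$L(\vartheta, \varphi_{1,k}, \varphi_{2,k}) = L_{r,V,k} + L_{h,V,k} + \lambda_2\left(L_{r,P,k} + L_{h,P,k}\right) \tag{12}$$

where $\varphi_{1,k}$ and $\varphi_{2,k}$ denote the parameters of $D_{1,\varphi}$ and $D_{2,\varphi}$ under the $k$th augmentation, $n = 4$, $\lambda_1 = 0.2$, and $k = 0$ corresponds to the identity transformation, following the definition in Tran et al. [51]. $\lambda_2 = 0.5$ was set so that the model pays more attention to conformal sub-volume rendering.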
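The way the individual terms combine can be sketched as follows. This is an illustration only: it assumes the augmentation terms are weighted by λ1/(n − 1) as in the DAG formulation, takes the per-term loss values as already computed scalars, and uses hypothetical function names.

```python
def per_transform_loss(l_rec_v, l_hinge_v, l_rec_p, l_hinge_p, lam2=0.5):
    """Loss for one transformation k: volume terms plus down-weighted patch terms."""
    return l_rec_v + l_hinge_v + lam2 * (l_rec_p + l_hinge_p)

def total_loss(losses, lam1=0.2):
    """Combine the identity loss (losses[0], k = 0) with n augmented losses."""
    n = len(losses) - 1               # number of augmentations, k = 1..n
    if n < 2:
        raise ValueError("need at least two augmented terms so that n - 1 > 0")
    return losses[0] + lam1 / (n - 1) * sum(losses[1:])

# With n = 4 augmentations and identical per-transform losses of 3.0:
one = per_transform_loss(1.0, 1.0, 1.0, 1.0)   # 1 + 1 + 0.5 * 2 = 3.0
overall = total_loss([one] * 5)                # 3.0 + (0.2 / 3) * 12.0
```

Setting λ2 below 1 halves the contribution of the projection patch terms relative to the sub-volume terms, which matches the stated intent of emphasizing volume fidelity.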

Inference Stage: A fully trained $G_\vartheta$ was further fine-tuned with $z_{sh}$ and $z_a$ using the trained prior and the incoming patient’s sparse-view projection(s) to render the final patient-specific volumetric prediction $\hat{Y}$. Since we conducted moderate optimization with limited iterations, $G_\vartheta$ was tuned with the full-size $I$ instead of patches. Depending on the available views, a reference projection was randomly drawn at each iteration until $G_\vartheta$ reached the convergence criterion. In our experiments, a peak signal-to-noise ratio (PSNR) of 25 was set as the stopping threshold. The inference loss objective is defined in Equation (13) as a combination of LPIPS, PSNR, and the negative log-likelihood (NLL) loss.

$$L_{G_\vartheta} = \lambda_1 L_{r,I} + \lambda_2 L_{\mathrm{PSNR},I} + \lambda_3 L_{\mathrm{NLL},I} \tag{13}$$

Where λ1=λ3=0.3 and λ2=0.1 were set in our experiments.

2.3. Implementation details

During training, the RMSprop optimizer [52] was used with a batch size of 4 (4×1), a learning rate of 0.0005 for the generator and 0.0001 for the discriminator, for 40000 iterations. For per-patient inference fine-tuning, the RMSprop optimizer [52] was applied to the generator with a batch size of 1, a learning rate of 0.0005, and a stopping threshold of PSNR = 25 (mostly reached in under 1000 iterations). All experiments were carried out on a cluster of 4 × NVIDIA RTX A6000 GPUs.

2.4. Evaluation metrics

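We evaluate the predicted CT volume $\hat{Y}$ and the projection $\hat{I}$ corresponding to the reference view of our TomoGRAF generator using PSNR, the structural similarity index measure (SSIM), and the root mean squared error (RMSE), defined in Equations (14–16).

$$\mathrm{PSNR} = 20 \cdot \log_{10}\left(\frac{MAX_I}{\mathrm{RMSE}}\right) \tag{14}$$
$$\mathrm{SSIM} = \frac{(2\mu_{G_\vartheta}\mu_y + c_1)(2\sigma_{G_\vartheta y} + c_2)}{(\mu_{G_\vartheta}^2 + \mu_y^2 + c_1)(\sigma_{G_\vartheta}^2 + \sigma_y^2 + c_2)} \tag{15}$$
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \lVert y(i) - \hat{y}(i) \rVert^2}{N}} \tag{16}$$

where $MAX_I$ is the maximum possible pixel value in a tensor, $\mu_{G_\vartheta}$ and $\mu_y$ are the pixel means of $G_\vartheta$ and $y$, $\sigma_{G_\vartheta y}$ is the covariance between $G_\vartheta$ and $y$, and $\sigma_{G_\vartheta}^2$ and $\sigma_y^2$ are the variances of $G_\vartheta$ and $y$. Lastly, $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$, where $k_1 = 0.01$ and $k_2 = 0.03$ in the current work and $L$ is the dynamic range of the pixel values ($2^{\#\text{bits per pixel}} - 1$).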
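The three metrics can be sketched in plain Python as follows. This is an illustrative, assumption-laden version: images are passed as flat lists already normalized to [0, 1] (so MAX_I = L = 1), and SSIM is computed over a single global window, whereas common library implementations average it over local sliding windows; the text does not state which variant was used.

```python
import math

def rmse(y, y_hat):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def psnr(y, y_hat, max_i=1.0):
    """PSNR in dB; undefined (division by zero) when the inputs are identical."""
    return 20.0 * math.log10(max_i / rmse(y, y_hat))

def ssim_global(y, y_hat, L=1.0, k1=0.01, k2=0.03):
    """SSIM evaluated once over the whole image (single-window assumption)."""
    n = len(y)
    mu_y, mu_g = sum(y) / n, sum(y_hat) / n
    var_y = sum((a - mu_y) ** 2 for a in y) / n
    var_g = sum((b - mu_g) ** 2 for b in y_hat) / n
    cov = sum((a - mu_y) * (b - mu_g) for a, b in zip(y, y_hat)) / n
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    num = (2 * mu_g * mu_y + c1) * (2 * cov + c2)
    den = (mu_g ** 2 + mu_y ** 2 + c1) * (var_g + var_y + c2)
    return num / den

truth = [0.0, 0.5, 1.0]
pred = [0.1, 0.5, 0.9]
scores = (ssim_global(truth, pred), psnr(truth, pred), rmse(truth, pred))
```

An RMSE of 0.1 on a [0, 1] scale corresponds to a PSNR of 20 dB, which gives a feel for the stopping threshold of 25 used during fine-tuning.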

2.5. Baseline algorithms

A CNN-based method X2CT-GAN [41], and a NeRF-based method MedNeRF [46] were included as our benchmarks with both the performance in projection inference and CT volume rendering compared. X2CT-GAN and MedNeRF are evaluated using the open-sourced codes and network weights released by authors, with the input of our in-house test set arranged following their data organization guidelines.

2.6. Data cohorts

A total of 1011 CT scans were selected from the LIDC-IDRI [53] thoracic CT database to form the training set. Digitally reconstructed radiographs (DRRs) were generated as projections for training supervision. Several scanner manufacturers and models were included (GE Medical Systems LightSpeed scanner models, Philips Brilliance scanner models, Siemens Definition, Siemens Emotion, Siemens Sensation, and Toshiba Aquilion). The tube peak potentials for scan acquisition include 120 kV, 130 kV, 135 kV, and 140 kV. The tube current is in the range of 40–627 mA. Slice thickness includes 0.6 mm, 0.75 mm, 0.9 mm, 1.0 mm, 1.25 mm, 1.5 mm, 2.0 mm, 2.5 mm, 3.0 mm, 4.0 mm and 5.0 mm. SAD includes 595 mm, 541 mm, 570 mm and 535 mm with corresponding SID of 1085.6 mm, 949.1 mm, 1040 mm and 940 mm, respectively. The in-plane pixel size ranges from 0.461 to 0.977 mm [53]. For each scan, 72 DRRs covering a full 360° vertical rotation (one every 5°) were generated. All the DRRs and CT volumes had the black border cropped out. DRRs were resized to a resolution of 128×128, and CT volumes were interpolated to a resolution of D×H×W=128×128×128 for model training. All the data were patient-wise normalized to [0, 1].
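Two of the preparation steps above lend themselves to a short sketch: enumerating the DRR angles and normalizing each patient to [0, 1]. The text does not say how the normalization was performed, so the min-max scaling below is an assumption, and the function names are hypothetical.

```python
def drr_angles(step_deg=5):
    """Projection angles in degrees covering a full rotation at a fixed step."""
    return list(range(0, 360, step_deg))

def normalize_patient(values):
    """Min-max scale one patient's intensities to [0, 1] (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

angles = drr_angles()                           # 72 angles: 0, 5, ..., 355
scaled = normalize_patient([-1000, 0, 1000])    # e.g. HU values of one patient
```

Normalizing per patient rather than per cohort removes scanner-dependent intensity offsets, at the cost of discarding the absolute HU scale until the output is mapped back.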

The test data were collected under IRB approval (IRB # 20–32527), entitled “Image-guided radiation therapy”, and include 100 de-identified CT scans from lung cancer patients who underwent robotic radiation therapy. Informed consent was not required for this retrospective imaging study. All patients were scanned on a Siemens Sensation scanner with a tube peak potential of 120 kV, tube current of 120 mA, slice thickness of 1.5 mm, and in-plane pixel size of 0.977 mm. The SAD and SID are 570 mm and 1040 mm, respectively. The anterior-posterior (AP) and lateral views were generated as inference references, with 1-view-based inference referencing the AP view projection only. All the DRRs and CT volumes had the black border cropped out. Additionally, DRRs were resized to a resolution of 128×128, and CT volumes were interpolated to a volume size of 128×128×128. All the data were patient-wise normalized to [0, 1] prior to being fed into models for inference.

2.7. Model performance as a function of the number of views and 3D supervision

The model baseline performance was established using a single AP view. 1, 2, 5, and 10 view reconstructions were also performed to determine the model performance. The view angles are specified as follows: for 1-view-based reconstruction, the AP view is used for referencing. For 2-view-based reconstruction, the AP and lateral views are used for inferencing. For 5-view-based reconstruction, a full 360° is covered with rotation every 72°, starting from the AP view. For 10-view-based reconstruction, a full 360° is covered with rotation of every 36°, starting from the AP view.
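The view configurations listed above can be generated programmatically. In the sketch below, placing the AP view at 0° and the lateral view at 90° is an assumed angular convention, and the function name is hypothetical.

```python
def view_angles(num_views):
    """Gantry angles in degrees for the 1-, 2-, 5- and 10-view settings."""
    if num_views == 2:
        return [0.0, 90.0]            # AP plus lateral (orthogonal pair)
    # Otherwise spread the views evenly over 360 degrees, starting at AP.
    return [i * 360.0 / num_views for i in range(num_views)]

configs = {n: view_angles(n) for n in (1, 2, 5, 10)}
```

Only the 2-view case departs from even spacing, since evenly spaced views would place the second projection at 180°, which is largely redundant with the AP view.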

3. Results

The results of TomoGRAF, MedNeRF, and X2CT-GAN with 1 or 2 views for 2D projection and volume rendering are visually demonstrated in Figs 2 and 3, with accompanying statistics reported in Table 1 and Fig 4. TomoGRAF consistently outperforms MedNeRF and X2CT-GAN in both tasks with the most evident advantage in 1-view volume reconstruction.

Fig 2. Lung projection rendering in four 360°-clockwise-rotated (visualized every 90° of rotation) views from a patient in a test set.

Projections from X2CT-GAN are generated by applying the DRR synthesis algorithm on predicted CT volumes since the original X2CT-GAN was only designed to predict the CT volume. TomoGRAF and MedNeRF directly predicted the 2D projections. Results rendered by referencing 1-view and 2-views are both visualized. All the images are shown with a normalized window of [0, 1].

Fig 3. CT Reconstruction results for two representative patients in the test set.

Only the residual maps of TomoGRAF are shown, as the two comparison methods result in a large mismatch with GT. The images are visualized in a lung window with a Hounsfield unit (width, level) of (1500, −600). The red boxes denote the lung tumors, while the arrows point to the pacemaker in patient 1.

Table 1. Statistical results evaluated on the test set. The best results from each metric are underscored. ↑ indicates that higher values are better, and ↓ that lower values are better. SSIM and PSNR are calculated with images normalized to [0, 1], and RMSE is calculated in Hounsfield units (HU).

| Modality | Method | 1-View SSIM↑ | 1-View PSNR (dB)↑ | 1-View RMSE (HU)↓ | 1-View Inference Time (s)↓ | 2-Views SSIM↑ | 2-Views PSNR (dB)↑ | 2-Views RMSE (HU)↓ | 2-Views Inference Time (s)↓ |
|---|---|---|---|---|---|---|---|---|---|
| CT Volume | X2CT-GAN | 0.31 ± 0.12 | 14.39 ± 0.19 | 386.69 ± 27.43 | 0.27 | 0.48 ± 0.06 | 17.35 ± 0.21 | 347.76 ± 25.46 | 1.31 |
| CT Volume | MedNeRF | 0.37 ± 0.08 | 7.68 ± 0.10 | 321.87 ± 22.87 | 527.78 ± 15.48 | 0.50 ± 0.09 | 18.21 ± 0.09 | 299.49 ± 21.58 | 865.46 ± 30.81 |
| CT Volume | TomoGRAF | 0.79 ± 0.03 | 33.45 ± 0.13 | 175.48 ± 10.47 | 344.25 ± 10.32 | 0.85 ± 0.04 | 35.89 ± 0.13 | 146.73 ± 9.63 | 719.46 ± 26.78 |
| Projection | X2CT-GAN | 0.34 ± 0.11 | 11.88 ± 0.19 | 51.96 ± 7.98 | | 0.51 ± 0.09 | 18.23 ± 0.26 | 47.64 ± 7.32 | |
| Projection | MedNeRF | 0.67 ± 0.07 | 25.02 ± 0.15 | 36.48 ± 4.36 | | 0.69 ± 0.08 | 27.31 ± 0.14 | 33.42 ± 4.01 | |
| Projection | TomoGRAF | 0.69 ± 0.03 | 25.43 ± 0.14 | 34.37 ± 4.58 | | 0.71 ± 0.04 | 27.99 ± 0.13 | 31.22 ± 4.12 | |

Fig 4. Patient-wise evaluated SSIM distribution (visualized using histograms with smoothed trend curves) for a, 1-view-based volume rendering results of TomoGRAF, X2CT-GAN, and MedNeRF in the test set.

b, 2-view-based volume rendering results of TomoGRAF, X2CT-GAN, and MedNeRF in the test set.

For 2D projection rendering, TomoGRAF achieves marginally better results than MedNeRF. Both models maintain the overall critical body shapes of GT, and TomoGRAF visualizes more detailed morphology, such as heart, spine, and vascular structures. In comparison, the projection results of X2CT-GAN show visible distortion and significantly worse quantitative performance.

For 3D volume reconstruction, as shown in Fig 3, with 1-view, TomoGRAF depicts rich and correct anatomical details with visible tumors and a pacemaker consistent with GT. The results are further refined with the second orthogonal X-ray view, improving fine details’ recovery. In comparison, MedNeRF and X2CT-GAN fail to render patient-relevant 3D volumes with 1 or 2 views. MedNeRF loses most anatomical details; X2CT-GAN deforms 3D anatomies that do not reflect patient-specific characteristics, such as lung tumors and the pacemaker.

As shown in Table 1, TomoGRAF is vastly superior in quantitative imaging metrics for 1-view volume reconstruction, achieving SSIM and PSNR of 0.79 ± 0.03 and 33.45 ± 0.13 dB, respectively, vs. MedNeRF (SSIM 0.37 ± 0.08, PSNR 7.68 ± 0.10 dB) and X2CT-GAN (SSIM 0.31 ± 0.12, PSNR 14.39 ± 0.19 dB). RMSE is reduced similarly (175.48 ± 10.47 HU for TomoGRAF vs. 321.87 ± 22.87 HU for MedNeRF and 386.69 ± 27.43 HU for X2CT-GAN). Additionally, Fig 4 shows that the SSIM distribution of TomoGRAF is highly left-skewed and leptokurtic in both 1- and 2-view-based volume rendering, with the majority of patients clustering tightly towards the higher end, while the distributions of MedNeRF and X2CT-GAN tend to be normal or moderately right-skewed (values lean towards the lower end).
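The terms "left-skewed" and "leptokurtic" can be checked numerically from per-patient SSIM scores. The sketch below computes sample skewness and excess kurtosis from central moments; the SSIM values are hypothetical and only illustrate a distribution that clusters near the high end with a low-scoring tail.

```python
def skewness_and_excess_kurtosis(xs):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis

# Hypothetical per-patient SSIM scores (not the study data).
ssim_scores = [0.80, 0.81, 0.79, 0.80, 0.82, 0.78, 0.60]

skew, kurt = skewness_and_excess_kurtosis(ssim_scores)
print("skewness:", skew)          # negative: tail extends towards low SSIM
print("excess kurtosis:", kurt)   # positive: sharper peak, heavier tail than normal
```

A negative skewness corresponds to the left-skewed shape described for TomoGRAF, and a positive excess kurtosis corresponds to a leptokurtic distribution.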

Table 2 shows the 3D reconstruction performance with and without 3D supervision using 1, 2, 5, and 10 views as input. Using more views improved both volume and projection inference performance, whereas training with 3D GT markedly boosted performance in 3D volume rendering only. Fig 5 shows line profile comparisons for varying numbers of input views. With 1 view, TomoGRAF recovered major structures but missed fine details, which were better preserved with more X-ray views.

Table 2. Statistical results of the TomoGRAF ablation study evaluated on the test set. 1/2/5/10 views represent the number of views used as reference for reconstruction. Training with and without 3D CT supervision is also compared. ↑ indicates that higher values are better, and ↓ that lower values are better. SSIM and PSNR are calculated with images normalized to the [0,1] scale, and RMSE is calculated in Hounsfield units (HU).

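| Target | Reference Views | 3D Supervised Training | SSIM↑ | PSNR (dB)↑ | RMSE (HU)↓ | Inference Time (s)↓ |
|---|---|---|---|---|---|---|
| CT Volume | 1 | Y | 0.79 ± 0.03 | 33.45 ± 0.13 | 175.48 ± 10.47 | 344.25 ± 10.32 |
| CT Volume | 1 | N | 0.66 ± 0.05 | 26.76 ± 0.16 | 197.47 ± 11.24 | – |
| CT Volume | 2 | Y | 0.85 ± 0.04 | 35.89 ± 0.13 | 146.73 ± 9.63 | 719.46 ± 26.78 |
| CT Volume | 2 | N | 0.69 ± 0.06 | 29.87 ± 0.19 | 168.35 ± 10.21 | – |
| CT Volume | 5 | Y | 0.88 ± 0.03 | 37.23 ± 0.13 | 138.45 ± 9.12 | 987.35 ± 37.89 |
| CT Volume | 5 | N | 0.72 ± 0.04 | 30.15 ± 0.18 | 147.56 ± 9.79 | – |
| CT Volume | 10 | Y | 0.93 ± 0.01 | 39.98 ± 0.11 | 127.68 ± 8.78 | 1238.81 ± 46.72 |
| CT Volume | 10 | N | 0.75 ± 0.01 | 31.86 ± 0.18 | 138.98 ± 9.54 | – |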
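The effect of 3D supervision and the cost of additional views can be read off Table 2 with simple arithmetic. The sketch below uses only the mean values reported in the table (uncertainties are ignored), and the helper names are illustrative.

```python
# Mean SSIM and inference time for CT volumes, copied from Table 2.
ssim_with_3d = {1: 0.79, 2: 0.85, 5: 0.88, 10: 0.93}
ssim_without_3d = {1: 0.66, 2: 0.69, 5: 0.72, 10: 0.75}
inference_s = {1: 344.25, 2: 719.46, 5: 987.35, 10: 1238.81}

def ssim_gain(views):
    """Absolute SSIM gain attributable to 3D supervised training."""
    return round(ssim_with_3d[views] - ssim_without_3d[views], 2)

def seconds_per_view(views):
    """Mean inference time divided by the number of reference views."""
    return inference_s[views] / views

for v in (1, 2, 5, 10):
    print(v, "views: SSIM gain", ssim_gain(v),
          "| s per view", round(seconds_per_view(v), 2))
```

On these means, the SSIM gain from 3D supervision stays between 0.13 and 0.18 across view counts, while the inference time per view decreases as more views are used.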

Fig 5. Line profile comparison of the two patients shown in Fig 3. The images visualized in a and c are GT slices downsampled to 128 × 128 to align with the prediction resolution. a, Indication of the location of the profile in patient 1 in coronal view across the lung tumor. b, Line profiles of 3D images rendered by TomoGRAF with 1, 2, 5, and 10-view inputs for patient 1. c, Indication of the location of the profile in patient 2 across the lung tumor in sagittal view. d, Line profiles of 3D images rendered by TomoGRAF with 1, 2, 5, and 10-view inputs for patient 2.

4. Discussion

This paper presents a GAN-embedded NeRF generator (TomoGRAF) for volumetric CT rendering from ultra-sparse X-ray views. TomoGRAF extends radiance fields into medical imaging reconstruction with a CT imaging-informed ray casting/tracing simulator. Also, TomoGRAF leverages the availability of 3D volumetric information at the training stage to enable an effective generator trained with full volumetric supervision. The robustness of TomoGRAF is demonstrated on an external dataset independent of the training set. TomoGRAF vastly improves 1-view 3D reconstruction performance yet scales well with additional views to accommodate practical balances in image acquisitions and quality requirements.
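The ray casting/tracing simulator referred to above is based on Siddon's algorithm (S1 Appendix), which computes the exact length of a ray inside each voxel it crosses. The following is a simplified 2D sketch of that idea, not the pseudocode from the appendix; the function names, the `mu[ix][iy]` layout, and the unit pixel spacing are assumptions made for illustration.

```python
import math

def siddon_2d(p1, p2, origin, spacing, shape):
    """Return [(ix, iy, length), ...] for the pixels crossed by segment p1 -> p2."""
    (x1, y1), (x2, y2) = p1, p2
    (ox, oy), (dx, dy), (nx, ny) = origin, spacing, shape
    ray_len = math.hypot(x2 - x1, y2 - y1)
    alphas = {0.0, 1.0}
    # Parametric positions where the ray crosses each grid line.
    if x2 != x1:
        for i in range(nx + 1):
            alphas.add((ox + i * dx - x1) / (x2 - x1))
    if y2 != y1:
        for j in range(ny + 1):
            alphas.add((oy + j * dy - y1) / (y2 - y1))
    ordered = sorted(a for a in alphas if 0.0 <= a <= 1.0)
    segments = []
    for a0, a1 in zip(ordered, ordered[1:]):
        mid = 0.5 * (a0 + a1)  # midpoint identifies the pixel for this segment
        ix = math.floor((x1 + mid * (x2 - x1) - ox) / dx)
        iy = math.floor((y1 + mid * (y2 - y1) - oy) / dy)
        if 0 <= ix < nx and 0 <= iy < ny:
            segments.append((ix, iy, (a1 - a0) * ray_len))
    return segments

def line_integral(mu, p1, p2, origin=(0.0, 0.0), spacing=(1.0, 1.0)):
    """Sum attenuation times path length over all pixels on the ray."""
    shape = (len(mu), len(mu[0]))
    return sum(mu[ix][iy] * length
               for ix, iy, length in siddon_2d(p1, p2, origin, spacing, shape))

# Hypothetical 2 x 2 attenuation map, indexed as mu[ix][iy].
mu = [[1.0, 2.0], [3.0, 4.0]]
print(line_integral(mu, (-1.0, 0.5), (3.0, 0.5)))  # horizontal ray through row iy = 0
print(line_integral(mu, (0.0, 0.0), (2.0, 2.0)))   # diagonal ray
```

This version enumerates every grid-line crossing and sorts them, which is easy to verify but not optimized; practical implementations typically restrict the crossings to the entry and exit range of the ray and extend the same scheme to 3D.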

Reconstruction of a 3D CT volume from ultra-sparse angular sampling is an ill-posed and severely underdetermined inverse problem. On the other hand, such 3D reconstruction is practically desirable and widely applicable when a full gantry rotation is prevented by mechanical limitations or the dynamic process of interest is significantly faster than the CT acquisition speed [5,6,9]. Therefore, there has been a consistent effort to reconstruct 3D images with extremely sparse views that circumvent these mechanical and temporal restrictions. Although CS-powered iterative methods and some earlier DL methods were able to reconstruct 3D images with as few as 20 views [18], the resultant image quality was noticeably degraded, and these methods could not meet the challenges of many of the aforementioned practical scenarios, where only one or two views are available for a given anatomical instance. Reconstruction with even sparser views cannot be achieved without stronger priors and statistical learning. DL methods marched further in realizing ultra-sparse sampling (1 view) reconstruction using state-of-the-art (SOTA) networks, two of which are compared in this study.

TomoGRAF is distinctly superior to two SOTA methods for ultra-sparse view CT reconstruction in the following aspects. 1) In comparison to CNN-based networks with an extremely large number of parameters, such as X2CT-GAN [41], the radiance field-based generators train a lighter model with significantly lower computational complexity (X2CT-GAN: ~4.28 GFLOPs vs. TomoGRAF: ~0.9 GFLOPs; FLOPs: floating-point operations) to represent the interaction process between photons and objects, which formalizes a better-defined goal for the network to reach and can effectively reduce the amount of training data required for achieving global robustness. In addition, CNN frameworks generally lack flexibility in view referencing: the number and angle of views at the inference stage must align with the input at the training stage. In addition to single-view referencing, TomoGRAF is capable of incorporating multiple X-ray views from diverse directions, as demonstrated in this study using uniformly distributed acquisition angles. It is worth noting that X2CT-GAN performance in the current study is markedly worse than in the original report [41]. To verify the correctness of our implementation, we tested the X2CT-GAN code on the LIDC-IDRI data with the same data split and arrived at a performance similar to the original report. We believe the sharp decline in performance from internal LIDC-IDRI testing to external in-house testing is due to the differences between the training and testing data. CT images in LIDC-IDRI are cropped to keep only the thoracic organs, while our in-house test data are intact CTs that include the patient's complete chest wall, arms, and neck. Such variation or domain shift is common and expected in practice: patients can vary in size and be set up with different immobilization devices or arm positions. In stark contrast to X2CT-GAN, TomoGRAF is robust to such variation.
2) Compared to MedNeRF, we adapt NeRF with a physically realistic volume rendering mechanism based on the x-ray transportation properties, where photons pass through the body, and the volumetric photon attenuation along the ray path within the body is the focus of reconstruction. In other words, TomoGRAF learns 3D X-ray image formation physics, whereas MedNeRF assumes visible light transportation physics and is intended for 2D manifold learning of object surfaces. The limitation is evident in both low quantitative imaging metrics and orthogonal cuts of MedNeRF reconstructed patients: there is better retention of outer patient contour than internal anatomical details, which are largely lost in MedNeRF images. Additionally, TomoGRAF employs paired 3D CT supervision at the training stage to maximize the prior knowledge exposed to the network, which contributed to the model robustness in volume rendering at the inference stage. As a result, TomoGRAF successfully leverages the efficient object representation capacity of NeRF while overcoming the intrinsic limitations due to its lack of X-ray transportation physics and 3D volume comprehension. To our knowledge, TomoGRAF is the first truly generalizable single-view 3D X-ray reconstruction pipeline robust to substantial domain shifts.
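The difference between X-ray attenuation and visible-light rendering described above can be illustrated with a toy comparison along a single ray. The sketch below contrasts a Beer-Lambert line integral with NeRF-style emission-absorption compositing; it is not TomoGRAF's rendering module, and all sample values, units, and function names are assumptions for illustration.

```python
import math

def xray_intensity(mu, step, i0=1.0):
    """Beer-Lambert law: I = I0 * exp(-sum(mu_i * step)) along the ray."""
    return i0 * math.exp(-sum(m * step for m in mu))

def nerf_color(sigma, color, step):
    """NeRF-style compositing: each sample emits color weighted by its opacity
    and by the transmittance of everything in front of it."""
    transmittance = 1.0
    out = 0.0
    for s, c in zip(sigma, color):
        alpha = 1.0 - math.exp(-s * step)
        out += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return out

# Hypothetical attenuation coefficients (1/mm) sampled every 10 mm along a ray.
mu = [0.0, 0.02, 0.05, 0.02, 0.0]
print(xray_intensity(mu, 10.0))        # depends on every sample along the path
print(xray_intensity(mu[::-1], 10.0))  # identical: the line integral ignores order

# Hypothetical density/color samples: an opaque bright surface in front of a dark one.
sigma, color = [5.0, 0.1], [1.0, 0.0]
print(nerf_color(sigma, color, 1.0))              # dominated by the first surface
print(nerf_color(sigma[::-1], color[::-1], 1.0))  # changes when the order is reversed
```

In the X-ray case, every sample along the path contributes to the detected intensity regardless of depth order, which is why interior anatomy is encoded in a projection. In the visible-light case, the first opaque surface dominates the rendered value.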

At the practical level, TomoGRAF provides a unique solution for applications where only one or a few X-ray views are available, but 3D volumetric information is desired. The applications include image-guided radiotherapy, interventional radiology, and angiography. In image-guided radiotherapy, 2D kV X-rays can be interlaced with MV therapeutic X-ray beams to provide a real-time view of the patient during treatment [54]. However, the 2D projection images do not describe the full 3D anatomy, which is critical for adapting radiotherapy to the real-time patient target and surrounding tissue geometry. Similarly, 4D CT digital subtraction angiography (DSA) better describes the dynamics of the contrast for enhanced diagnosis than single-phase CT DSA [55], but fast helical and flat panel-based 4D-DSA requires repeated scans of the subject, increasing the imaging dose and leading to compromised temporal resolution for intricate vascular structures [56]. TomoGRAF can potentially be used to infer real-time, time-resolved 3D DSA with significantly reduced imaging dose. Dual views with fixed X-ray systems are widely used in radiotherapy for stereotactic localization [38,57,58], but the modality is limited to triangulating bony anatomies or implanted fiducials. TomoGRAF can utilize the same 2D stereotactic views to provide rich 3D anatomies for soft tissue-based registration and localization. Besides mechanical and imaging dose constraints, inexpensive portable 2D X-rays are more readily available for point-of-care and low-resource settings where a CT is impractical. The ability to reconstruct 3D volumes using a single 2D view would markedly increase the imaging information available for clinical decisions. Our results also show that TomoGRAF is flexible in incorporating more views for further improved inference performance and broader applications, including 4D CBCT and tomosynthesis with sparse or limited-angle views.

At the theoretical level, TomoGRAF validates the extremely high data efficiency of neural field representation of 3D voxelized medical images. TomoGRAF, for the first time, materializes high data efficiency, achieving good quality (SSIM = 0.79–0.93) 3D reconstruction of CT images with 1–10 views, which is a major stride in comparison to existing research using NeRF or GRAF. The work thus has significant implications in 3D image acquisition, storage, and processing, which are currently voxel-based. Voxelized 3D representation does not provide intrinsic structural information regarding the relationships among voxels and thus can be expensive to acquire and reconstruct. Previous compressed sensing research explored some of the explicit structural correlations, such as piece-wise smoothness, for reduced data requirements. TomoGRAF indicates a new form of data representation that exploits implicit structural information with higher efficiency than conventional methods or neural networks without encoded physics.

Nevertheless, the current study leaves several areas for future improvement. First, TomoGRAF requires further fine-tuning at the inference stage, which increases the reconstruction time (1-view at 344.25 ± 10.32 s and 2-view at 719.46 ± 26.78 s). The time further increases with inference using more views. Significant acceleration is desired for online procedures such as motion adaptive radiotherapy [59]. Model compression techniques such as network pruning [60] and quantization [61] can decrease computational complexity while maintaining accuracy. Additionally, hardware acceleration via TensorRT [62] optimization or specialized processors (e.g., FPGAs [63], TPUs [64]) could also potentially improve the inference speed. Architecturally, incorporating efficient neural representations (e.g., lightweight MLPs [65] or hash-based encoding [66]) and adaptive sampling [67] methods could reduce computational overhead by prioritizing critical regions. Future work will explore these optimizations to improve TomoGRAF’s feasibility for real-time clinical applications, which would be essential for interventional procedures. Second, TomoGRAF is developed and tested on CT-synthesized DRRs, which differ in image characteristics from kV X-rays obtained using an actual detector due to simplification of the physical projection model, detector dynamic ranges, noise, and pre- and postprocessing [39]. The current model may need to be adapted based on actual X-ray projections. Third, TomoGRAF reconstruction results with 1 view are geometrically correct but lose fine details and CT number accuracy, which is partially mitigated with increasing views up to 10. Therefore, in its current form, TomoGRAF is suited for object detection and localization tasks, but its appropriateness for quantitative tasks such as radiation dose calculation needs to be further studied.
Moreover, the recovery of detail should also improve with reconstruction resolution, which is currently limited to a maximum of 128 × 128 × 128 due to GPU memory constraints. This limitation, however, is expected to be overcome soon with rapidly increasing GPU memory capacity. Additionally, TomoGRAF exhibits moderate interpretability, as its foundation in generative radiance fields aligns with ray-based CT physics, ensuring a degree of physical consistency. The use of 2D DRRs and paired 3D ground truth during training enhances structured learning, while the subpatch-based approach improves generalizability. However, the implicit representation of the NeRF structure poses challenges for direct voxel interpretation. While GAN-based training further introduces a black-box component, inference remains L2-based, reducing the risk of unrealistic features. Sparse-view adaptation further complicates interpretability, as the model’s implicit prior may influence reconstructions in ways distinct from traditional model-based iterative reconstruction methods. Future improvements could include feature sensitivity analyses and latent space visualization [68] to better delineate learned structures from data-driven priors. Lastly, while the current study primarily focuses on demonstrating the technical feasibility of ultra-sparse view reconstruction with the proposed TomoGRAF framework, we recognize that conventional quantitative image quality metrics may not fully capture the clinical utility of reconstructed images. As a future direction, incorporating clinical evaluation, such as qualitative scoring by radiologists or task-based diagnostic assessment, will further inform the real-world applicability and reliability of TomoGRAF in clinical practice.
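The resolution limit mentioned above follows from cubic scaling of the number of rendered voxels. The sketch below only counts voxels and the storage of a single float32 volume; the actual GPU footprint during rendering is dominated by network activations and sampled ray points, which are not modeled here, so the numbers are a lower bound under stated assumptions rather than a measurement.

```python
def voxel_count(n):
    """Number of voxels in an n x n x n volume."""
    return n ** 3

def volume_mib(n, bytes_per_voxel=4):
    """Storage of one volume in MiB, assuming float32 voxels by default."""
    return voxel_count(n) * bytes_per_voxel / 2 ** 20

for n in (128, 256, 512):
    ratio = voxel_count(n) // voxel_count(128)
    print(f"{n}^3: {voxel_count(n)} voxels, {volume_mib(n)} MiB, {ratio}x the 128^3 case")
```

Doubling the resolution per axis multiplies the number of voxels to be queried by eight, so any per-voxel memory cost of the renderer grows by the same factor.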

5. Conclusion

TomoGRAF, a novel GAN-based NeRF generator, is presented in the current work. TomoGRAF is trained on a public dataset and evaluated on 100 in-house lung CTs. TomoGRAF reconstructed good quality 3D images with correct internal anatomies using 1–2 X-ray views, which state-of-the-art DL methods fail to accomplish. TomoGRAF performance further improves with more views. The superior TomoGRAF performance is attributed to novel x-ray physics encoding in the radiance field training and paired 3D CT supervision.

Supporting information

S1 Appendix. Siddon’s Ray Tracing algorithm pseudo code applied in TomoGRAF projection rendering module.

(DOCX)

pone.0330463.s001.docx (14.5KB, docx)

Data Availability

Data cannot be shared publicly because of institutional restriction. Data are available from the UCSF Institutional Data Access / Ethics Committee for researchers who meet the criteria for access to confidential data. Interested researchers should follow the instructions outlined on https://icd.ucsf.edu/materialdata-transfer-agreements. Please contact Industrycontracts@ucsf.edu | (415) 350-5408 for additional assistance.

Funding Statement

The research is supported by NIH R01CA259008, R44CA183390 and R01EB031577.

References

  • 1. Tuy HK. An inversion formula for cone-beam reconstruction. SIAM J Appl Math. 1983;43(3):546–52. doi: 10.1137/0143035
  • 2. Ma C-MC, Paskalev K. In-room CT techniques for image-guided radiation therapy. Med Dosim. 2006;31(1):30–9. doi: 10.1016/j.meddos.2005.12.010
  • 3. Gupta R, Walsh C, Wang IS, Kachelrieß M, Kuntz J, Bartling S. CT-Guided Interventions: Current Practice and Future Directions. In: Jolesz FA, editor. Intraoperative Imaging and Image-Guided Therapy [Internet]. New York, NY: Springer New York; 2014. pp. 173–91. [cited 2024 Feb 19]. Available from: https://link.springer.com/10.1007/978-1-4614-7657-3_12
  • 4. Eschrich SA, Fulp WJ, Pawitan Y, Foekens JA, Smid M, Martens JWM, et al. Validation of a radiosensitivity molecular signature in breast cancer. Clin Cancer Res. 2012;18(18):5134–43. doi: 10.1158/1078-0432.CCR-12-0891
  • 5. Vedantham S, Karellas A, Vijayaraghavan GR, Kopans DB. Digital Breast Tomosynthesis: State of the Art. Radiology. 2015;277(3):663–84. doi: 10.1148/radiol.2015141303
  • 6. Sechopoulos I. A review of breast tomosynthesis. Part II. Image reconstruction, processing and analysis, and advanced applications. Med Phys. 2013;40(1):014302. doi: 10.1118/1.4770281
  • 7. Kim K, Ye JC, Worstell W, Ouyang J, Rakvongthai Y, El Fakhri G, et al. Sparse-view spectral CT reconstruction using spectral patch-based low-rank penalty. IEEE Trans Med Imaging. 2015;34(3):748–60. doi: 10.1109/TMI.2014.2380993
  • 8. Rui X, Cheng L, Long Y, Fu L, Alessio AM, Asma E. Ultra-low dose CT attenuation correction for PET/CT: analysis of sparse view data acquisition and reconstruction algorithms. Phys Med Biol. 2015;60(19):7437–60.
  • 9. Meinel FG, Nikolaou K, Weidenhagen R, Hellbach K, Helck A, Bamberg F, et al. Time-resolved CT angiography in aortic dissection. Eur J Radiol. 2012;81(11):3254–61. doi: 10.1016/j.ejrad.2012.03.006
  • 10. Shieh C-C, Kipritidis J, O’Brien RT, Kuncic Z, Keall PJ. Image quality in thoracic 4D cone-beam CT: a sensitivity analysis of respiratory signal, binning method, reconstruction algorithm, and projection angular spacing. Med Phys. 2014;41(4):041912. doi: 10.1118/1.4868510
  • 11. Feldkamp LA, Davis LC, Kress JW. Practical cone-beam algorithm. J Opt Soc Am A. 1984;1(6):612.
  • 12. Gordon R, Bender R, Herman GT. Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and x-ray photography. J Theor Biol. 1970;29(3):471–81. doi: 10.1016/0022-5193(70)90109-8
  • 13. Andersen AH, Kak AC. Simultaneous algebraic reconstruction technique (SART): a superior implementation of the art algorithm. Ultrason Imaging. 1984;6(1):81–94. doi: 10.1177/016173468400600107
  • 14. Donoho DL. Compressed sensing. IEEE Trans Inform Theory. 2006;52(4):1289–306. doi: 10.1109/tit.2006.871582
  • 15. Sidky EY, Pan X. Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization. Phys Med Biol. 2008;53(17):4777–807.
  • 16. Liu Y, Ma J, Fan Y, Liang Z. Adaptive-weighted total variation minimization for sparse data toward low-dose x-ray computed tomography image reconstruction. Phys Med Biol. 2012;57(23):7923–56. doi: 10.1088/0031-9155/57/23/7923
  • 17. Liu Y, Liang Z, Ma J, Lu H, Wang K, Zhang H, et al. Total variation-stokes strategy for sparse-view X-ray CT image reconstruction. IEEE Trans Med Imaging. 2014;33(3):749–63. doi: 10.1109/TMI.2013.2295738
  • 18. Chen GH, Tang J, Leng S. Prior image constrained compressed sensing (PICCS). In: Oraevsky AA, Wang LV, editors. San Jose, CA; 2008. pp. 685618. [cited 2024 Feb 26]. Available from: http://proceedings.spiedigitallibrary.org/proceeding.aspx?doi=10.1117/12.770532
  • 19. Lu K, He N, Li L. Nonlocal means-based denoising for medical images. Comput Math Methods Med. 2012;2012:438617. doi: 10.1155/2012/438617
  • 20. Chen Y, Shi L, Feng Q, Yang J, Shu H, Luo L, et al. Artifact suppressed dictionary learning for low-dose CT image processing. IEEE Trans Med Imaging. 2014;33(12):2271–92. doi: 10.1109/TMI.2014.2336860
  • 21. Chen Y, Yang Z, Hu Y, Yang G, Zhu Y, Li Y, et al. Thoracic low-dose CT image processing using an artifact suppressed large-scale nonlocal means. Phys Med Biol. 2012;57(9):2667–88. doi: 10.1088/0031-9155/57/9/2667
  • 22. Burger HC, Schuler CJ, Harmeling S. Image denoising: Can plain neural networks compete with BM3D? In: 2012 IEEE Conference on Computer Vision and Pattern Recognition [Internet]. Providence, RI: IEEE; 2012. pp. 2392–9. [cited 2024 Feb 26]. Available from: http://ieeexplore.ieee.org/document/6247952/
  • 23. Burger HC, Schuler CJ, Harmeling S. Image denoising with multi-layer perceptrons, part 1: comparison with existing algorithms and with bounds [Internet]. arXiv; 2012. [cited 2024 Feb 26]. Available from: http://arxiv.org/abs/1211.1544
  • 24. Prakash P, Kalra MK, Kambadakone AK, Pien H, Hsieh J, Blake MA, et al. Reducing Abdominal CT Radiation Dose With Adaptive Statistical Iterative Reconstruction Technique. Investig Radiol. 2010;45(4):202–10.
  • 25. Kang E, Min J, Ye JC. A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction. Med Phys. 2017;44(10):e360-75.
  • 26. Jin KH, McCann MT, Froustey E, Unser M. Deep Convolutional Neural Network for Inverse Problems in Imaging. IEEE Trans Image Process. 2017;26(9):4509–22.
  • 27. Pelt DM, Batenburg KJ. Improving filtered backprojection reconstruction by data-dependent filtering. IEEE Trans Image Process. 2014;23(11):4750–62. doi: 10.1109/TIP.2014.2341971
  • 28. Würfl T, Ghesu FC, Christlein V, Maier A. Deep Learning Computed Tomography. In: Ourselin S, Joskowicz L, Sabuncu MR, Unal G, Wells W, editors. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016 [Internet]. Lecture Notes in Computer Science. vol. 9902. Cham: Springer International Publishing; 2016. pp. 432–40. [cited 2024 Feb 26]. Available from: https://link.springer.com/10.1007/978-3-319-46726-9_50
  • 29. Ma XF, Fukuhara M, Takeda T. Neural network CT image reconstruction method for small amount of projection data. Nucl Instrum Methods Phys Res Section A: Accelerators, Spectrometers Detectors Associated Equipment. 2000;449(1–2):366–77. doi: 10.1016/s0168-9002(99)01453-9
  • 30. Li S, Cao Q, Chen Y, Hu Y, Luo L, Toumoulin C. Dictionary learning based sinogram inpainting for CT sparse reconstruction. Optik. 2014;125(12):2862–7. doi: 10.1016/j.ijleo.2014.01.003
  • 31. Chen Y, Zhang Y, Shu H, Yang J, Luo L, Coatrieux J-L, et al. Structure-Adaptive Fuzzy Estimation for Random-Valued Impulse Noise Suppression. IEEE Trans Circuits Syst Video Technol. 2018;28(2):414–27. doi: 10.1109/tcsvt.2016.2615444
  • 32. Liu J, Ma J, Zhang Y, Chen Y, Yang J, Shu H, et al. Discriminative Feature Representation to Improve Projection Data Inconsistency for Low Dose CT Imaging. IEEE Trans Med Imaging. 2017;36(12):2499–509. doi: 10.1109/TMI.2017.2739841
  • 33. Lee H, Lee J, Cho S. View-interpolation of sparsely sampled sinogram using convolutional neural network. In: Styner MA, Angelini ED, editors. Orlando, Florida, United States; 2017. pp. 1013328. [cited 2024 Feb 26]. Available from: http://proceedings.spiedigitallibrary.org/proceeding.aspx?doi=10.1117/12.2254244
  • 34. Podgorsak AR, Shiraz Bhurwani MM, Ionita CN. CT artifact correction for sparse and truncated projection data using generative adversarial networks. Med Phys. 2021;48(2):615–26. doi: 10.1002/mp.14504
  • 35. Sun J, Li H, Xu Z. Deep ADMM-Net for compressive sensing MRI. Adv Neu Inf Process Syst. 2016;29.
  • 36. Xu Y, Yan H, Ouyang L, Wang J, Zhou L, Cervino L, et al. A method for volumetric imaging in radiotherapy using single x-ray projection. Med Phys. 2015;42(5):2498–509. doi: 10.1118/1.4918577
  • 37. Hrinivich WT, Chernavsky NE, Morcos M, Li T, Wu P, Wong J, et al. Effect of subject motion and gantry rotation speed on image quality and dose delivery in CT-guided radiotherapy. Med Phys. 2022;49(11):6840–55. doi: 10.1002/mp.15877
  • 38. Xu D, Descovich M, Liu H, Lao Y, Gottschalk AR, Sheng K. Deep match: A zero-shot framework for improved fiducial-free respiratory motion tracking. Radiother Oncol. 2024;194:110179. doi: 10.1016/j.radonc.2024.110179
  • 39. Schafer S, Siewerdsen JH. Technology and applications in interventional imaging: 2D X-ray radiography/fluoroscopy and 3D cone-beam CT. Handbook of Medical Image Computing and Computer Assisted Intervention. Elsevier; 2020. pp. 625–71.
  • 40. Shen L, Zhao W, Xing L. Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning. Nat Biomed Eng. 2019;3(11):880–8.
  • 41. Ying X, Guo H, Ma K, Wu J, Weng Z, Zheng Y. X2CT-GAN: Reconstructing CT from Biplanar X-Rays with Generative Adversarial Networks. arXiv. 2019. [cited 2024 Feb 15]. http://arxiv.org/abs/1905.06902
  • 42. Schwarz K, Liao Y, Niemeyer M, Geiger A. GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis. arXiv. 2021. [cited 2024 Feb 21]. http://arxiv.org/abs/2007.02442
  • 43. Shen D, Wu G, Suk H-I. Deep Learning in Medical Image Analysis. Annu Rev Biomed Eng. 2017;19:221–48. doi: 10.1146/annurev-bioeng-071516-044442
  • 44. Mildenhall B, Srinivasan PP, Tancik M, Barron JT, Ramamoorthi R, Ng R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. arXiv. 2020. [cited 2024 Feb 20]. http://arxiv.org/abs/2003.08934
  • 45. Isola P, Zhu JY, Zhou T, Efros AA. Image-to-Image Translation with Conditional Adversarial Networks. 2016. [cited 2023 Sep 4]; Available from: https://arxiv.org/abs/1611.07004
  • 46. Corona-Figueroa A, Frawley J, Taylor SB, Bethapudi S, Shum HPH, Willcocks CG. MedNeRF: Medical Neural Radiance Fields for Reconstructing 3D-aware CT-Projections from a Single X-ray. In: 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) [Internet]. Glasgow, Scotland, United Kingdom: IEEE; 2022. pp. 3843–8. [cited 2024 Feb 15]. Available from: https://ieeexplore.ieee.org/document/9871757/
  • 47. Siddon RL. Calculation of the radiological depth: Technical Reports: Calculation of the radiological depth. Med Phys. 1985;12(1):84–7.
  • 48. Liu B, Zhu Y, Song K, Elgammal A. Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis [Internet]. arXiv; 2021. [cited 2024 Feb 21]. Available from: http://arxiv.org/abs/2101.04775
  • 49. Zhang R, Isola P, Efros AA, Shechtman E, Wang O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric [Internet]. arXiv; 2018. [cited 2024 Feb 21]. Available from: http://arxiv.org/abs/1801.03924
  • 50. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv. 2015. doi: arXiv:1409.1556
  • 51. Tran N-T, Tran V-H, Nguyen N-B, Nguyen T-K, Cheung N-M. On Data Augmentation for GAN Training. IEEE Trans Image Process. 2021;30:1882–97. doi: 10.1109/TIP.2021.3049346
  • 52. Kingma DP, Welling M. Auto-Encoding Variational Bayes [Internet]. arXiv; 2022. [cited 2024 Feb 22]. Available from: http://arxiv.org/abs/1312.6114
  • 53. Armato SG 3rd, McLennan G, Bidaut L, McNitt-Gray MF, Meyer CR, Reeves AP, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med Phys. 2011;38(2):915–31. doi: 10.1118/1.3528204
  • 54. Li R, Mok E, Chang DT, Daly M, Loo BW, Diehn M. Intrafraction verification of gated RapidArc by using beam-level kilovoltage X-ray images. Int J Radiat Oncol Biol Phys. 2012;83(5):e709–715.
  • 55. Haubenreisser H, Bigdeli A, Meyer M, Kremer T, Riester T, Kneser U, et al. From 3D to 4D: Integration of temporal information into CT angiography studies. Eur J Radiol. 2015;84(12):2421–4. doi: 10.1016/j.ejrad.2015.06.014
  • 56. Keil F, Bergkemper A, Birkhold A, Kowarschik M, Tritt S, Berkefeld J. 4D Flat Panel Conebeam CTA for Analysis of the Angioarchitecture of Cerebral AVMs with a Novel Software Prototype. AJNR Am J Neuroradiol. 2022;43(1):102–9. doi: 10.3174/ajnr.A7382
  • 57. Lewis BC, Snyder WJ, Kim S, Kim T. Monitoring frequency of intra-fraction patient motion using the ExacTrac system for LINAC-based SRS treatments. J Appl Clin Med Phys. 2018;19(3):58–63. doi: 10.1002/acm2.12279
  • 58. Kilby W, Dooley JR, Kuduvalli G, Sayeh S, Maurer CR. The CyberKnife Robotic Radiosurgery System in 2010. Technol Cancer Res Treat. 2010;9(5):433–52.
  • 59. Keall PJ, Sawant A, Berbeco RI, Booth JT, Cho B, Cerviño LI, et al. AAPM Task Group 264: The safe clinical implementation of MLC tracking in radiotherapy. Med Phys. 2021;48(5):e44–64. doi: 10.1002/mp.14625
  • 60. Han S, Pool J, Tran J, Dally WJ. Learning both Weights and Connections for Efficient Neural Networks [Internet]. arXiv; 2015. [cited 2025 Mar 17]. Available from: http://arxiv.org/abs/1506.02626
  • 61. Jacob B, Kligys S, Chen B, Zhu M, Tang M, Howard A, et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [Internet]. arXiv; 2017. [cited 2025 Mar 17]. Available from: http://arxiv.org/abs/1712.05877
  • 62. Shafi O, Rai C, Sen R, Ananthanarayanan G. Demystifying TensorRT: Characterizing Neural Network Inference Engine on Nvidia Edge Devices. In: 2021 IEEE International Symposium on Workload Characterization (IISWC) [Internet]. Storrs, CT, USA: IEEE; 2021. pp. 226–37. [cited 2025 Mar 17]. Available from: https://ieeexplore.ieee.org/document/9668285/
  • 63. Nurvitadhi E, Venkatesh G, Sim J, Marr D, Huang R, Ong Gee Hock J, et al. Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey California USA: ACM; 2017. pp. 5–14. doi: 10.1145/3020078.3021740
  • 64. Jouppi NP, Young C, Patil N, Patterson D, Agrawal G, Bajwa R, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture. Toronto ON Canada: ACM; 2017. pp. 1–12. [cited 2025 Mar 17]. doi: 10.1145/3079856.3080246
  • 65. Sitzmann V, Martel JNP, Bergman AW, Lindell DB, Wetzstein G. Implicit Neural Representations with Periodic Activation Functions. arXiv. 2020. [cited 2024 Oct 3]. http://arxiv.org/abs/2006.09661
  • 66. Müller T, Evans A, Schied C, Keller A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans Graph. 2022;41(4):1–15. doi: 10.1145/3528223.3530127
  • 67. Hedman P, Srinivasan PP, Mildenhall B, Barron JT, Debevec P. Baking Neural Radiance Fields for Real-Time View Synthesis [Internet]. arXiv; 2021. [cited 2025 Mar 17]. Available from: http://arxiv.org/abs/2103.14645
  • 68. Patrício C, Neves JC, Teixeira LF. Explainable deep learning methods in medical image classification: a survey. ACM Comput Surv. 2024;56(4):1–41.

Decision Letter 0

Zhentian Wang

13 Jun 2025

PONE-D-25-15335

TomoGRAF: An X-Ray Physics-Driven Generative Radiance Field Framework for Extremely Sparse View CT Reconstruction

PLOS ONE

Dear Dr. Sheng,

Thank you for submitting your manuscript to PLOS ONE. After careful evaluation, the reviewers raised some relevant concerns that need to be addressed first. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 28 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Zhentian Wang, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating in your Funding Statement: The research is supported by NIH R01CA259008, R44CA183390 and R01EB031577. Please provide an amended statement that declares *all* the funding or sources of support (whether external or internal to your organization) received during this study, as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now. Please also include the statement “There was no additional external funding received for this study.” in your updated Funding Statement. Please include your amended Funding Statement within your cover letter. We will change the online submission form on your behalf.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript: The research is supported by NIH R01CA259008, R44CA183390 and R01EB031577. We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: The research is supported by NIH R01CA259008, R44CA183390 and R01EB031577. Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. We note that you have indicated that there are restrictions to data sharing for this study. For studies involving human research participant data or other sensitive data, we encourage authors to share de-identified or anonymized data. However, when data cannot be publicly shared for ethical reasons, we allow authors to make their data sets available upon request. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Before we proceed with your manuscript, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., a Research Ethics Committee or Institutional Review Board, etc.). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of recommended repositories, please see https://journals.plos.org/plosone/s/recommended-repositories. You also have the option of uploading the data as Supporting Information files, but we would recommend depositing data directly to a data repository if possible. Please update your Data Availability statement in the submission form accordingly.

5. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1:  This article describes a machine learning framework to allow for 3D medical image reconstruction from limited angle viewing data. While my background includes experience with NeRF and related techniques, I do not have a medical background. Nonetheless, I enjoyed reading this article, understood its significance, and am pleased to recommend it for publication. I have a few comments that, if properly addressed, will substantially improve the paper:

1) My understanding of the method, as described in Sec. II, relied heavily on reading the background literature related to GRAF, MedNeRF and X2CT-GAN. I highly recommend that the authors consider creating some additional illustrations to complement the architecture shown in Figure 1. In particular, the details of the Generator and Discriminator and the physical meaning of the varying inputs was difficult to follow without directly referencing other papers. I also suspect that Figure 1 has some typos, such as the input to the Generator.

2) Overall, the writing was good and generally easy to read; however, the manuscript would benefit from another revision. There are still several typos and grammatical errors that would benefit from proofreading.

3) Should Eq 16 be RMSE, not SSIM?

Reviewer #2:  The authors present TomoGRAF, an innovative approach leveraging ultra-sparse projections to achieve high-quality 3D CT volume reconstruction, marking a significant advancement in X-ray physics and CT imaging. This work introduces the first universal framework for image-guided radiotherapy and interventional radiology, offering substantial clinical utility. While the manuscript demonstrates considerable promise for publication in PLOS ONE, the authors should address the following points prior to final acceptance.

1. Most NeRF methods rely on 2D supervision, while this paper mentions using 3D CT as the supervision signal, but does not explicitly explain how to ensure the effectiveness of 3D supervision during training. How can 3D supervision improve fidelity? It is suggested to add relevant ablation experiments or analysis to demonstrate the specific impact of 3D supervision on the final reconstruction quality.

2. The paper claims to model X-ray attenuation, but it is unclear whether scattering, beam hardening, or noise were considered. These factors are crucial for real CT simulations.

3. The paper only mentions two data augmentation techniques - random flipping and rotation. Were other data augmentation methods employed, such as noise injection or varying SID/SAD, to improve generalization capability?

4. For the loss functions in Equation (11) and Equation (12), how were the parameters α and β determined? Were ablation experiments conducted to ensure they are optimal values? For the loss function in Equation (13), how were the parameters γ, δ, and ε determined? And how were θ and ϕ determined in the evaluation metrics?

5. In the experiments, why was the peak signal-to-noise ratio (PSNR) = 25 set as the stopping threshold? Why not choose a higher PSNR value?

6. During the experimental process, your model's baseline performance was established using a single AP view. To determine model performance, reconstructions were additionally performed with 1, 2, 5, and 10 views. The view angles were specified as follows: for 1-view reconstruction, the AP view was used as reference; for 2-view reconstruction, AP and lateral views were used for inference; for 5-view reconstruction, starting from the AP view, every 72° rotation was applied to cover the full 360°; for 10-view reconstruction, starting from the AP view, every 36° rotation was applied to cover the full 360°. However, in the Discussion section, the statement "Meanwhile, TomoGRAF, besides 1-view referencing, can leverage additional X-ray views at arbitrary angles" appears to lack strong experimental support, since the experimental procedure in this study clearly defined view angles with uniform angular intervals, which seems inconsistent with "arbitrary angles". We recommend either: (1) adding experiments to demonstrate the value of arbitrary angles (e.g., using only the AP view and its adjacent angles) for TomoGRAF during inference, or (2) rephrasing this statement to better align with the actual experimental conditions.

7. TomoGRAF requires fine-tuning during the inference phase. For specific fine-tuning, the trained prior model is optimized under the supervision of the patient's 2D sparse view projections to adapt to new patient anatomies. Does this mean that during the inference phase, fine-tuning is required for projection images from every angle used? If so, this would lead to increased inference time, and when more views are used for inference, the time would increase accordingly. The paper mentions that single-view reconstruction takes approximately 344 seconds, while each additional view roughly doubles the reconstruction time, which appears relatively slow for practical applications. Can the network be improved to enhance its computational efficiency?

8. In the quantitative analysis, this paper employed SSIM, PSNR, and RMSE metrics, but it did not thoroughly discuss in the qualitative analysis whether these metrics can reflect clinical diagnostic requirements. In clinical applications, the accuracy of CT images is crucial. Have you considered inviting clinicians to evaluate the generated predicted CT images and assess their feasibility?

9. The pseudocode of the Siddon algorithm (Appendix 1) serves as an important methodological supplement. Please further clarify its specific implementation details in TomoGRAF, such as how it integrates with the fully connected network - does the network directly learn attenuation coefficients, or is this achieved through post-processing?

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Comments.docx

pone.0330463.s002.docx (14.9KB, docx)
PLoS One. 2025 Aug 22;20(8):e0330463. doi: 10.1371/journal.pone.0330463.r002

Author response to Decision Letter 1


3 Jul 2025

Q: Most NeRF methods rely on 2D supervision, while this paper mentions using 3D CT as the supervision signal but does not explicitly explain how to ensure the effectiveness of 3D supervision during training. How can 3D supervision improve fidelity? It is suggested to add relevant ablation experiments or analysis to demonstrate the specific impact of 3D supervision on the final reconstruction quality.

A: We thank the reviewer for highlighting the importance of analyzing the effect of 3D supervision. Compared to prior NeRF-based approaches such as MedNeRF, our TomoGRAF framework introduces two major enhancements to improve reconstruction fidelity:

  • a more physically accurate forward model using Siddon’s ray tracing, and

  • the incorporation of 3D supervision from paired CT volumes during training.

These two design choices work in tandem to enhance anatomical consistency, especially under ultra-sparse-view constraints. Since our model ultimately generates 3D volumes at inference time, incorporating 3D supervision during training is both intuitive and effective—it aligns the learning objective with the test-time task and enables the model to learn more robust and structured anatomical priors.

We agree that many existing NeRF-based medical imaging frameworks rely on 2D supervision, primarily because paired 3D data are unavailable for objects viewed under natural light, rather than by design choice. In contrast, TomoGRAF is trained on paired 2D–3D data, exploiting the unique ability of X-ray imaging to visualize complex 3D anatomical structures.

While we agree that an ablation study isolating the impact of 3D supervision would be informative, the current manuscript, which is already lengthy, includes extensive evaluations across various sparse-view configurations. To maintain focus and clarity, we limited the scope to demonstrating the feasibility and effectiveness of the proposed framework. A deeper exploration of supervision strategies and architectural contributions should belong to a separate future study.

Q: The paper claims to model X-ray attenuation, but it is unclear whether scattering, beam hardening, or noise were considered. These factors are crucial for real CT simulations.

A: We thank the reviewer for raising this important point. Our current implementation of the forward model uses Siddon’s ray tracing algorithm, which simulates X-ray attenuation as line integrals through a voxelized volume. This approach assumes a monochromatic source and neglects scattering, beam hardening, and noise. While this simplified model is commonly used in simulation studies due to its computational efficiency and clarity, we agree that these effects are relevant to actual image quality. Future work will incorporate more realistic forward models that simulate polychromatic spectra, scatter, and noise characteristics in sparse reconstruction using actual projection X-ray images. That said, the focus of this study is to evaluate the theoretical potential of TomoGRAF under ideal conditions.
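The idealized forward model described in this answer (monochromatic source, attenuation as a line integral through a voxel grid) can be illustrated with a minimal sketch. The code below is a 2D, pure-Python simplification in the spirit of Siddon's method and is not the authors' implementation. The function names, the grid layout, and the `pixel_size` parameter are assumptions made for illustration.

```python
import math

def siddon_line_integral(mu, src, dst, pixel_size=1.0):
    """Line integral of a 2D attenuation map along the segment src -> dst.

    Assumed layout: mu[j][i] covers x in [i, i+1) * pixel_size and
    y in [j, j+1) * pixel_size, with the grid origin at (0, 0).
    """
    nrows, ncols = len(mu), len(mu[0])
    (x0, y0), (x1, y1) = src, dst
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)

    # Parametric positions (0 = source, 1 = detector) of all grid-line crossings.
    alphas = {0.0, 1.0}
    if dx != 0:
        alphas.update((i * pixel_size - x0) / dx for i in range(ncols + 1))
    if dy != 0:
        alphas.update((j * pixel_size - y0) / dy for j in range(nrows + 1))
    alphas = sorted(a for a in alphas if 0.0 <= a <= 1.0)

    total = 0.0
    for a_lo, a_hi in zip(alphas, alphas[1:]):
        # The midpoint of each segment identifies the pixel it lies in.
        a_mid = 0.5 * (a_lo + a_hi)
        i = math.floor((x0 + a_mid * dx) / pixel_size)
        j = math.floor((y0 + a_mid * dy) / pixel_size)
        if 0 <= i < ncols and 0 <= j < nrows:
            total += mu[j][i] * (a_hi - a_lo) * length  # mu times path length
    return total

def transmitted_fraction(mu, src, dst, pixel_size=1.0):
    """Beer-Lambert transmission I/I0 for a monochromatic, scatter-free beam."""
    return math.exp(-siddon_line_integral(mu, src, dst, pixel_size))

mu_map = [[0.1, 0.2],
          [0.3, 0.4]]
# Horizontal ray through the first row: 1 unit in each of two pixels.
print(siddon_line_integral(mu_map, (-1.0, 0.5), (3.0, 0.5)))
```

Scatter, beam hardening, and noise are absent here by construction, which matches the simplifying assumptions the answer states.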

Q: The paper only mentions two data augmentation techniques - random flipping and rotation. Were other data augmentation methods employed, such as noise injection or varying SID/SAD, to improve generalization capability?

A: We thank the reviewer for this thoughtful question. In this work, we employed random flipping and rotation as basic data augmentation strategies. We did not apply noise injection or vary geometric parameters such as SID/SAD during training, for the following reasons:

SID/SAD Parameters: These values were extracted directly from each patient’s DICOM metadata and therefore naturally vary across the cohort, reflecting realistic acquisition conditions without the need for synthetic augmentation.

Noise Injection: Our training cohort includes over 1,000 patients, each contributing multiple view angles and patches for supervision, offering substantial variability in anatomy, pose, and appearance. Since our method is designed to learn a universal prior during training—rather than rely on patient-specific fine-tuning or enhance the robustness of a CNN inference model—we determined that additional noise injection was unnecessary.

Purpose of the Prior: Since the prior is trained to capture general anatomical structure across patients and is later fine-tuned to sparse-view test data, our focus was on extracting shared structural features rather than simulating acquisition-specific degradations (which are better handled during test-time optimization).

That said, we agree that investigating more aggressive augmentations (including noise) could be a direction for making the prior more robust, especially when adapting TomoGRAF to real clinical settings, which requires institution-specific fine-tuning of the prior. Additional augmentation is unnecessary for the current study, which is based on relatively large and diverse datasets.

Q: For the loss functions in Equation (11) and Equation (12), how were the parameters α and β determined? Were ablation experiments conducted to ensure they are optimal values? For the loss function in Equation (13), how were the parameters γ, δ, and ε determined? And how were θ and φ determined in the evaluation metrics?

A: The parameters α, ϕ, γ, δ, and ξ were not used in Equations (11-12). The parameters ϕ, α and ξ were mentioned and defined on page 12 as “Overall, the generator G takes x-ray source setup matrix K, view direction (pose) ξ=(θ,ϕ), 2D sampling pattern v, shape code z_sh∈R^(M_s) and appearance code z_a∈R^(M_a) as input” (highlighted in green in the manuscript), where ϕ is the elevation angle and θ is the azimuthal angle of the view pose ξ, and the appearance code z_a is a latent vector that encodes view-dependent features and is optimized by the model during training. The parameter δ was mentioned on page 12 as “The output is the material density δ in the corresponding x, where ϑ represents the network parameters and z_sh~p_sh and z_a~p_a with p_sh and p_a drawn from standard Gaussian distribution.” (highlighted in green in the manuscript), where δ represents the material density within the 3D CT volume, which needs to be learnt by the trained model. The parameter γ was defined on page 14 as “Where L_x and L_ξ represent the latent codes of x and ξ, M_sh and M_a define the shape and appearance codes with z_sh∈R^(M_sh) and z_a∈R^(M_a), and γ(∙) represents positional encoding.” (highlighted in green in the manuscript), where γ(∙) represents the positional encoding of the model. We did not use β in Equations (11-13) or anywhere else in the manuscript.
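As an illustration of two quantities named in this answer, the sketch below shows a NeRF-style positional encoding γ(∙) and a conversion of a pose ξ=(θ,ϕ) into a unit direction. Both are assumptions for illustration: the frequency schedule (2^k·π for k = 0..L-1) is the form commonly used in NeRF, and the axis convention for azimuth and elevation is hypothetical. The response letter specifies neither.

```python
import math

def positional_encoding(value, num_bands):
    """NeRF-style encoding of one scalar: sin/cos pairs at frequencies 2^k * pi.

    num_bands plays the role of L_x or L_xi; the exact schedule used in
    TomoGRAF is an assumption here.
    """
    encoded = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        encoded.extend((math.sin(freq * value), math.cos(freq * value)))
    return encoded

def pose_to_unit_vector(theta, phi):
    """Map azimuth theta and elevation phi (radians) to a unit direction.

    Assumed convention: z is the elevation axis, theta is measured in the x-y plane.
    """
    return (math.cos(phi) * math.cos(theta),
            math.cos(phi) * math.sin(theta),
            math.sin(phi))

print(len(positional_encoding(0.5, 4)))   # 2 values per frequency band
print(pose_to_unit_vector(0.0, 0.0))      # direction along +x under this convention
```

Each scalar input expands to 2·L values, which is why the encoded inputs to the network are much wider than the raw coordinates and pose.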

Additionally, the hyperparameters λ_1, λ_2, λ_3, λ_4 and λ_5 control the relative contributions of the different loss terms in Equations (11–13) and of the metric components. These values were chosen empirically, based on standard practice in the literature and on preliminary experiments, to ensure training stability and reasonable convergence.

Q: In the experiments, why was the peak signal-to-noise ratio (PSNR) = 25 set as the stopping threshold? Why not choose a higher PSNR value?

A: We appreciate the reviewer’s question. The stopping threshold of PSNR = 25 (a hyperparameter) was selected based on empirical observations during test-time optimization. In our ultra-sparse view setup (e.g., 1–5 projection views), we found that PSNR values above 25 already corresponded to visually and structurally meaningful reconstructions. Setting a higher threshold (e.g., PSNR ≥ 30) offered marginal improvements while significantly increasing the computational burden and the risk of overfitting to noise or limited view information. Moreover, our goal was not to maximize the PSNR between the model’s rendered projections and the reference sparse views, but rather to perform efficient fine-tuning sufficient for realistic anatomical rendering. We found that PSNR ≈ 25 served as a practical and consistent early stopping criterion across different test cases. Nonetheless, we agree that adaptive or dynamic stopping strategies based on perceptual metrics can be explored in future work for deploying TomoGRAF in clinics.
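The early stopping rule described in this answer can be sketched as follows. This is a minimal illustration, not the authors' code: `fine_tune_until`, `make_toy_step`, the `data_range` of 1.0, and the toy update rule are all hypothetical stand-ins for the real test-time optimization step, which would render projections from the fine-tuned model.

```python
import math

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio between two equally sized flat lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)

def fine_tune_until(step_fn, target, psnr_stop=25.0, max_steps=1000):
    """Call step_fn() until the rendered output reaches psnr_stop or max_steps.

    step_fn is assumed to run one optimization step and return the current
    rendered projection as a flat list of intensities.
    """
    score = -math.inf
    for step in range(1, max_steps + 1):
        score = psnr(step_fn(), target)
        if score >= psnr_stop:
            return step, score
    return max_steps, score

def make_toy_step(target):
    """Toy stand-in for an optimizer: each call halves the remaining error."""
    pred = [0.0] * len(target)
    def step():
        for k, t in enumerate(target):
            pred[k] += 0.5 * (t - pred[k])
        return list(pred)
    return step

reference = [0.2, 0.8, 0.5]
steps, score = fine_tune_until(make_toy_step(reference), reference)
print(steps, round(score, 2))
```

Raising `psnr_stop` in this sketch lengthens the loop, which mirrors the trade-off between threshold and computational cost that the answer describes.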

Q: During the experimental process, your model's baseline performance was established using a single AP view. To determine model performance, reconstructions were additionally performed with 1, 2, 5, and 10 views. The view angles were specified as follows: for 1-view reconstruction, the AP view was used as reference; for 2-view reconstruction, AP and lateral views were used for inference; for 5-view reconstruction, starting from the AP view, every 72° rotation was applied to cover the full 360°; for 10-view reconstruction, starting from the AP view, every 36° rotation was applied to cover the full 360°. However, in the Discussion section, the statement "Meanwhile, TomoGRAF, besides 1-view referencing, can leverage additional X-ray views at arbitrary angles" appears to lack strong experimental support, since the experimental procedure in this study clearly defined view angles with uniform angular intervals, which seems inconsistent with "arbitrary angles". We recommend either: (1) adding experiments to demonstrate the value of arbitrary angles (e.g., using only the AP view and its adjacent angles) for TomoGRAF during inference, or (2) rephrasing this statement to better align with the actual experimental conditions.

A: We thank the reviewer for pointing this out. We have revised our statement in the Discussion to “In addition to single-view referencing, TomoGRAF is capable of incorporating multiple X-ray views from diverse directions, as demonstrated in this study using uniformly distributed acquisition angles.”
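For concreteness, the uniformly distributed view configurations summarized in the question (AP only, AP plus lateral, and 72° or 36° steps over 360°) can be written as a short sketch. The function name and the conventions that AP corresponds to 0° and lateral to 90° are assumptions for illustration only.

```python
def view_angles(num_views):
    """Gantry angles in degrees for the evaluated view counts.

    Assumes AP = 0 degrees and lateral = 90 degrees; the 2-view case is
    special-cased because it is AP plus lateral rather than a uniform split.
    """
    if num_views < 1:
        raise ValueError("num_views must be at least 1")
    if num_views == 2:
        return [0.0, 90.0]
    return [i * 360.0 / num_views for i in range(num_views)]

for n in (1, 2, 5, 10):
    print(n, view_angles(n))
```

Non-uniform ("arbitrary") angle sets would simply be a different list of angles passed to the same reconstruction, but the study reports results only for the uniform cases above.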

Q: TomoGRAF requires fine-tuning during the inference phase. For specific fine-tuning, the trained prior model is optimized under the supervision of the patient's 2D sparse view projections to adapt to new patient anatomies. Does this mean that during the inference phase, fine-tuning is required for projection images from every angle used? If so, this would lead to increased inference time, and when more views are used for inference, the time would increase accordingly. The paper mentions that single-view reconstruction takes approximately 344 seconds, while each additional view roughly doubles the reconstruction time, which appears relatively slow for practical applications. Can the network be improved to enhance its computational efficiency?

A: We appreciate the reviewer’s concern regarding inference efficiency. To clarify, fine-tuning is not performed independently for each projection view. Instead, during test-time optimization, random image patches from all available views are sampled and fed into the model in a unified optimization loop. As the number of input views increases, the model benefits from more diverse supervision, which may slightly increase the number of iterations needed to converge, but this increase is sublinear (as demonstrated in Figure 1 in this response letter) rather than linear. In our experiments (Table 2), we observed that while reconstruction time increases with additional views, the marginal cost per view decreases, given improved convergence behavior.
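The unified sampling scheme described in this answer, in which patches are drawn at random from all available views rather than fine-tuning per view, can be sketched as below. This is an illustration under stated assumptions: `sample_patches` is a hypothetical helper, views are plain nested lists, and patches are dense contiguous crops, which may differ from the 2D sampling pattern the model actually uses.

```python
import random

def sample_patches(views, patch_size, num_patches, rng=None):
    """Draw square patches uniformly at random across all reference views.

    views: list of 2D images (lists of rows). Returns a list of
    (view_index, top, left, patch) tuples forming one mixed batch.
    """
    rng = rng or random.Random()
    batch = []
    for _ in range(num_patches):
        v = rng.randrange(len(views))          # any view may supply the patch
        image = views[v]
        height, width = len(image), len(image[0])
        if patch_size > height or patch_size > width:
            raise ValueError("patch_size exceeds image size")
        top = rng.randrange(height - patch_size + 1)
        left = rng.randrange(width - patch_size + 1)
        patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
        batch.append((v, top, left, patch))
    return batch

# Two toy 8x8 "projections" whose pixel values encode view and position.
toy_views = [[[100 * v + 8 * r + c for c in range(8)] for r in range(8)]
             for v in range(2)]
batch = sample_patches(toy_views, patch_size=4, num_patches=6, rng=random.Random(0))
print([(v, top, left) for v, top, left, _ in batch])
```

Because the batch size per iteration is fixed, adding views changes which images the patches come from, not the cost of a single optimization step; any increase in total time comes from the number of iterations needed to converge.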

We agree that inference speed is a crucial aspect for practical deployment. Therefore, strategies to improve computational efficiency have been thoroughly discussed in the Limitations section of the original manuscript as “First, TomoGRAF requires further fine-tuning at the inference stage, which increases the reconstruction time (1-view at 344.25±10.32 s and 2-view at 719.46±26.78 s). The time further increases with inference using more views. Significant acceleration is desired for online procedures such as motion adaptive radiotherapy (59). Model compression techniques such as network pruning (60) and quantization (61) can decrease computational complexity while maintaining accuracy. Additionally, hardware acceleration via TensorRT (62) optimization or specialized processors (e.g., FPGAs (63), TPUs (64)) could also potentially improve the inference speed. Architecturally, incorporating efficient neural representations (e.g., lightweight MLPs (65) or hash-based encoding (66)) and adaptive sampling (67) methods could reduce computational overhead by prioritizing critical regions. Future work will explore these optimizations to improve TomoGRAF’s feasibility for real-time clinical applications, which would be essential for interventional procedures.” and the content has been highlighted in green in the manuscript.

Figure 1: Relationship between TomoGRAF inference time and number of referenced sparse views.

Q: In the quantitative analysis, this paper employed SSIM, PSNR, and RMSE metrics, but it did not thoroughly discuss in the qualitative analysis whether these metrics can reflect clinical diagnostic requirements. In clinical applications, the accuracy of CT images is crucial. Have you considered inviting clinicians to evaluate the generated predicted CT images and assess their feasibility?

A: We thank the reviewer for highlighting this important consideration. We agree that conventional quantitative metrics such as SSIM, PSNR, and RMSE, while commonly used in the literature, do not fully capture the diagnostic relevance of reconstructed CT images. In this study, our focus was on establishing a technical proof-of-concept for the TomoGRAF framework, and as such, we did not incorporate clinical reader evaluations. That said, we acknowledge the value of involving clinical experts in future evaluations, especially as we move toward applying this framework to real patient data. We added a note in the Discussion section to reflect this important point and outline plans for future clinical validation as “Lastly, while the current study primarily focuses on demonstrating the technical feasibility of ultra-sparse view reconstruction of the proposed TomoGRAF framework, we recognize that conventional quantitative image quality metrics may not fully capture the clinical utility of reconstructed images. As a future direction, incorporating clinical evaluation, such as qualitative scoring by radiologists or task-based diagnostic assessment, will be informative to assess the real-world applicability and reliability of TomoGRAF in clinical practice.”

Q: The pseudocode of the Siddon algorithm (Appendix 1) serves as an important methodological supplement. Please further clarify its specific implementation details in TomoGRAF, such as how it integrates with the fully connected network - does the network directly learn attenuation coefficients, or is this achieved through post-processing?

A: We thank the reviewer for this insightful question. Siddon’s ray tracing algorithm serves as the forward projection operator within the TomoGRAF framework. It replaces the original NeRF ray tracing, which was designed for natural light, and computes line integrals through the reconstructed volume.

Specifically, the fully connected network directly outputs voxel-wise attenuation coefficients that represent the 3D volume. Siddon’s algorithm then integrates these coefficients along rays corresponding to the given projection views to produce synthetic 2D projections. This forward projection step is fully differentiable and embedded within the network training loop, allowing end-to-end optimization. There is no separate post-processing step for attenuation coefficients; the network learns to represent the volume implicitly, and Siddon’s algorithm models the physics of X-ray projection during

Attachment

Submitted filename: Response to Reviewers_Final.docx

pone.0330463.s004.docx (75.3KB, docx)

Decision Letter 1

Zhentian Wang

1 Aug 2025

TomoGRAF: An X-Ray Physics-Driven Generative Radiance Field Framework for Extremely Sparse View CT Reconstruction

PONE-D-25-15335R1

Dear Dr. Sheng,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the 'Update My Information' link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Zhentian Wang, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Both reviewers have confirmed that their comments have been addressed in the revision.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

Acceptance letter

Zhentian Wang

PONE-D-25-15335R1

PLOS ONE

Dear Dr. Sheng,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Zhentian Wang

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Siddon’s Ray Tracing algorithm pseudo code applied in TomoGRAF projection rendering module.

    (DOCX)

    pone.0330463.s001.docx (14.5KB, docx)
    Attachment

    Submitted filename: Comments.docx

    pone.0330463.s002.docx (14.9KB, docx)
    Attachment

    Submitted filename: Response to Reviewers_Final.docx

    pone.0330463.s004.docx (75.3KB, docx)

    Data Availability Statement

    Data cannot be shared publicly because of institutional restriction. Data are available from the UCSF Institutional Data Access / Ethics Committee for researchers who meet the criteria for access to confidential data. Interested researchers should follow the instructions outlined on https://icd.ucsf.edu/materialdata-transfer-agreements. Please contact Industrycontracts@ucsf.edu | (415) 350-5408 for additional assistance.


    Articles from PLOS One are provided here courtesy of PLOS
