Abstract
The ability to reconstruct immersive and realistic three-dimensional scenes plays a fundamental role in advancing virtual reality, digital twins, and related fields. With the rapid development of differentiable rendering frameworks, the reconstruction quality of static scenes has improved significantly. However, we observe that the challenge of insufficient initialization has been largely overlooked in existing studies, which at the same time rely heavily on dense multi-view imagery that is difficult to obtain. To address these challenges, we propose a pipeline for text-driven 3D scene generation that employs panoramic images as an intermediate representation and integrates them with 3D Gaussian Splatting to enhance reconstruction quality and efficiency. Our method introduces an improved point cloud initialization based on Fibonacci lattice sampling of panoramic images, combined with a dense perspective pseudo label strategy for teacher–student distillation supervision, enabling more accurate scene geometry and robust feature learning without requiring explicit multi-view ground truth. Extensive experiments validate the effectiveness of our method, which consistently outperforms state-of-the-art methods across standard reconstruction metrics.
Keywords: 3D reconstruction, panoramic image, point cloud initialization
1. Introduction
With the widespread adoption of virtual reality (VR) devices and content, immersive and realistic 3D reconstruction has become increasingly important. Traditional scanning pipelines can capture accurate geometric structures and spatial relationships but are costly in both time and labor. Recent developments in deep learning have given rise to learning-based reconstruction approaches such as photogrammetry [1,2,3,4], Neural Radiance Fields (NeRFs) [5,6,7,8], and 3D Gaussian Splatting (3DGS) [9,10,11,12,13,14]. Although these methods have achieved impressive results, they typically depend on multi-view observations to build complete spatial understanding, which requires high-quality imaging with multiple or moving cameras and poses a barrier for general users.
Reliable depth estimation is the cornerstone of these reconstruction pipelines because it links image data to 3D geometry. Structure-from-motion (SfM) and multi-view stereo (MVS) rely on feature matching and photometric consistency to estimate dense disparity across views. Despite numerous improvements from learning-based approaches [15,16,17], depth prediction remains unreliable under sparse views, small baselines, or weak textures. Several studies on sparse view depth estimation [18,19,20] attempt to predict depth from limited inputs, but reconstruction quality remains constrained by the lack of parallax and multi-view consistency.
To mitigate dependence on multi-view capture, we adopt panoramas as alternative sources of global scene context. A single panorama provides a comprehensive view of an entire scene within one frame, offering broader spatial coverage and richer contextual cues than conventional perspective images. Recent latent diffusion models, such as MVDiffusion [21] and StitchDiffusion [22], have demonstrated strong capability in generating high fidelity panoramic images from textual descriptions, enabling users to obtain realistic scene priors directly from text. These developments suggest that panoramic representations can potentially replace multi-view inputs for reconstruction by providing rich global context.
Although panoramas provide global coverage, they lack parallax, making depth estimation inherently ambiguous and unstable. Monocular and panoramic depth estimation methods [23,24] partially alleviate this issue, yet they remain susceptible to scale ambiguity and instability in texture-less regions. Within 3DGS, the quality of point cloud initialization, which depends on depth accuracy, strongly affects optimization stability and final reconstruction quality. Existing pipelines often initialize 3DGS using SfM tools such as COLMAP [25] and FlowMap [26]. While these methods recover camera poses and sparse geometry effectively, their performance degrades when texture information is limited. Thus, a robust, panorama-aware initialization is essential for preserving fine Gaussian details and ensuring stable optimization.
To this end, we propose a novel pipeline that generates a 3D Gaussian representation specifically designed for panoramic formats. The method leverages text-to-panorama generation to efficiently recover comprehensive 3D scenes from a single panoramic image, enhancing reconstruction quality by exploiting the wide field of view and rich contextual cues of panoramic imagery. Advanced text generation techniques are first applied to refine prompt quality and image boundaries through adaptive weight smoothing. Multiple overlapping perspective projections are then generated as pseudo labels, which transfer feature knowledge to a student model via knowledge distillation. During initialization, Fibonacci sampling determines camera positions and intrinsic parameters to produce multi-view depth maps, which are transformed into spherical coordinates and logarithmically compressed to reduce depth variation. The optimized depth is remapped into 3D space to form a detailed point cloud, which is subsequently converted into optimizable 3D Gaussian primitives. The continuous Gaussian representation fills depth gaps, while transmittance weighting distinguishes surface from internal Gaussians, preserving crucial structural details.
In summary, our work introduces the following key contributions:
We propose a novel 3D Gaussian representation pipeline that leverages text-to-panorama generation to enhance scene reconstruction performance.
Our method employs Fibonacci sampling for point cloud initialization, which substantially improves reconstruction quality.
Our method employs dense pseudo labels to guide student model learning from the teacher model, and results show consistent improvements over deterministic teachers.
Extensive experiments demonstrate that our method achieves higher quality reconstructions than SOTA approaches.
2. Methods
As illustrated in Figure 1, the proposed framework begins with text-to-panorama generation using a diffusion model. The resulting panoramic images are used for pseudo label generation and point cloud initialization, which are further optimized into 3D Gaussian primitives through differentiable rasterization under loss constraints. This process integrates generative and optimization-based components for high fidelity panoramic reconstruction. Section 2.1 introduces the basic concept of 3DGS, Section 2.2 describes the text-to-panorama generation process, Section 2.3 presents dense pseudo label generation, Section 2.4 details point cloud initialization, and Section 2.5 discusses adaptive Gaussian optimization.
Figure 1.
Overview of the proposed experimental framework, including text-to-panorama generation, pseudo label extraction, point cloud initialization, and adaptive Gaussian optimization.
2.1. Preliminary
3DGS employs Gaussian ellipsoids [27] to effectively parallelize the representation of the scene or object intended for reconstruction. Each Gaussian is parameterized by a center position $\mu$, a covariance matrix $\Sigma$, an opacity $\alpha$, and spherical harmonic (SH) coefficients, and is evaluated around its mean through the Gaussian function:

$$G(x) = \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right) \qquad (1)$$
In order to guarantee the positive semi-definiteness of the covariance matrix, it can be factorized into a scaling matrix $S$ and a rotation matrix $R$. We retain the diagonal vector $s$ of the scaling matrix and a quaternion vector $q$ of the rotation matrix for each Gaussian:

$$\Sigma = R\,S\,S^{\top}R^{\top} \qquad (2)$$
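For illustration, the following minimal sketch (our own, not the authors' released code) builds $\Sigma$ from a stored scale vector $s$ and unit quaternion $q$ exactly as in Equation (2); the helper names are ours.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_from_scaling_rotation(s, q):
    """Sigma = R S S^T R^T, positive semi-definite by construction (Eq. 2)."""
    R = quat_to_rotmat(q)
    S = np.diag(s)
    return R @ S @ S.T @ R.T
```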
During rendering, 3D Gaussians are projected onto a 2D plane, forming ellipses that are approximated as circles by maintaining their center coordinates and radius. This process entails converting the 3D covariance matrix $\Sigma$ to a 2D covariance matrix $\Sigma'$ using a combination of the projective transformation matrix $J$ and the viewing transformation matrix $W$:

$$\Sigma' = J\,W\,\Sigma\,W^{\top}J^{\top} \qquad (3)$$
The 2D plane is partitioned into distinct blocks, with individual Gaussians associated with specific blocks, influencing all pixels within them. When there is overlap between 2D Gaussians, determining the relative depths of each Gaussian is necessary to address occlusion issues from closer Gaussians. Rendering of blocks and pixels occurs independently and simultaneously following a predefined sequence, culminating in color computation via alpha blending:

$$C = \sum_{i \in \mathcal{N}} c_i\,\alpha_i \prod_{j=1}^{i-1}\left(1-\alpha_j\right) \qquad (4)$$
$c_i$ represents the learned color, while the final opacity $\alpha_i$ is the product of the learned opacity $o_i$ and the Gaussian function:

$$\alpha_i = o_i \exp\!\left(-\tfrac{1}{2}\left(x'-\mu_i'\right)^{\top}\Sigma'^{-1}\left(x'-\mu_i'\right)\right) \qquad (5)$$

where $x'$ and $\mu_i'$ represent the coordinates in the projection space.
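The per-pixel compositing of Equations (4) and (5) can be sketched as follows. This is a toy illustration under our own assumptions: the 2D means, covariances, opacities, colors, and depths are assumed to come from the projection step, and the function name is ours.

```python
import numpy as np

def splat_pixel(pixel_xy, means2d, covs2d, opacities, colors, depths):
    """Front-to-back alpha blending (Eqs. 4-5) for one pixel over depth-sorted 2D Gaussians."""
    order = np.argsort(depths)                      # nearer Gaussians are composited first
    color, transmittance = np.zeros(3), 1.0
    for i in order:
        d = pixel_xy - means2d[i]
        g = np.exp(-0.5 * d @ np.linalg.inv(covs2d[i]) @ d)   # 2D Gaussian falloff
        alpha = opacities[i] * g                               # Eq. (5)
        color += transmittance * alpha * colors[i]             # Eq. (4)
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:                               # early termination, as in 3DGS
            break
    return color
```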
2.2. Text to Panoramic Image
We employ a latent denoising strategy that combines sliding window sampling with adaptive weight blending to reduce discontinuities at the left and right boundaries; the entire procedure is detailed in Figure 2. Initially, the noisy latent is divided into nine overlapping patches through sliding window sampling. Next, the text prompt is encoded with a pretrained text encoder, optionally augmented with a LoRA [28] module for refined parameter adjustment. The resultant text embedding is then integrated into the diffusion model at each denoising step, influencing the denoising process of every latent patch; this process can be mathematically expressed as:
$$z_{t-1}^{(k)} = \mathrm{Denoise}_{\theta}\!\left(z_t^{(k)},\, t,\, \tau(y)\right), \qquad k = 1,\dots,9 \qquad (6)$$

where $z_t^{(k)}$ denotes the $k$-th latent patch at denoising step $t$, $\tau(y)$ is the embedding of the text prompt $y$, and $\mathrm{Denoise}_{\theta}$ performs one conditional denoising step of the diffusion model.
Figure 2.
Pipeline of latent panoramic image generation using sliding window sampling and adaptive blending within the diffusion denoising process.
Additionally, an adaptive weight $w$ is employed to blend the initial left and final right patches. Denoising is then also applied to the blended patch to maintain coherence in the border areas; this process is expressed as:
$$\tilde{z}_t = w \cdot z_t^{(1)} + \left(1 - w\right)\cdot z_t^{(9)} \qquad (7)$$
After integrating and weighting the outputs of the diffusion model, the resulting latent representation is passed to the decoder for image generation. To ensure high fidelity, the central region is retained in the final output.
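A schematic of the sliding-window sampling and weighted re-assembly is given below. It is a simplified sketch under our own assumptions: `denoise_fn` stands in for the text-conditioned diffusion U-Net (hypothetical signature), the window width and linear-ramp blending weights are illustrative, and the wrap-around seam of Equation (7) would be blended analogously.

```python
import torch

def denoise_panorama_step(latent, denoise_fn, text_emb, t, num_windows=9, window_w=64):
    """One denoising step over horizontally overlapping latent windows (cf. Eq. 6)."""
    B, C, H, W = latent.shape
    stride = (W - window_w) // (num_windows - 1)
    out = torch.zeros_like(latent)
    weight = torch.zeros(1, 1, 1, W)

    # Triangular weights: patch centers dominate, borders fade out for smooth overlap.
    ramp = torch.linspace(0.1, 1.0, window_w // 2)
    win_w = torch.cat([ramp, ramp.flip(0)]).view(1, 1, 1, -1)

    for i in range(num_windows):
        s = i * stride
        patch = denoise_fn(latent[..., s:s + window_w], t, text_emb)  # per-patch denoising
        out[..., s:s + window_w] += patch * win_w
        weight[..., s:s + window_w] += win_w

    # Weighted average over overlapping windows; the left/right seam (Eq. 7) is handled analogously.
    return out / weight.clamp(min=1e-6)
```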
2.3. Dense Perspective Pseudo Label Generation for Teacher–Student Distillation Supervision
The inherent inconsistency among Gaussian distributions requires individual optimization of their properties using image supervision. However, panoramic images inherently lack explicit multi-view information, limiting their effectiveness in tasks demanding precise geometric comprehension and reconstruction. To overcome this limitation, we introduce a framework for creating dense perspective pseudo labels to facilitate teacher–student distillation supervision.
Given the panoramic image P and a designated camera position, we project it onto N overlapping perspective tangent images. Prior studies suggest that using 20 tangent images sufficiently covers the spherical surface resulting from icosahedral projection. By positioning multiple sets of densely distributed virtual perspective cameras within the spherical domain and applying the same projection method, we generate dense pseudo labels that capture fine scene details.
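A minimal sketch of how one perspective tangent view can be sampled from an equirectangular panorama is shown below. It is our own illustration: it uses nearest-neighbour lookup and an arbitrary rotation convention, whereas the actual pipeline presumably uses an anti-aliased, differentiable projection.

```python
import numpy as np

def pano_to_perspective(pano, yaw, pitch, fov_deg=90.0, out_hw=(512, 512)):
    """Sample a pinhole (tangent) view from an equirectangular panorama of shape (H, W, 3)."""
    H_out, W_out = out_hw
    f = 0.5 * W_out / np.tan(np.radians(fov_deg) / 2.0)
    u, v = np.meshgrid(np.arange(W_out), np.arange(H_out))
    dirs = np.stack([u - W_out / 2, v - H_out / 2, np.full_like(u, f, dtype=float)], -1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate camera rays by yaw (around y) then pitch (around x).
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    dirs = dirs @ (Ry @ Rx).T

    # Direction -> (longitude, latitude) -> equirectangular pixel lookup.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))
    Hp, Wp = pano.shape[:2]
    px = ((lon / (2 * np.pi) + 0.5) * (Wp - 1)).astype(int)
    py = ((lat / np.pi + 0.5) * (Hp - 1)).astype(int)
    return pano[py, px]
```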
To enhance reconstruction, we leverage a powerful pretrained model (MoGe-2) as the teacher and apply knowledge distillation to transfer its acquired knowledge to our student model without requiring actual 3D ground truth data. The teacher output simultaneously serves as a supervised label that guides the student model to acquire crucial feature information from the teacher, enabling optimization through an appropriate loss function. During training, the loss function is defined as the following weighted sum:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{1} + \lambda_2 \mathcal{L}_{\mathrm{SSIM}} + \lambda_3 \mathcal{L}_{\mathrm{corr}} \qquad (8)$$
Specifically:
$$\mathcal{L}_{1} = \frac{1}{N}\sum_{i=1}^{N}\left| y_i - \hat{y}_i \right| \qquad (9)$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $N$ is the total number of elements.
$$\mathcal{L}_{\mathrm{corr}} = 1 - \rho\left(y, \hat{y}\right) \qquad (10)$$

where $\rho$ is the Pearson correlation coefficient.
$$\rho\left(y, \hat{y}\right) = \frac{\mathrm{cov}\left(y, \hat{y}\right)}{\sigma_y\,\sigma_{\hat{y}}} \qquad (11)$$

where $\mathrm{cov}(y, \hat{y})$ is the covariance of $y$ and $\hat{y}$, and $\sigma_y$ and $\sigma_{\hat{y}}$ are the standard deviations of $y$ and $\hat{y}$. The formula for $\mathrm{SSIM}$ is as follows:
$$\mathrm{SSIM}\left(y, \hat{y}\right) = \frac{\left(2\mu_y\mu_{\hat{y}} + C_1\right)\left(2\sigma_{y\hat{y}} + C_2\right)}{\left(\mu_y^2 + \mu_{\hat{y}}^2 + C_1\right)\left(\sigma_y^2 + \sigma_{\hat{y}}^2 + C_2\right)} \qquad (12)$$

with $\mu_y$ and $\mu_{\hat{y}}$ denoting the means, $\sigma_y^2$ and $\sigma_{\hat{y}}^2$ the variances, $\sigma_{y\hat{y}}$ the covariance, and $C_1$ and $C_2$ small constants that stabilize the division.
Intuitively, $\mathcal{L}_{1}$ prioritizes pixel-wise reconstruction accuracy, $\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(y, \hat{y})$ emphasizes structural similarity, and $\mathcal{L}_{\mathrm{corr}}$ enforces the consistency correlation between the predicted and supervised depth maps. Collectively, these components facilitate the generation of reconstructed views that closely mirror the target images in terms of both visual appearance and geometric fidelity.
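A compact PyTorch sketch of the combined objective is shown below. It is our own simplification: the SSIM term is evaluated globally rather than with the usual sliding window, and the loss weights are placeholders rather than the paper's values.

```python
import torch

def pearson_loss(pred, target, eps=1e-8):
    """1 - Pearson correlation between predicted and teacher depth maps (Eqs. 10-11)."""
    p, t = pred.flatten(), target.flatten()
    p, t = p - p.mean(), t - t.mean()
    rho = (p * t).sum() / (p.norm() * t.norm() + eps)
    return 1.0 - rho

def global_ssim(x, y, c1=0.01**2, c2=0.03**2):
    """Eq. (12) evaluated over the whole image; windowed SSIM behaves analogously."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def distillation_loss(pred, teacher, w1=1.0, w2=0.2, w3=0.05):
    """Weighted sum of Eq. (8); the weights here are illustrative placeholders."""
    l1 = torch.abs(pred - teacher).mean()            # Eq. (9)
    l_ssim = 1.0 - global_ssim(pred, teacher)
    return w1 * l1 + w2 * l_ssim + w3 * pearson_loss(pred, teacher)
```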
2.4. Point Cloud Construction
Improved initialization of point clouds leads to more accurate scene geometry and reduces local overfitting during the reconstruction process. We propose a Fibonacci lattice arrangement [29] to obtain crucial information from the panoramic image.
Fibonacci camera placement. Let the golden angle be $\phi_g = \pi\left(3 - \sqrt{5}\right)$. For $N$ views and index $i = 0, \dots, N-1$:

$$z_i = 1 - \frac{2i + 1}{N}, \qquad \theta_i = \arccos\left(z_i\right), \qquad \varphi_i = i\,\phi_g \bmod 2\pi \qquad (13)$$
Uniformity justification. The “latitude” coordinates $z_i$ are chosen as midpoints of $N$ equal subdivisions of $[-1, 1]$, so each latitude band covers the same surface area $4\pi/N$. The azimuths follow the golden angle progression $\varphi_i = i\,\phi_g$, which is known to be equidistributed on the circle. Combining equal-area bands with azimuthal equidistribution yields asymptotic uniformity on the sphere.
Comparative rationale. Under the same $N$, latitude–longitude grids suffer from polar crowding; icosahedral and geodesic meshes include pentagonal defects and mild anisotropy; and spherical Poisson-disk sampling requires iterative generation. In contrast, the spherical Fibonacci set combines equal-area behavior, low discrepancy, and an analytic mapping from index to direction, offering a simple and robust initializer.
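The lattice of Equation (13) translates directly into code; the sketch below is our own illustration and returns $N$ unit camera directions.

```python
import numpy as np

def fibonacci_directions(n):
    """Spherical Fibonacci lattice (Eq. 13): n approximately uniform unit directions."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    i = np.arange(n)
    z = 1.0 - (2.0 * i + 1.0) / n          # midpoints of n equal-area latitude bands
    phi = i * golden_angle                  # golden-angle azimuth progression
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)

# Example: 20 virtual camera directions, the configuration used in the experiments.
cams = fibonacci_directions(20)
```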
Camera intrinsics and calibration matrix. The intrinsic matrix $K$ is computed from the field of view:

$$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \qquad (14)$$
The focal lengths $f_x$ and $f_y$ are calculated from the horizontal and vertical fields of view $\mathrm{FOV}_h$ and $\mathrm{FOV}_v$:

$$f_x = \frac{W}{2\tan\left(\mathrm{FOV}_h / 2\right)}, \qquad f_y = \frac{H}{2\tan\left(\mathrm{FOV}_v / 2\right)} \qquad (15)$$

where $W$ and $H$ denote the image width and height, and $(c_x, c_y)$ is the principal point.
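Equations (14) and (15) map directly to a small helper; the sketch below assumes the principal point lies at the image center.

```python
import numpy as np

def intrinsics_from_fov(width, height, fov_h_deg, fov_v_deg):
    """Build the calibration matrix K of Eqs. (14)-(15) from image size and fields of view."""
    fx = width / (2.0 * np.tan(np.radians(fov_h_deg) / 2.0))    # Eq. (15)
    fy = height / (2.0 * np.tan(np.radians(fov_v_deg) / 2.0))
    cx, cy = width / 2.0, height / 2.0                           # principal point at image center
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])
```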
Panoramic mapping and point cloud formation. After initializing camera directions via Equation (13), spherical UV coordinates are computed for each viewpoint by transforming the depth maps into a spherical coordinate system through a direction-to-spherical-coordinate conversion. To ensure smooth transitions between adjacent perspectives, a logarithmic transformation is applied to compress depth value variations. Gradients of the logarithmic depth map are then computed horizontally and vertically, denoted $G_x^{(i)}$ and $G_y^{(i)}$, with $D_i$ representing the depth map from the $i$-th camera:

$$G_x^{(i)} = \nabla_x \log D_i, \qquad G_y^{(i)} = \nabla_y \log D_i \qquad (16)$$
To further enhance the smoothness, Laplacian regularization is applied, penalizing abrupt depth discontinuities:
$$\mathcal{L}_{\mathrm{lap}} = \sum_{i} \left\| \Delta \log D_i \right\|_2^2 \qquad (17)$$

where $\Delta$ denotes the discrete Laplacian operator.
The refined panoramic depth map is then obtained by solving a least squares optimization that minimizes projection errors across multiple viewpoints while incorporating gradient and Laplacian regularization. The optimized log depth is restored via an inverse logarithmic transformation, yielding the panoramic depth maps corresponding to the generated panoramic images. Finally, the reconstructed depth values are projected into 3D space using spherical UV coordinates, yielding a detailed point cloud representation of the scene, as shown in Figure 3.
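The final unprojection step can be sketched as follows. It is a simplified illustration under our assumptions: the fused equirectangular depth map is taken as already recovered from log space by the regularized least-squares solve described above, and the coordinate convention is our own.

```python
import numpy as np

def pano_depth_to_points(depth, rgb=None):
    """Unproject a fused equirectangular depth map (H, W) into a 3D point cloud."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    lon = (u / (W - 1) - 0.5) * 2.0 * np.pi          # longitude in [-pi, pi]
    lat = (v / (H - 1) - 0.5) * np.pi                # latitude in [-pi/2, pi/2]

    dirs = np.stack([np.cos(lat) * np.sin(lon),      # unit viewing directions
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    points = dirs * depth[..., None]                 # scale each direction by its depth
    if rgb is not None:
        return points.reshape(-1, 3), rgb.reshape(-1, 3)
    return points.reshape(-1, 3)
```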
Figure 3.
Text-to-panorama generation and corresponding depth maps. The top row shows the generated panoramic images, and the bottom row displays the corresponding depth maps.
2.5. Adaptive Gaussian Optimization
After generating the point cloud, the centers of the Gaussian Splatting primitives are initialized from the input points, and their positions and volumes are refined under supervised image guidance. Prior research on reconstructing geometry from 3D Gaussian distributions has primarily relied on strong geometric regularization, typically using opacity as a criterion and discarding Gaussians with very low opacity. However, Gaussians with low contributions often exhibit relatively high opacity, making the default strategy inadequate for maintaining geometric precision, particularly under dense initial distributions. We instead identify Gaussians for pruning by computing their overall contribution as the average across their highest-contributing viewpoints. Surface Gaussians are further distinguished from internal ones by their transmittance properties, ensuring that surface-attached Gaussians are preserved even under low transmittance conditions. Training begins with a relaxation stage that keeps pruning mild, and the pruning strength is gradually increased over time. This strategy retains a larger number of Gaussians during the early stages to support optimization, while progressively removing redundancies in later stages.
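The pruning rule can be summarized by the sketch below. The top-k view count, base threshold, and linear relaxation schedule are illustrative assumptions rather than the paper's exact settings, and the surface mask is assumed to be derived beforehand from transmittance.

```python
import torch

def prune_mask(contributions, surface_mask, step, max_steps,
               base_thresh=0.05, topk=8):
    """Sketch of the adaptive pruning rule; thresholds and schedule are illustrative.

    contributions: (V, G) per-view blending contribution of every Gaussian
    surface_mask:  (G,) bool, surface vs. internal Gaussians derived from transmittance
    """
    # Score each Gaussian by its average contribution over its top-k most favourable views.
    k = min(topk, contributions.shape[0])
    score = contributions.topk(k=k, dim=0).values.mean(dim=0)

    # Relaxation schedule: the threshold ramps up, so pruning is mild early and strict late.
    thresh = base_thresh * (step / max_steps)

    # Prune low-contribution Gaussians, but never those attached to the surface.
    return (score < thresh) & ~surface_mask
```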
3. Experiments
3.1. Implementation Details
In accordance with the proposed pipeline, a dataset comprising eight textual prompts was constructed to comprehensively evaluate the performance of the proposed method across diverse indoor and outdoor scenarios. The prompts included six standard descriptions: “A mountain landscape”, “Waves on the beach”, “A luxury bathroom”, “A bedroom”, “Hulunbuir grassland with blue sky”, and “Beijing city library”. In addition, two more complex scenes were introduced to examine spatial complexity and geometric generalization, namely “An indoor exhibition hall with multiple art installations, glass display cases, large posters on the wall, and spotlights” and “An outdoor city plaza with a large central fountain, stone benches, tiled ground, and modern street lamps surrounded by open space”.
All experiments were conducted using PyTorch 2.4.0 with CUDA 12.4 on a workstation equipped with an NVIDIA RTX 3090 GPU (24 GB memory). Each training session consisted of 10,000 iterations to ensure stable convergence and consistent reconstruction performance.
To comprehensively evaluate our method, we employed three established reconstruction metrics: PSNR, SSIM, and LPIPS. These metrics collectively gauge pixel level accuracy, structural fidelity, and perceptual realism. Leveraging the dense pseudo labels that encompass the entire panoramic scene, we directly compared the reconstructed outputs with these pseudo labels to provide an objective and dependable assessment.
3.2. Comparison with Baselines
Baselines. We compared our method with three representative approaches to text-to-3D generation and panoramic scene reconstruction. LucidDreamer [30] iteratively enhances a single image and its textual prompt to generate multi-view consistent content, progressively expanding the scene to form a holistic view. To ensure a fair comparison, we adapted its pipeline to accept an initial panoramic image as input and integrated our dense pseudo label supervision into its training. DreamScene360 [31] constructs immersive 360° panoramic scenes from textual prompts by projecting generated images into 3D environments. While it preserves global scene coherence, the projection process often leads to local geometric distortions, especially near high-curvature regions. Scene4U [32] introduces a panoramic image-driven framework for immersive 3D scene reconstruction that enhances scene integrity by removing distracting elements. The method generates panoramas with specific spatiotemporal attributes, decomposes them into semantic layers, and refines each layer through inpainting and depth restoration before reconstructing a multi-layered 3D scene using 3DGS. Since the official implementation is unavailable, we reproduced a variant following its multi-layer decomposition principle to ensure consistency within our framework.
Qualitative results. Figure 4 presents visual comparisons with the baseline methods. Our method exhibits sharper textures, cleaner structural boundaries, and fewer rendering artifacts. In outdoor scenes, fine-grained details in vegetation and terrain are preserved while maintaining global structural consistency. In indoor scenes, object contours and furniture edges are preserved without the blurring and blocky distortions seen in the baseline outputs.
Figure 4.
Representative examples, randomly selected from the test set, are shown for visual comparison. The white dashed boxes highlight the differences. The reconstruction quality of our method demonstrates superior recovery of both geometry and texture details.
Quantitative results. Figure 5 summarizes performance across all eight datasets in terms of (a) PSNR, (b) SSIM, and (c) LPIPS; scenes are indexed as Scene1–Scene8, and the mapping to full scene names is provided in Table 1. In outdoor scenes such as “Hulunbuir grassland with blue sky”, our method achieves an improvement of over 5 dB in PSNR. In indoor scenarios such as “A bedroom”, our method achieves the highest SSIM and the lowest LPIPS, indicating better structural similarity and perceptual realism. We attribute these gains to the improved rendering capabilities of Gaussian Splatting and to our initialization strategy, which seeds a denser, more uniformly distributed set of points, enabling accurate recovery of critical details throughout the scene.
Figure 5.
Quantitative comparison: (a) PSNR, (b) SSIM, (c) LPIPS. Results are reported on eight scenes indexed as Scene1–Scene8.
Table 1.
Mapping from scene indices (Scene1–Scene8) to full scene names.
| Scene index | Scene name |
|---|---|
| Scene1 | “A luxury bathroom” |
| Scene2 | “A bedroom” |
| Scene3 | “A mountain landscape” |
| Scene4 | “Hulunbuir grassland with blue sky” |
| Scene5 | “Waves on the beach” |
| Scene6 | “Beijing city library” |
| Scene7 | “An indoor exhibition hall with multiple art installations, glass display cases, large posters on the wall, and spotlights” |
| Scene8 | “An outdoor city plaza with a large central fountain, stone benches, tiled ground, and modern street lamps surrounded by open space” |
3.3. Ablation Study and Analysis
Ablation on point cloud initialization. We conducted ablation studies to evaluate the effectiveness of our point cloud initialization scheme against several mainstream approaches, including BiFuse, Depth Anything V2 [33], VGGT [34], COLMAP, COLMAP (MVS), and FlowMap. Specifically, BiFuse fuses ERP and CubeMap projections through a dual-branch network; Depth Anything V2 leverages large-scale pseudo labeled data for robust monocular depth estimation; VGGT introduces a geometry transformer capable of directly predicting depth and point clouds; COLMAP and COLMAP (MVS) provide classical sparse and dense reconstructions; and FlowMap jointly optimizes depth and camera parameters in a differentiable framework. As summarized in Table 2, our method achieves the best overall performance across all three metrics, with average scores of 42.07 (PSNR), 0.992 (SSIM), and 0.020 (LPIPS).
Table 2.
Ablation study comparing our point cloud initialization method with the latest and most common techniques. The results, averaged across all datasets, demonstrate that our method achieves the best performance.
| Method | Avg. PSNR | Avg. SSIM | Avg. LPIPS |
|---|---|---|---|
| BiFuse | 23.86 | 0.693 | 0.330 |
| Depth Anything V2 | 31.67 | 0.931 | 0.162 |
| VGGT | 35.80 | 0.904 | 0.213 |
| COLMAP | 38.64 | 0.945 | 0.079 |
| COLMAP (MVS) | 37.93 | 0.958 | 0.074 |
| FlowMap | 34.79 | 0.936 | 0.098 |
| Our study | 42.07 | 0.992 | 0.020 |
Generalization experiments on real world panoramic data. To further assess the generalization capability and robustness of the proposed approach in real world environments, a supplementary dataset was collected using a Teche 360 panoramic camera (Teche, Shenyang, China). The first scene was an indoor exhibition hall on the first floor, the second scene was the exterior area of the laboratory building, and the third scene was a public park adjacent to the university library, as illustrated in Figure 6. For each of the three scenes, we conducted comparative experiments to quantitatively assess reconstruction performance. The detailed numerical results are summarized in Table 3, where our method consistently outperforms other methods.
Figure 6.
Real world panoramic images captured by the Teche 360 camera: (R1) indoor exhibition hall, (R2) laboratory exterior, (R3) public park.
Table 3.
Quantitative comparison of reconstruction performance on real world panoramic datasets.
| Model | R1 PSNR | R1 SSIM | R1 LPIPS | R2 PSNR | R2 SSIM | R2 LPIPS | R3 PSNR | R3 SSIM | R3 LPIPS |
|---|---|---|---|---|---|---|---|---|---|
| LucidDreamer | 38.68 | 0.987 | 0.047 | 39.73 | 0.988 | 0.043 | 36.14 | 0.982 | 0.049 |
| DreamScene360 | 31.59 | 0.954 | 0.102 | 29.25 | 0.924 | 0.130 | 35.22 | 0.976 | 0.053 |
| Scene4U | 30.24 | 0.962 | 0.086 | 34.84 | 0.982 | 0.067 | 27.84 | 0.913 | 0.159 |
| Our study | 42.97 | 0.993 | 0.018 | 43.58 | 0.994 | 0.019 | 40.80 | 0.991 | 0.019 |
Teacher model comparison. To validate the choice of MoGe-2 as the teacher model in our distillation framework, we conducted comparative experiments using three alternative teachers: DPT, Metric3D, and VGGT. All training settings and evaluation metrics were kept identical to ensure fairness. As summarized in Table 4, our method distilled from MoGe-2 achieves the highest reconstruction quality, while DPT performs significantly worse, and Metric3D and VGGT yield slightly lower results. This gap demonstrates that the intrinsic errors and generalization ability of the teacher model critically influence pseudo label quality and student performance.
Table 4.
Quantitative comparison of different teacher models in the distillation framework.
| Teacher model | Avg. PSNR | Avg. SSIM | Avg. LPIPS |
|---|---|---|---|
| DPT | 32.93 | 0.954 | 0.128 |
| Metric3D | 41.62 | 0.990 | 0.023 |
| VGGT | 41.67 | 0.991 | 0.022 |
| MoGe-2 (ours) | 42.07 | 0.992 | 0.020 |
Effect of pseudo label quantity. We further conducted an ablation study with different numbers of pseudo labels: 120, 180, 240, 300, and 360. As presented in Table 5, the results exhibit a non-monotonic trend: performance first improves as the number of pseudo labels increases, reaching its highest PSNR and SSIM at 180 samples, and then slightly declines as more pseudo labels are added. With fewer pseudo labels, the student model is trained on a compact set of relatively clean labels, which reduces the influence of outliers and large teacher model errors. As the number of pseudo labels grows, the dataset becomes more diverse and covers a broader range of geometric structures, which in principle enhances generalization, but it also increases training time and memory usage. Overall, these results suggest that the tradeoff between label representativeness and noise accumulation plays a key role in distillation performance. In our experiments, using around 240 samples achieves the best balance between supervision diversity, label reliability, and training efficiency.
Table 5.
Ablation study on the effect of pseudo label quantity.
| Pseudo labels | Avg. PSNR | Avg. SSIM | Avg. LPIPS | Time | GPU Usage |
|---|---|---|---|---|---|
| 120 | 42.12 | 0.992 | 0.019 | 7 min 13 s | 3608.04 MB |
| 180 | 42.37 | 0.993 | 0.017 | 7 min 48 s | 4328.77 MB |
| 240 | 42.07 | 0.992 | 0.020 | 8 min 21 s | 5050.23 MB |
| 300 | 42.03 | 0.992 | 0.019 | 8 min 48 s | 5770.45 MB |
| 360 | 41.44 | 0.991 | 0.020 | 9 min 10 s | 6490.01 MB |
Ablation study on the number of Fibonacci sampling points. To analyze the impact of the number of Fibonacci sampling points, we performed an ablation study evaluating different point counts in terms of the three metrics and reconstruction time. The results are summarized in Table 6.
Table 6.
Comparative quantitative analysis of different point configurations in Fibonacci sampling.
| Sampling points | Avg. PSNR | Avg. SSIM | Avg. LPIPS | Time |
|---|---|---|---|---|
| Icosahedron (20) | 41.67 | 0.990 | 0.022 | 8 min 12 s |
| 15 | 41.58 | 0.990 | 0.023 | 8 min 6 s |
| 20 | 42.07 | 0.992 | 0.020 | 8 min 21 s |
| 25 | 42.24 | 0.991 | 0.022 | 8 min 37 s |
From the results, sampling with 20 points provides an optimal balance between the three metrics and reconstruction time. This choice aligns with common practice in the field, where 20-point sampling is widely adopted; we also used the traditional icosahedron method as a baseline for comparison.
Under identical settings, we compared our optimizer with the traditional baseline, as shown in Table 7. The adaptive optimization achieves higher fidelity in nearly the same runtime, reduces artifacts, and improves consistency without adding significant computational cost.
Table 7.
Quantitative comparison between adaptive optimization and traditional 3DGS.
| Method | Avg. PSNR | Avg. SSIM | Avg. LPIPS | Time |
|---|---|---|---|---|
| Traditional Gaussian Optimization | 41.75 | 0.990 | 0.022 | 8 min 25 s |
| Adaptive Gaussian Optimization | 42.07 | 0.992 | 0.020 | 8 min 21 s |
4. Conclusions
In this work, we introduce a novel and effective framework that leverages the generative power of diffusion models to eliminate the reliance on multi-view imagery. Instead, our approach employs panoramic images as intermediate inputs to achieve globally consistent scene reconstruction. At the core of our design is a Fibonacci lattice-based initialization, which generates uniformly distributed point clouds from panoramic inputs and alleviates localized overfitting during reconstruction. In addition, we propose a dense pseudo label distillation strategy, in which perspective projections derived from panoramas serve as supervisory signals, allowing the student model to inherit both structural and perceptual knowledge from a pretrained teacher model. Extensive experiments demonstrate that our framework consistently outperforms existing methods across diverse evaluation metrics. Future work will focus on higher-resolution reconstruction and dynamic scene modeling. A core challenge in dynamic scenes is handling motion blur, which can obscure fine details; addressing it requires adapting our approach to effectively capture and reconstruct motion across frames, which will be an essential component in the development of immersive applications.
Acknowledgments
The authors would like to thank all the participants in this study.
Author Contributions
Conceptualization, H.Q., L.Y. and J.W.; methodology, H.Q., L.Y., M.S.A. and J.W.; software, H.Q.; validation, H.Q. and L.Y.; formal analysis, H.Q. and M.S.A.; resources, H.Q., L.Y., and J.W.; data curation, H.Q.; writing—original draft preparation, H.Q.; writing—review and editing, H.Q., L.Y. and M.S.A.; supervision, H.Q., L.Y. and J.W.; project administration, H.Q., L.Y. and J.W.; funding acquisition, H.Q., L.Y. and J.W. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Informed consent was obtained from all the subjects involved in this study.
Data Availability Statement
Part of the dataset is available on request from the authors (the data are part of an ongoing project).
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.
Funding Statement
This work is supported by National Natural Science Foundation of China (62161040), Supported by Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region (NJYT22056) and Science and Technology Project of Inner Mongolia Autonomous Region (2023YFSW0006), and Supported by the Fundamental Research Funds for Inner Mongolia University of Science and Technology (2023RCTD029).
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
- 1.Leroy V., Cabon Y., Revaud J. Grounding image matching in 3d with mast3r; Proceedings of the European Conference on Computer Vision; Milan, Italy. 29 September–4 October 2024; pp. 71–91. [Google Scholar]
- 2.Ma Z., Teed Z., Deng J. Multiview stereo with cascaded epipolar raft; Proceedings of the European Conference on Computer Vision; Tel Aviv, Israel. 23–27 October 2022; pp. 734–750. [Google Scholar]
- 3.Wang S., Leroy V., Cabon Y., Chidlovskii B., Revaud J. Dust3r: Geometric 3d vision made easy; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA. 16–22 June 2024; pp. 20697–20709. [Google Scholar]
- 4.Zhang Z., Peng R., Hu Y., Wang R. Geomvsnet: Learning multi-view stereo with geometry perception; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Vancouver, BC, Canada. 17–24 June 2023; pp. 21508–21518. [Google Scholar]
- 5.Martin-Brualla R., Radwan N., Sajjadi M.S.M., Barron J.T., Dosovitskiy A., Duckworth D. Nerf in the wild: Neural radiance fields for unconstrained photo collections; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA. 20–25 June 2021; pp. 7210–7219. [Google Scholar]
- 6.Meng X., Chen W., Yang B. Neat: Learning neural implicit surfaces with arbitrary topologies from multi-view images; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Vancouver, BC, Canada. 17–24 June 2023; pp. 248–258. [Google Scholar]
- 7.Wei Y., Liu S., Rao Y., Zhao W., Lu J., Zhou J. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA. 20–25 June 2021; pp. 5590–5599. [Google Scholar]
- 8.Tancik M., Weber E., Ng E., Li R., Yi B., Wang T., Kristoffersen A., Austin J., Salahi K., Ahuja A., et al. Nerfstudio: A modular framework for neural radiance field development; Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings; Los Angeles, CA, USA. 6–10 August 2023; pp. 1–12. [Google Scholar]
- 9.Yu Z., Chen A., Huang B., Sattler T., Geiger A. Mip-splatting: Alias-free 3d gaussian splatting; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA. 16–22 June 2024; pp. 19447–19456. [Google Scholar]
- 10.He S., Ji P., Yang Y., Wang C., Ji J., Wang Y., Ding H. A survey on 3d gaussian splatting applications: Segmentation, editing, and generation. arXiv. 2025. arXiv:2508.09977. doi: 10.48550/arXiv.2508.09977. [DOI] [Google Scholar]
- 11.Guédon A., Lepetit V. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA. 16–22 June 2024; pp. 5354–5363. [Google Scholar]
- 12.Lin J., Li Z., Tang X., Liu J., Liu S., Liu J., Lu Y., Wu X., Xu S., Yan Y., et al. Vastgaussian: Vast 3d gaussians for large scene reconstruction; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA. 16–22 June 2024; pp. 5166–5175. [Google Scholar]
- 13.Liu Y., Luo C., Fan L., Wang N., Peng J., Zhang Z. Citygaussian: Real-time high-quality large-scale scene rendering with gaussians; Proceedings of the European Conference on Computer Vision; Milan, Italy. 29 September–4 October 2024; pp. 265–282. [Google Scholar]
- 14.Kerbl B., Meuleman A., Kopanas G., Wimmer M., Lanvin A., Drettakis G. A hierarchical 3d gaussian representation for real-time rendering of very large datasets. ACM Trans. Graph. 2024;43:62. doi: 10.1145/3658160. [DOI] [Google Scholar]
- 15.Cao C., Ren X., Fu Y. MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth. arXiv. 2022. arXiv:2208.02541. [Google Scholar]
- 16.Ding Y., Yuan W., Zhu Q., Zhang H., Liu X., Wang Y., Liu X. Transmvsnet: Global context-aware multi-view stereo network with transformers; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA. 18–24 June 2022; pp. 8585–8594. [Google Scholar]
- 17.Wang F., Galliani S., Vogel C., Pollefeys M. Itermvs: Iterative probability estimation for efficient multi-view stereo; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA. 3–7 June 2022; pp. 8606–8615. [Google Scholar]
- 18.Jiang K., Fu Y., Varma T.M., Belhe Y., Wang X., Su H., Ramamoorthi R. A construct-optimize approach to sparse view synthesis without camera pose; Proceedings of the ACM SIGGRAPH Conference; Denver, CO, USA. 27 July –1 August 2024; pp. 1–11. [Google Scholar]
- 19.Chen Z., Yang J., Yang H. Pref3r: Pose-free feed-forward 3d gaussian splatting from variable-length image sequence. arXiv. 2024. arXiv:2411.16877. [Google Scholar]
- 20.Tang H., Wang W., Gleize P., Feiszli M. Aden: Adaptive density representations for sparse-view camera pose estimation; Proceedings of the European Conference on Computer Vision; Milan, Italy. 29 September–4 October 2024; pp. 111–128. [Google Scholar]
- 21.Bar-Tal O., Yariv L., Lipman Y., Dekel T. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv. 2023. arXiv:2302.08113. doi: 10.48550/arXiv.2302.08113. [DOI] [Google Scholar]
- 22.Wang H., Xiang X., Fan Y., Xue J.-H. Customizing 360-degree panoramas through text-to-image diffusion models; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; Waikoloa, HI, USA. 3–8 January 2024; pp. 4933–4943. [Google Scholar]
- 23.Wang R., Xu S., Dong Y., Deng Y., Xiang J., Lv Z., Sun G., Tong X., Yang J. MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details. arXiv. 2025. arXiv:2507.02546. doi: 10.48550/arXiv.2507.02546. [DOI] [Google Scholar]
- 24.Wang R., Xu S., Dai C., Xiang J., Deng Y., Tong X., Yang J. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA. 11–15 June 2025; pp. 5261–5271. [Google Scholar]
- 25.Schonberger J.L., Frahm J.M. Structure-from-motion revisited; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016. [Google Scholar]
- 26.Smith C., Charatan D., Tewari A., Sitzmann V. Flowmap: High-quality camera poses, intrinsics, and depth via gradient descent; Proceedings of the International Conference on 3D Vision; Davos, Switzerland. 18–21 March 2024; pp. 389–400. [Google Scholar]
- 27.Wu T., Yuan Y.J., Zhang L.X., Yang J., Cao Y.P., Yan L.Q., Gao L. Recent advances in 3d gaussian splatting. Comput. Vis. Media. 2024;10:613–642. doi: 10.1007/s41095-024-0436-y. [DOI] [Google Scholar]
- 28.Hu E.J., Shen Y., Wallis P., Allen-Zhu Z., Li Y., Wang S., Wang L., Chen W. Lora: Low-rank adaptation of large language models. ICLR. 2022;1:3. [Google Scholar]
- 29.Frisch D., Hanebeck U.D. Deterministic gaussian sampling with generalized fibonacci grids; Proceedings of the IEEE 24th International Conference on Information Fusion; Sun City, South Africa. 1–4 November 2021; pp. 1–8. [Google Scholar]
- 30.Chung J., Lee S., Nam H., Lee J., Lee K.M. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv. 2023. arXiv:2311.13384. doi: 10.1109/TVCG.2025.3611489. [DOI] [PubMed] [Google Scholar]
- 31.Zhou S., Fan Z., Xu D., Chang H., Chari P., Bharadwaj T., You S., Wang Z., Kadambi A. Dreamscene360: Unconstrained text-to-3d scene generation with panoramic gaussian splatting; Proceedings of the European Conference on Computer Vision; Milan, Italy. 29 September –4 October 2024; pp. 324–342. [Google Scholar]
- 32.Huang Z., He J., Ye J., Jiang L., Li W., Chen Y., Han T. Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA. 11 –15 June 2025; pp. 26723–26733. [Google Scholar]
- 33.Yang L., Kang B., Huang Z., Zhao Z., Xu X., Feng J., Zhao H. Depth anything v2. Adv. Neural Inf. Process. Syst. 2024;37:21875–21911. [Google Scholar]
- 34.Wang J., Chen M., Karaev N., Vedaldi A., Rupprecht C., Novotny D. Vggt: Visual geometry grounded transformer; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA. 11 –15 June 2025; pp. 5294–5306. [Google Scholar]