Abstract
High-fidelity 4D reconstruction of dynamic scenes is pivotal for immersive simulation yet remains challenging due to the photometric inconsistencies inherent in multi-view sensor arrays. Standard 3D Gaussian Splatting (3DGS) strictly adheres to the brightness constancy assumption, failing to distinguish between intrinsic scene radiance and transient brightness shifts caused by independent auto-exposure (AE), auto-white-balance (AWB), and non-linear ISP processing. This misalignment often forces the optimization process to compensate for spectral discrepancies through incorrect geometric deformation, resulting in severe temporal flickering and spatial floating artifacts. To address these limitations, we present Lumina-4DGS, a robust framework that harmonizes spatiotemporal geometry modeling with a hierarchical exposure compensation strategy. Our approach explicitly decouples photometric variations into two levels: a Global Exposure Affine Module that neutralizes sensor-specific AE/AWB fluctuations and a Multi-Scale Bilateral Grid that residually corrects spatially varying non-linearities, such as vignetting, using luminance-based guidance. Crucially, to prevent these powerful appearance modules from masking geometric flaws, we introduce a novel SSIM-Gated Optimization mechanism. This strategy dynamically gates the gradient flow to the exposure modules based on structural similarity. By ensuring that photometric enhancement is only activated when the underlying geometry is structurally reliable, we effectively prioritize geometric accuracy over photometric overfitting. Extensive experiments validate the quantitative superiority of Lumina-4DGS. On the Waymo Open Dataset, our method achieves a state-of-the-art Full Image PSNR of 31.12 dB while minimizing geometric errors to a Depth RMSE of 1.89 m and Chamfer Distance of 0.215 m. 
Furthermore, on our highly challenging self-collected surround-view dataset featuring severe unconstrained illumination shifts, Lumina-4DGS yields a significant 2.13 dB PSNR improvement over recent driving-scene baselines. These results confirm that our framework achieves photorealistic, exposure-invariant novel view synthesis while maintaining superior geometric consistency across heterogeneous camera inputs.
Keywords: 4D Gaussian splatting, dynamic scene reconstruction, auto-exposure compensation, photometric consistency, novel view synthesis, multi-camera fusion
1. Introduction
High-fidelity 4D reconstruction of dynamic scenes is a cornerstone of next-generation applications in virtual reality (VR), immersive simulation, and autonomous driving [1,2,3,4,5,6,7]. To ensure safety and realism, these systems require not only photorealistic novel view synthesis but also precise 3D geometric modeling of complex, moving environments. While Neural Radiance Fields (NeRFs) [8,9,10,11] have set high standards for rendering quality, their prohibitive computational costs hinder real-time deployment. Recently, 3D Gaussian Splatting (3DGS) [12] has emerged as a paradigm shift, enabling real-time rendering and rapid training by representing scenes with explicit, anisotropic 3D Gaussians. Despite its efficiency, applying 3DGS to real-world driving datasets reveals critical limitations rooted in the complex interplay between photometric inconsistency and geometric fidelity.
A fundamental challenge in reconstructing outdoor driving scenes is the violation of the brightness constancy assumption inherent in standard reconstruction pipelines. In multi-camera sensor arrays, independent auto-exposure (AE) and auto-white-balance (AWB) mechanisms induce significant brightness shifts across viewpoints and timestamps [13,14,15,16]. Furthermore, non-linear ISP effects and spatially varying lighting (e.g., shadows, vignetting) introduce local inconsistencies. As noted in recent studies, standard 3DGS fails to distinguish these transient photometric shifts from intrinsic scene radiance [12,17]. Consequently, the optimization process is forced to “cheat”: it compensates for spectral discrepancies by deforming the underlying geometry or generating “floater” artifacts to match the varying input images [4,18]. This results in severe temporal flickering and, more critically, compromised geometric accuracy, which is unacceptable for downstream tasks like obstacle avoidance or path planning.
To mitigate these issues, prior works have introduced appearance embeddings [2,19,20] or global affine transformations [18] to model exposure changes. While effective for global shifts, these methods struggle to capture high-frequency, spatially variant discrepancies common in dynamic environments. More recent approaches utilize bilateral grids for pixel-wise adjustments [21,22]. However, standard bilateral grids are notoriously difficult to optimize and prone to overfitting, often converging to unstable solutions that disrupt the scene’s structural coherence. Simultaneously, 3DGS itself suffers from a lack of geometric constraints; the discrete and unordered nature of Gaussians often leads to surfaces that are “fuzzy” or poorly aligned with ground truth depth, as they rely solely on photometric loss for supervision.
In this paper, we propose Lumina-4DGS, a robust framework that harmonizes spatiotemporal geometry modeling with a unified, hierarchical exposure compensation strategy. We argue that effective reconstruction requires explicitly decoupling intrinsic scene color from sensor-specific variations, but this decoupling must be strictly constrained to prevent appearance models from eroding geometric integrity. Our method integrates two key innovations. First, we propose a Spatiotemporal Hybrid Exposure Model centered on a Multi-Scale Bilateral Grid [22]. Unlike standard grids that struggle with optimization instability, our multi-scale design bridges the gap between global appearance codes and pixel-wise transformations. By unifying these paradigms, it adaptively captures broad exposure shifts at coarse scales while residually correcting fine-grained non-linearities at fine scales, all while imposing temporal smoothness constraints to prevent flickering. Second, to address the trade-off between photometric correction and geometric stability, we introduce a novel SSIM-Gated Optimization Mechanism. This strategy dynamically gates the gradient flow to the exposure modules based on Structural Similarity, ensuring that photometric enhancement is only activated when the underlying geometry is structurally reliable.
In summary, our contributions are as follows:
- (i) We propose a unified spatiotemporal exposure framework that effectively integrates a Global Exposure Affine Module with the Multi-Scale Bilateral Grid. By enforcing temporal smoothness constraints while decoupling sensor-level shifts from local variations, our framework ensures flicker-free rendering and robust convergence across heterogeneous cameras.
- (ii) We introduce a geometry-aware SSIM-Gated Optimization strategy to address the geometric degradation caused by powerful appearance models. By dynamically regulating the multi-scale grid based on structural similarity, we mitigate the texture-geometry ambiguity, achieving high-fidelity reconstruction without improving photometric scores at the expense of geometric accuracy.
- (iii) We validate our approach through extensive benchmarking on both public datasets (Waymo [23]) and a challenging self-collected driving dataset, showcasing notable improvements in rendering realism, temporal stability, and quantitative geometric metrics compared to existing baselines.
The remainder of this paper is organized as follows: Section 2 reviews the related work in dynamic scene reconstruction and exposure compensation. Section 3 details the methodology of our proposed Lumina-4DGS framework. Section 4 presents the experimental setup, quantitative and qualitative evaluations, and ablation studies. Section 5 discusses the broader implications and limitations of our method, and finally, Section 6 concludes the paper.
2. Related Work
2.1. Dynamic Scene Modeling for Autonomous Driving
Reconstructing dynamic driving environments is critical for simulation and autonomous system validation. Early approaches [4,24] utilized Neural Radiance Fields (NeRF) to model static backgrounds and dynamic objects separately. For instance, MARS [24] employs a modular NeRF framework, while NeuRAD [4] integrates sensor-specific effects like rolling shutter to improve realism. However, NeRF-based methods suffer from slow training and rendering speeds.
Recently, 3D Gaussian Splatting (3DGS) [12] has revolutionized this field with real-time performance. Methods like Street Gaussian [25] and DrivingGaussian [6] leverage explicit 3D Gaussians to represent urban scenes, enabling efficient rendering of dynamic agents. OmniRe [1] further constructs hierarchical scene representations to unify static backgrounds and dynamic entities. Beyond general driving scenarios, recent works have extended 3DGS to specialized downstream tasks. For example, ParkGaussian [26] introduces a slot-aware strategy to enhance reconstruction for parking slot perception in GPS-denied environments, while other research [27] combines 3DGS with adversarial domain adaptation to enable monocular robot navigation via sim-to-real transfer. Despite these advancements in scene representation and task-specific applications, most existing pipelines assume consistent illumination across views. When applied to multi-camera setups in the wild, independent auto-exposure (AE) and auto-white-balance (AWB) mechanisms break this assumption, leading to severe flickering and geometric artifacts [14,19,28]. Our work builds upon these foundations but specifically addresses the photometric inconsistencies inherent in raw sensor data to achieve robust 4D reconstruction.
2.2. Photometric Inconsistency and Appearance Modeling
To handle varying illumination and transient discrepancies (e.g., shadows, exposure shifts), “appearance embeddings” were popularized by NeRF-W [19] and subsequently adopted in 3DGS frameworks like WildGaussians [2] and SWAG [29]. Building on this paradigm, recent advancements have introduced more specialized mechanisms to address complex lighting artifacts. RobustSplat++ [30] identifies that standard Gaussian densification can overfit to transient illumination, proposing a delayed growth strategy combined with robust appearance modeling to decouple structural geometry from lighting disturbances. Similarly, in the context of endoscopic reconstruction, Endo-4DGX [31] tackles extreme low-light and over-exposure conditions by incorporating illumination embeddings with region-aware spatial adjustment modules. While these approaches represent significant progress in handling global style changes or domain-specific exposure challenges, they typically rely on latent codes or specialized training schedules. Consequently, they often lack the direct granularity to explicitly model high-frequency, spatially varying discrepancies such as vignetting or local contrast shifts [32] inherent in large-scale driving datasets.
Bilateral Grids [18,22] have long been established as powerful tools for edge-aware image enhancement. In the realm of neural rendering, recent methodologies [8,18,33] have adopted these grids to model spatially varying photometric effects. However, standard bilateral grids are high-dimensional and notoriously prone to optimization instability or overfitting when lacking sufficient constraints [22]. Distinct from these prior approaches that often rely on the grid to model the full spectrum of appearance changes, we adopt a hierarchical strategy. We limit the bilateral grid to specific local non-linearities while offloading sensor-level shifts to a global affine module. This explicit decoupling addresses the convergence issues inherent in previous grid-based methods, ensuring robust performance even under heterogeneous camera setups.
2.3. Geometric Consistency and Surface Reconstruction
Accurate geometry is pivotal for downstream tasks like obstacle avoidance. However, standard 3DGS is prone to geometric degradation, often representing surfaces as “fuzzy” point clouds or creating floating artifacts to minimize photometric loss [34,35]. Recent efforts like SuGaR [34] and 2DGS [36] attempt to improve geometry by explicitly enforcing surface constraints or employing planar primitives. SuGaR introduces density regularization to extract meshes, while 2DGS flattens Gaussians into disks to resolve geometric ambiguities in ray intersection.
However, these methods predominantly focus on the geometric representation itself, often overlooking the critical impact of photometric inconsistency on geometric convergence. In dynamic scenes with fluctuating exposure, powerful appearance models can inadvertently “explain away” geometric errors—a phenomenon known as texture-geometry ambiguity. For instance, a shadow or exposure shift might be incorrectly modeled as a geometric deformation rather than a lighting change. Unlike prior works that treat geometry and appearance optimization in isolation, we introduce a Geometry-Aware Optimization strategy. By gating the gradient flow based on structural similarity (SSIM), we ensure that photometric enhancements are applied only when the underlying structure is reliable, effectively preventing appearance models from corrupting the scene geometry.
3. Methodology
We present Lumina-4DGS, a robust framework designed to achieve high-fidelity 4D reconstruction from heterogeneous camera inputs. Built upon the foundation of the Dynamic Gaussian Scene Graph [6], our approach addresses the critical limitation of standard scene graph representations: their inability to decouple intrinsic scene radiance from transient, sensor-specific photometric variations (e.g., auto-exposure and white balance shifts).
As illustrated in Figure 1, we adopt a composite Gaussian Scene Graph as the geometric backbone, decomposing the complex environment into Sky, Background, and Dynamic Object nodes. While this graph structure effectively handles scene dynamics, direct optimization against inconsistent observations leads to geometric artifacts. To overcome this, we augment the scene graph rendering pipeline with a Hierarchical Exposure Compensation strategy. This module explicitly models the image formation process by coupling a global sensor-level affine transformation with a local multi-scale bilateral grid. Furthermore, to ensure that these appearance enhancements do not compromise the structural integrity of the scene graph, we introduce a Geometry-Aware SSIM-Gated Optimization strategy, which selectively gates gradients based on geometric reliability.
Figure 1.
Overview of the Lumina-4DGS Framework. The scene is modeled via a composite Gaussian Scene Graph and rendered to produce a raw image (Middle). A Hierarchical Exposure Compensation stage normalizes this rendering using a Global Exposure Module for sensor-level shifts and a Multi-Scale Bilateral Grid for local non-linearities, yielding the final image (Right). Optimization is controlled by an SSIM-Gated Mechanism, which enforces temporal smoothness and dynamically gates gradient flow to ensure appearance enhancements do not compromise geometric structural reliability.
The remainder of this section is organized as follows: Section 3.1 formulates the reconstruction problem. Section 3.2 details our underlying Dynamic Gaussian Scene Graph representation. Section 3.3 introduces the Hierarchical Exposure Compensation mechanism. Finally, Section 3.4 describes the SSIM-gated optimization strategy.
3.1. Preliminaries: 3D Gaussian Splatting
We represent the static scene as a set of 3D Gaussians $\{G_i\}_{i=1}^{N}$. Each Gaussian $G_i$ is defined by a center $\mu_i \in \mathbb{R}^{3}$, a covariance matrix $\Sigma_i$, an opacity $o_i$, and view-dependent color coefficients $c_i$ (Spherical Harmonics). To ensure $\Sigma_i$ remains positive semi-definite during optimization, it is decomposed into a rotation matrix $R$ (parameterized by a quaternion $q$) and a scaling matrix $S$ (parameterized by a vector $s$):

$$\Sigma = R\,S\,S^{\top}R^{\top} \qquad (1)$$
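This decomposition can be sketched in a few lines of NumPy (an illustrative sketch only; the function names `quat_to_rot` and `covariance` are ours, not the paper's):

```python
import numpy as np

def quat_to_rot(q):
    # Unit quaternion (w, x, y, z) -> 3x3 rotation matrix.
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(q, s):
    # Sigma = R S S^T R^T is symmetric positive semi-definite by construction,
    # for any unconstrained values of the quaternion q and scale vector s.
    R = quat_to_rot(q)
    S = np.diag(s)
    return R @ S @ S.T @ R.T
```

For the identity quaternion and scales (1, 2, 3), this yields diag(1, 4, 9); any optimizer step on (q, s) keeps the covariance valid.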
Given a camera with viewing transformation $W$ and projective Jacobian $J$, the 3D covariance is projected onto the 2D image plane as $\Sigma'$:

$$\Sigma' = J\,W\,\Sigma\,W^{\top}J^{\top} \qquad (2)$$

The pixel color $C(\mathbf{p})$ at pixel coordinate $\mathbf{p}$ is computed via volume rendering ($\alpha$-blending). We let $\mathcal{N}$ be the set of sorted Gaussians overlapping the pixel. The rendered color is accumulated as:

$$C(\mathbf{p}) = \sum_{i \in \mathcal{N}} c_i(\mathbf{d})\,\alpha_i \prod_{j=1}^{i-1}\left(1 - \alpha_j\right) \qquad (3)$$

where $\mathbf{d}$ is the viewing direction and $\alpha_i$ is the 2D alpha contribution evaluated at $\mathbf{p}$:

$$\alpha_i = o_i \exp\!\left(-\tfrac{1}{2}\left(\mathbf{p} - \mu'_i\right)^{\top}\left(\Sigma'_i\right)^{-1}\left(\mathbf{p} - \mu'_i\right)\right) \qquad (4)$$
Here, $c_i$ represents the intrinsic scene radiance, which is ideally consistent across views. However, in multi-camera driving datasets, the observed ground-truth images are not a direct reflection of this intrinsic radiance due to independent auto-exposure (AE) and auto-white-balance (AWB) mechanisms. We model the observed image as:

$$I_{\mathrm{obs}}^{c,t} = \Phi_{c,t}\!\left(I_{\mathrm{raw}}\right) \qquad (5)$$

where $\Phi_{c,t}$ represents a complex, non-linear transformation that varies across camera $c$ and timestamp $t$. Standard 3DGS minimizes the photometric error between $I_{\mathrm{raw}}$ and $I_{\mathrm{obs}}^{c,t}$ directly, forcing the Gaussians to bake these transient sensor effects into the geometry, causing floating artifacts. Our goal is to model $\Phi_{c,t}$ explicitly to recover a consistent geometry.
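The front-to-back compositing of Equations (3) and (4) can be sketched as follows (a minimal NumPy illustration of the accumulation loop, not the tile-based CUDA rasterizer used in practice):

```python
import numpy as np

def composite_pixel(colors, alphas):
    # C(p) = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j),
    # accumulated front-to-back with a running transmittance T.
    C = np.zeros(3)
    T = 1.0
    for c, a in zip(colors, alphas):
        C += T * a * np.asarray(c, dtype=float)
        T *= 1.0 - a
    return C
```

A fully opaque front Gaussian (alpha = 1) occludes everything behind it, since the transmittance T drops to zero.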
3.2. Dynamic Gaussian Scene Graph Construction
To scale 3DGS to large-scale, dynamic driving environments, we construct a composite Dynamic Gaussian Scene Graph $\mathcal{G}$. As illustrated in Figure 2, this graph explicitly disentangles the scene into three semantic node types—Sky, Background, and Dynamic Objects—allowing us to incorporate rigorous kinematic constraints and decouple object motion from the static environment.
Figure 2.
Scene Graph Decomposition. We decouple the scene into three primitives based on Perception v1.2 priors: (1) Sky (Blue); (2) Dynamic (Red); (3) Background (Gray). This structure enforces kinematic constraints for illumination-robust reconstruction.
The global scene at time $t$ is composed of the union of these nodes:

$$\mathcal{G}(t) = \mathcal{G}_{\mathrm{bg}} \cup \mathcal{G}_{\mathrm{sky}} \cup \mathcal{G}_{\mathrm{dyn}}(t) \qquad (6)$$

where $\mathcal{G}_{\mathrm{bg}}$ represents the static urban geometry, $\mathcal{G}_{\mathrm{sky}}$ models the far-field environment, and $\mathcal{G}_{\mathrm{dyn}}(t)$ denotes the set of visible dynamic agents at time $t$.
3.2.1. Graph Node Definitions
Sky Node ($\mathcal{G}_{\mathrm{sky}}$): We model the sky using a Far-Field Environment Map representation. To address the infinite depth of the sky, we initialize $\mathcal{G}_{\mathrm{sky}}$ as a set of Gaussians distributed on a large bounding sphere with radius $r_{\mathrm{sky}}$. These Gaussians are translation-invariant relative to the camera, with their appearance dependent solely on the viewing direction $\mathbf{d}$. This handles the high-dynamic-range background without introducing depth artifacts.
Background Node ($\mathcal{G}_{\mathrm{bg}}$): The static urban environment (e.g., roads, buildings, vegetation) is represented by stationary 3D Gaussians in the world frame. Their parameters optimize the time-invariant geometry of the scene, providing a stable geometric backbone.
Dynamic Node ($\mathcal{G}_{\mathrm{dyn}}$): Moving agents (vehicles, pedestrians) are handled via object-centric graphs. Instead of modeling them in world space directly, we maintain a set of canonical Gaussians in a local coordinate system for each object $k$. This allows the model to share geometric features across timestamps.
3.2.2. Rigid and Deformable Object Modeling
To accurately render dynamic agents, we map the canonical Gaussians to world space using timestamp-specific transformations.
Rigid Motion for Vehicles. For rigid objects such as cars, we utilize the tracked 6-DoF pose $\{R_{k,t}, \mathbf{t}_{k,t}\}$ derived from off-the-shelf trackers. We explicitly transform the canonical parameters into world space. The world-space mean $\mu_w^{i}$ and rotation quaternion $q_w^{i}$ for the $i$th Gaussian of object $k$ are computed as:

$$\mu_w^{i} = R_{k,t}\,\mu_c^{i} + \mathbf{t}_{k,t} \qquad (7)$$

$$q_w^{i} = q_{k,t} \otimes q_c^{i} \qquad (8)$$

where $\otimes$ denotes quaternion multiplication and $q_{k,t}$ is the quaternion representation of $R_{k,t}$. This formulation ensures that multi-view consistency is enforced via the object's kinematic trajectory.
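Equations (7) and (8) amount to the following transform (a hedged sketch; the (w, x, y, z) quaternion convention and the function names are our assumptions):

```python
import numpy as np

def quat_mul(q1, q2):
    # Hamilton product q1 ⊗ q2 in (w, x, y, z) convention.
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def canonical_to_world(mu_c, q_c, R_kt, t_kt, q_kt):
    # Map a canonical Gaussian (mean, orientation) through the tracked
    # 6-DoF pose {R_{k,t}, t_{k,t}} of object k at timestamp t.
    mu_w = R_kt @ mu_c + t_kt
    q_w = quat_mul(q_kt, q_c)
    return mu_w, q_w
```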
Deformable Motion for VRUs. Vulnerable Road Users (VRUs) like pedestrians exhibit non-rigid articulation. To handle this, we extend the rigid formulation with a time-dependent deformation field $\mathcal{D}_{\theta}$. We predict coordinate offsets and covariance corrections in the canonical space:

$$\left(\Delta\mu^{i},\,\Delta q^{i},\,\Delta s^{i}\right) = \mathcal{D}_{\theta}\!\left(\mu_c^{i},\,t\right) \qquad (9)$$

The final world-space position is obtained by applying the rigid pose to the deformed Gaussian:

$$\mu_w^{i} = R_{k,t}\left(\mu_c^{i} + \Delta\mu^{i}\right) + \mathbf{t}_{k,t} \qquad (10)$$
This hybrid approach effectively decouples global trajectory from local articulated dynamics.
3.2.3. Graph Composition and Rasterization
At each rendering step, the dynamic scene graph is traversed to generate a unified set of 3D Gaussians in the world coordinate system. We let $\mathcal{T}_{k,t}$ denote the transformation operator mapping local node parameters to world space. The composite scene is constructed as:

$$\mathcal{G}_{\mathrm{world}}(t) = \mathcal{G}_{\mathrm{bg}} \cup \mathcal{G}_{\mathrm{sky}} \cup \bigcup_{k} \mathcal{T}_{k,t}\!\left(\mathcal{G}_{\mathrm{dyn}}^{k}\right) \qquad (11)$$

The rasterizer aggregates these nodes to produce the canonical image:

$$I_{\mathrm{raw}} = \mathrm{Rasterize}\!\left(\mathcal{G}_{\mathrm{world}}(t)\right) \qquad (12)$$

Crucially, $I_{\mathrm{raw}}$ aims to represent the consistent scene radiance before sensor processing. By explicitly separating dynamics from the static background in the graph, we can enforce strict geometric consistency constraints during composition (e.g., preventing dynamic objects from penetrating the static ground plane).
3.3. Hierarchical Exposure Compensation
As formulated in Equation (5), the observed image is contaminated by sensor-specific photometric variations. Direct optimization against these inconsistent observations forces 3D Gaussians to “bake in” transient lighting effects, resulting in “floater” artifacts. To resolve this, we propose a Hierarchical Exposure Compensation mechanism that explicitly models the camera response function (CRF). We decompose the mapping $\Phi_{c,t}$ into a physically motivated two-stage pipeline:

$$\Phi_{c,t} = \Phi_{\mathrm{local}}^{c,t} \circ \Phi_{\mathrm{global}}^{c,t} \qquad (13)$$
This hierarchical design ensures that high-frequency local corrections are only applied after global histogram alignment, preventing the powerful local model from overfitting to global shifts.
3.3.1. Level 1: Global Exposure Affine Module
The primary source of photometric inconsistency in driving scenarios is the automatic adjustment of ISO gain and shutter speed (AE). We model this as a global, channel-wise affine transformation. For each camera $c$ at timestamp $t$, we optimize a learnable gain embedding $g_{c,t} \in \mathbb{R}^{3}$ and a bias embedding $b_{c,t} \in \mathbb{R}^{3}$. The intermediate globally compensated image $I_{\mathrm{global}}$ is computed as:

$$I_{\mathrm{global}} = \exp(g_{c,t}) \odot I_{\mathrm{raw}} + b_{c,t} \qquad (14)$$

where $\odot$ denotes the element-wise Hadamard product. Physical Constraints: Crucially, we apply the exponential function to the gain vector. This enforces a strict positivity constraint ($\exp(g_{c,t}) > 0$), consistent with the physics of photon accumulation, ensuring the adjusted radiance remains valid. This module effectively neutralizes broad histogram shifts and white-balance discrepancies.
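Equation (14) is a one-liner in practice. The sketch below (illustrative shapes and names) shows how the exponential gain guarantees positivity:

```python
import numpy as np

def global_exposure(I_raw, g, b):
    # I_global = exp(g) ⊙ I_raw + b, with per-channel gain g and bias b.
    # exp(g) > 0 for any real g, so the gain can never flip the radiance sign.
    gain = np.exp(np.asarray(g, dtype=float)).reshape(1, 1, 3)
    bias = np.asarray(b, dtype=float).reshape(1, 1, 3)
    return gain * I_raw + bias
```

With g = b = 0 the module is the identity, which is exactly how it is initialized (Section 3.4.3).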
3.3.2. Level 2: Multi-Scale Bilateral Grid
While the global affine module addresses sensor-level shifts, it remains insufficient for spatially heterogeneous artifacts, such as lens vignetting and local tone mapping inconsistencies. To rectify these pixel-wise residuals while preserving high-frequency geometry, we introduce a Multi-Scale Bilateral Grid (Figure 3). Unlike heavy convolutional networks that risk overfitting or blurring textures, the bilateral grid offers an edge-aware, computationally efficient solution for real-time rendering.
Figure 3.
Overview of the Multi-Scale Bilateral Grid. We lift 2D pixels into a 3D bilateral space using spatial coordinates and luminance guidance. Local affine matrices are retrieved via slicing and applied residually to the globally compensated image, effectively correcting spatially variant photometric distortions.
Bilateral Grid Parameterization. We parameterize the local photometric response as a learnable 3D tensor $\mathcal{B} \in \mathbb{R}^{W_g \times H_g \times D_g \times 12}$. The grid dimensions $(W_g, H_g)$ and $D_g$ correspond to the spatial and luminance resolutions, respectively. Each voxel stores a flattened $3 \times 4$ affine transformation matrix, allowing the grid to model complex local color twists rather than simple scalar scaling.
Content-Adaptive Slicing. To enable edge-aware filtering, the correction for any given pixel is conditioned on both its spatial location and its photometric intensity. We first extract a monochromatic guidance map $L(\mathbf{p})$ from the globally aligned image $I_{\mathrm{global}}$:

$$L(\mathbf{p}) = 0.299\,R(\mathbf{p}) + 0.587\,G(\mathbf{p}) + 0.114\,B(\mathbf{p}) \qquad (15)$$

This guidance map lifts the 2D pixel coordinates into a 3D query space $(u, v, L(\mathbf{p}))$, where $(u, v)$ are normalized spatial coordinates and $L(\mathbf{p}) \in [0, 1]$. We then retrieve a pixel-specific affine matrix $A_{\mathbf{p}}$ via a differentiable trilinear interpolation (slicing) operator $\mathcal{S}$:

$$A_{\mathbf{p}} = \mathcal{S}\!\left(\mathcal{B},\,(u, v, L(\mathbf{p}))\right) \qquad (16)$$

Multi-Scale Residual Fusion. Photometric inconsistencies often manifest at varying frequencies: vignetting is globally smooth, whereas tone-mapping artifacts can be sharp. To capture this spectrum, we employ a multi-scale hierarchy with $K$ grid levels of increasing spatial and luminance resolution. The final compensated image is synthesized by accumulating residual corrections:

$$I_{\mathrm{final}}(\mathbf{p}) = I_{\mathrm{global}}(\mathbf{p}) + \sum_{k=1}^{K}\left( A_{\mathbf{p}}^{(k)}\begin{bmatrix} I_{\mathrm{global}}(\mathbf{p}) \\ 1 \end{bmatrix} - I_{\mathrm{global}}(\mathbf{p}) \right) \qquad (17)$$

Here, the affine matrix $A_{\mathbf{p}}^{(k)}$ operates on the homogeneous representation of the pixel color. This residual formulation ensures that the grid focuses solely on local non-linear refinements, maintaining the structural fidelity of the original Gaussian rendering.
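To make the slicing and residual fusion concrete, here is a simplified NumPy sketch. Two deliberate simplifications to note: a nearest-voxel lookup stands in for the differentiable trilinear slice, and the BT.601 luma weights are our assumption for the guidance channel:

```python
import numpy as np

BT601 = np.array([0.299, 0.587, 0.114])  # assumed luma weights for guidance

def slice_apply(I, grid):
    # Apply a bilateral grid of local 3x4 affine color matrices to image I.
    # I: (H, W, 3) in [0, 1]; grid: (Wg, Hg, Dg, 3, 4).
    H, W, _ = I.shape
    Wg, Hg, Dg = grid.shape[:3]
    L = I @ BT601                                 # luminance guidance channel
    out = np.empty_like(I)
    for v in range(H):
        for u in range(W):
            gx = min(int(u / W * Wg), Wg - 1)     # nearest voxel in x
            gy = min(int(v / H * Hg), Hg - 1)     # nearest voxel in y
            gz = min(int(L[v, u] * Dg), Dg - 1)   # nearest voxel in luminance
            A = grid[gx, gy, gz]                  # local 3x4 affine matrix
            out[v, u] = A @ np.append(I[v, u], 1.0)  # homogeneous color
    return out

def multiscale_residual(I_global, grids):
    # Accumulate per-scale residual corrections, in the spirit of Eq. (17).
    I_final = I_global.copy()
    for g in grids:
        I_final += slice_apply(I_global, g) - I_global
    return I_final
```

A grid whose voxels all hold the identity affine [I3 | 0] leaves the image unchanged, which is the initialization described in Section 3.4.3.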
3.4. Optimization Strategy
The complete forward rendering and gated optimization pipeline of Lumina-4DGS, which explicitly integrates independent auto-exposure (AE) and auto-white-balance (AWB) compensation, is summarized in Algorithm 1.
Algorithm 1: Forward Rendering and Gated Optimization in Lumina-4DGS.
3.4.1. Object-Aware SSIM-Gating and the Brightness Constancy Assumption
A fundamental challenge in joint geometry-appearance optimization for 4D reconstruction is the frequent violation of the brightness constancy assumption. In dynamic driving scenes captured with independent auto-exposure (AE), this photometric inconsistency often triggers visual overfitting. When optimized without strict constraints, powerful appearance models (like our Multi-Scale Bilateral Grid) tend to misinterpret inter-frame brightness shifts as geometric density. This texture-geometry ambiguity causes the model to hallucinate textures onto dilated, erroneous geometric boundaries (an artifact of LiDAR voxelization), leading to severe ghosting and floating artifacts around dynamic objects.
To address this and preserve structural integrity while relaxing the brightness constancy requirement, we propose an Object-Aware SSIM-Gated Optimization strategy. Instead of applying a global constraint, our gating mechanism is applied at the object level. Utilizing 2D masks, we calculate the Structural Similarity Index (SSIM) independently for dynamic foreground objects and the static background. The SSIM between the raw geometry rendering and the ground truth serves as a dynamic proxy for geometric boundary reliability. We introduce an object-specific gating mask $M_{\mathrm{obj}}$:

$$M_{\mathrm{obj}} = \mathbb{1}\!\left[\,\mathrm{SSIM}\!\left(I_{\mathrm{raw}}^{\mathrm{obj}},\,I_{\mathrm{gt}}^{\mathrm{obj}}\right) > \tau_{\mathrm{gate}}\,\right] \qquad (18)$$

where $\tau_{\mathrm{gate}}$ is a progressive confidence threshold. The appearance gradients are modulated by this mask: $\nabla_{\mathrm{app}} \leftarrow M_{\mathrm{obj}} \cdot \nabla_{\mathrm{app}}$. This ensures that if the structural-photometric mismatch is too high, the appearance model is frozen to prevent deforming the 4D geometry to minimize RGB loss.
Justification of the Thresholding Strategy ($\tau_{\mathrm{gate}}$): Rather than using a static empirical value, the threshold is linearly annealed from $\tau_{\mathrm{low}} = 0.2$ to $\tau_{\mathrm{high}} = 0.7$ over the first 10,000 iterations. This dynamic curriculum is deeply grounded in the densification behavior of 3D Gaussian Splatting:
Lower Bound (0.2): During the earliest iterations, the initial geometry projected from sparse LiDAR is highly chaotic, yielding an SSIM noise floor near 0.2. This lower bound strictly prevents the bilateral grid from compensating for catastrophic geometric initialization.
Upper Bound (0.7): A $\tau_{\mathrm{gate}}$ of 0.7 indicates that the macro-structures and coarse object boundaries have sufficiently aligned with true image edges, safely overcoming the LiDAR voxel dilation. Beyond this point, residual errors are primarily dominated by sensor-specific illumination shifts, making it a natural point to fully unfreeze the appearance model.
10k-Iteration Annealing: This smoothly synchronizes the appearance unfreezing with the most active geometric densification phase (splitting and cloning) of the 3D Gaussians.
Comparison with Simpler Alternatives: While gradient clipping restricts update magnitude, it does not prevent optimizing in the wrong direction. Conversely, a confidence-weighted loss slows down the learning of both appearance and geometry. Our SSIM gating strictly decouples the two: it entirely masks the appearance gradients ($M_{\mathrm{obj}} = 0$) to prevent hallucination while maintaining full gradient flow for the 3D Gaussians to rapidly correct their geometry against true 2D object boundaries.
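The gating logic reduces to a few lines (a sketch; the per-object SSIM score would come from a standard masked SSIM computation, which we treat as an input here):

```python
import numpy as np

def tau_gate(iteration, lo=0.2, hi=0.7, anneal_iters=10_000):
    # Threshold linearly annealed from 0.2 to 0.7 over the first 10k iterations.
    t = min(iteration / anneal_iters, 1.0)
    return lo + t * (hi - lo)

def gated_appearance_grad(grad, ssim_obj, iteration):
    # Hard gate: zero the appearance gradient while the object's raw-geometry
    # SSIM sits below tau; geometry gradients are never masked.
    mask = 1.0 if ssim_obj > tau_gate(iteration) else 0.0
    return mask * np.asarray(grad, dtype=float)
```

Early in training the bar is low (0.2), so only catastrophic geometry freezes the appearance model; by iteration 10,000 an object must exceed SSIM 0.7 before its exposure parameters receive gradients.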
3.4.2. Spatiotemporal Smoothness Constraints
To mitigate temporal flickering caused by independent per-frame optimization, we enforce smoothness constraints on the exposure parameters. Since auto-exposure (AE) and auto-white-balance (AWB) typically evolve smoothly over time, abrupt changes are penalized. We formulate the temporal smoothness loss by decoupling it into global and local components:

$$\mathcal{L}_{\mathrm{temp}} = \lambda_{g}\left(\left\| g_{t} - g_{t-1} \right\|_{2}^{2} + \left\| b_{t} - b_{t-1} \right\|_{2}^{2}\right) + \lambda_{l}\left\| \mathcal{B}_{t} - \mathcal{B}_{t-1} \right\|_{2}^{2} \qquad (19)$$

where $(g_t, b_t)$ represent the affine gain and bias at time $t$, and $\mathcal{B}_t$ represents the coefficients of the bilateral grid. The first term enforces global exposure continuity, while the second ensures that spatially varying corrections (e.g., vignetting patterns) remain stable.
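Equation (19) penalizes frame-to-frame parameter differences; a minimal sketch over a short sequence (the array shapes are our assumptions):

```python
import numpy as np

def temporal_smoothness(g, b, B, lam_g=1.0, lam_l=1.0):
    # g, b: (T, 3) per-frame gain/bias; B: (T, V) flattened grid coefficients.
    # Sum of squared differences between consecutive timestamps,
    # split into a global (affine) and a local (grid) term.
    d_global = np.sum(np.diff(g, axis=0) ** 2) + np.sum(np.diff(b, axis=0) ** 2)
    d_local = np.sum(np.diff(B, axis=0) ** 2)
    return lam_g * d_global + lam_l * d_local
```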
3.4.3. Initialization and Staged Training Schedule
Given that high-dimensional bilateral grids are prone to overfitting during early optimization, we treat our training schedule as an explicit structural regularizer. The Global Exposure Module is initialized to an identity mapping ($g_{c,t} = \mathbf{0}$, $b_{c,t} = \mathbf{0}$), and all voxels within the Multi-Scale Bilateral Grid output identity affine matrices ($A_v = [\,\mathbb{I}_3 \mid \mathbf{0}\,]$).
We employ a delayed activation strategy. For the first 6000 iterations, the bilateral grid is completely frozen. This forces the optimization to rely solely on the Global Module to absorb massive, image-wide sensor shifts. Once global photometry is stabilized at iteration 6000, the multi-scale grid is unfrozen to learn only minimal, localized residual corrections, empirically eliminating severe color flickering.
3.4.4. Total Objective and Evaluation Rationale
The final training objective combines the standard reconstruction loss with our temporal and structural regularization terms:

$$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{temp}}\,\mathcal{L}_{\mathrm{temp}} + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{reg}} \qquad (20)$$

To prevent the appearance models from generating degenerate color twists, $\mathcal{L}_{\mathrm{reg}}$ penalizes the magnitude of exposure adjustments, forcing them to remain close to the identity mapping:

$$\mathcal{L}_{\mathrm{reg}} = \sum_{v}\left\| A_{v} - \mathbb{I} \right\|_{2}^{2} \qquad (21)$$

where $A_v$ denotes the flattened affine matrix at voxel $v$ and $\mathbb{I}$ is the identity matrix.
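The identity-anchored regularizer of Equation (21) compares each voxel's affine to the identity map [I3 | 0] (a sketch; the grid layout (Wg, Hg, Dg, 3, 4) is our assumption):

```python
import numpy as np

def exposure_regularizer(grid):
    # L_reg = sum_v || A_v - I ||^2, where the identity affine is [I3 | 0]:
    # identity 3x3 color mixing plus a zero additive offset.
    target = np.zeros_like(grid)
    target[..., :, :3] = np.eye(3)
    return float(np.sum((grid - target) ** 2))
```

An identity-initialized grid incurs zero penalty, so the regularizer only activates as the grid drifts away from a pass-through mapping.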
By explicitly modeling the photometric transformation (as detailed in Algorithm 1), our framework ensures that the underlying 4D geometry remains stable even when the brightness constancy assumption fails. Photometric inconsistency in dynamic driving scenes often triggers visual overfitting, where the model generates floating artifacts to compensate for inter-frame brightness shifts. This mismatch ultimately manifests as a degradation in standard metrics such as PSNR, SSIM, and LPIPS. Our subsequent evaluation thus focuses heavily on these photometric parameters, alongside geometric metrics (Depth RMSE and Chamfer Distance), to strictly validate the efficacy of our exposure-decoupled rendering and ensure improvements are grounded in physical structural integrity.
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
To validate the robustness of Lumina-4DGS under heterogeneous illumination conditions, we conduct experiments on two distinct datasets:
1. Waymo Open Dataset [23]: We utilize the official Perception v1.2 release with scene flow labels. Specifically, we evaluate on 12 challenging sequences (sequences 000–005 and 010–015), featuring significant lighting variations under rainy and overcast conditions, totaling approximately 1200 frames.
2. Custom Surround-View Dataset: To evaluate performance in unconstrained in-the-wild scenarios, we collected data using a vehicle-mounted rig of 6 cameras. The configuration is heterogeneous: the front-view and rear-view sensors and the four side-view sensors (front-left/right, rear-left/right) capture at different native resolutions. All cameras operate with fully independent auto-exposure (AE) enabled. The dataset is characterized by rapid inter-frame brightness shifts and extreme dynamic range changes across the 360° field of view. To facilitate geometric evaluation, we utilize accumulated LiDAR point clouds as the absolute ground truth.
4.1.2. Evaluation Metrics
Photometric inconsistency in dynamic driving scenes often triggers visual overfitting, where the model generates floating artifacts (e.g., volumetric fog) to compensate for inter-frame brightness shifts. Such structural-photometric mismatches ultimately manifest as a degradation in standard fidelity metrics. Therefore, we focus on both photometric and geometric parameters to validate the efficacy of our exposure-decoupled rendering.
Photometric Metrics: We report PSNR (↑), SSIM (↑), and LPIPS (↓). These metrics quantify the suppression of noise and flickering caused by rapid auto-exposure (AE) shifts. All metrics are computed between the final compensated rendering and the ground truth sensor images at their native resolutions.
- Geometric Metrics (LiDAR-based): To strictly validate physical correctness and eliminate the aforementioned floating artifacts, we utilize LiDAR point clouds as absolute ground truth:
  - Depth RMSE (↓): Measures the root mean square error between the rendered depth $\hat{D}$ and the projected LiDAR depth $D^{\mathrm{gt}}$:
    $$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{V}|} \sum_{p \in \mathcal{V}} \big( \hat{D}(p) - D^{\mathrm{gt}}(p) \big)^2} \qquad (22)$$
    where $\mathcal{V}$ denotes the set of pixels with valid LiDAR readings.
  - Chamfer Distance (CD) (↓): To assess 3D structural consistency beyond the 2D image plane, we calculate the CD between the reconstructed Gaussian cloud $\mathcal{P}$ and the ground-truth LiDAR cloud $\mathcal{Q}$:
    $$\mathrm{CD}(\mathcal{P}, \mathcal{Q}) = \frac{1}{|\mathcal{P}|} \sum_{x \in \mathcal{P}} \min_{y \in \mathcal{Q}} \| x - y \|_2 + \frac{1}{|\mathcal{Q}|} \sum_{y \in \mathcal{Q}} \min_{x \in \mathcal{P}} \| x - y \|_2 \qquad (23)$$
    This metric explicitly penalizes geometric deformations that do not align with physical measurements.
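Under the definitions above, the per-pixel and point-cloud metrics admit compact NumPy implementations. The brute-force nearest-neighbour Chamfer search below is a sketch suitable only for small clouds (a real evaluation would use a KD-tree or GPU search), and SSIM/LPIPS would come from standard library implementations:

```python
import numpy as np

def psnr(render, gt, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between a rendering and the GT frame."""
    mse = np.mean((np.asarray(render, dtype=np.float64) -
                   np.asarray(gt, dtype=np.float64)) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(data_range ** 2 / mse)

def depth_rmse(depth_render, depth_lidar, valid_mask):
    """Eq. (22): RMSE over the set of pixels with valid LiDAR readings."""
    diff = depth_render[valid_mask] - depth_lidar[valid_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def chamfer_distance(P, Q):
    """Eq. (23): symmetric Chamfer distance between (N,3) and (M,3) clouds."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

Note that `depth_rmse` only scores pixels inside `valid_mask`, mirroring the restriction of Eq. (22) to pixels with valid LiDAR returns.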
4.1.3. Baselines
We compare our method against state-of-the-art view synthesis approaches:
- 3DGS [12]: The vanilla 3D Gaussian Splatting baseline.
- StreetGS [25]: A representative dynamic urban scene reconstruction method based on 3DGS.
- OmniRe [1]: A state-of-the-art framework that constructs hierarchical scene representations to unify static backgrounds and dynamic entities.
- Uni-BG [22]: The recent state-of-the-art method (Wang et al., 2025) that unifies appearance codes and bilateral grids for driving scenes, serving as the direct baseline for our exposure compensation module.
4.1.4. Implementation Details
We implement Lumina-4DGS using PyTorch 2.0.1. Following the staged training schedule described in Section 3.4.3, the bilateral grid remains frozen for the first 6000 iterations and is subsequently unfrozen. The Multi-Scale Bilateral Grid is configured with a fixed spatial resolution and a luminance resolution of 8. The SSIM gating threshold is linearly annealed from 0.2 to 0.7 during the first 10k iterations. We train for 30k iterations on a single NVIDIA RTX 4090 GPU.
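The staged schedule above can be sketched as two small helpers; the function names and the linear ramp shape are illustrative assumptions matching the quoted hyperparameters:

```python
def ssim_gate_threshold(iteration: int, tau_start: float = 0.2,
                        tau_end: float = 0.7, anneal_iters: int = 10_000) -> float:
    """Linearly anneal the SSIM gating threshold over the first 10k iterations,
    then hold it constant at tau_end."""
    t = min(iteration / anneal_iters, 1.0)
    return tau_start + t * (tau_end - tau_start)

def bilateral_grid_trainable(iteration: int, freeze_iters: int = 6_000) -> bool:
    """The bilateral grid stays frozen for the first 6000 iterations (Sec. 3.4.3)."""
    return iteration >= freeze_iters
```

Each training step would query both helpers: the boolean toggles the grid parameters' `requires_grad`, while the annealed threshold feeds the SSIM gate.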
4.2. Comparative Analysis
4.2.1. Quantitative Evaluation on Public Benchmark
Table 1 summarizes the quantitative results on the Waymo Open Dataset. We compare against standard 3DGS, dynamic reconstruction methods (StreetGS [25]), and the recent state-of-the-art frameworks OmniRe [1] and Uni-BG [22]. To highlight the efficacy of our exposure compensation in recovering fine-grained details, we report metrics for the Full Image as well as specific foreground categories: Human and Vehicle.
Table 1.
Quantitative comparison on the Waymo Open Dataset. Box indicates the use of 3D bounding boxes; LiDAR indicates the use of LiDAR for geometric supervision. Our method achieves significant gains in dynamic object fidelity (Human/Vehicle) by resolving exposure-induced artifacts. Best results are bolded.
| Methods | Box | LiDAR | Full PSNR ↑ | Full SSIM ↑ | Human PSNR ↑ | Human SSIM ↑ | Vehicle PSNR ↑ | Vehicle SSIM ↑ | RMSE ↓ | CD ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| 3DGS [12] | - | - | 26.00 | 0.912 | 16.88 | 0.414 | 16.18 | 0.425 | 2.80 | 0.415 |
| StreetGS [25] | ✓ | ✓ | 29.08 | 0.936 | 16.83 | 0.420 | 27.73 | 0.880 | 2.20 | 0.274 |
| OmniRe [1] | ✓ | ✓ | 29.85 | 0.938 | 24.10 | 0.710 | 28.20 | 0.875 | 2.05 | 0.245 |
| Uni-BG [22] | ✓ | ✓ | 30.96 | 0.941 | 23.45 | 0.650 | 26.10 | 0.795 | 2.15 | 0.228 |
| Lumina-4DGS (Ours) | ✓ | ✓ | **31.12** | **0.956** | **28.15** | **0.845** | **28.91** | **0.892** | **1.89** | **0.215** |
Our method, Lumina-4DGS, outperforms all baselines across both full-frame and category-specific metrics. Notably, while Uni-BG [22] also utilizes multi-scale bilateral grids for appearance modeling, our hierarchical approach achieves a superior Full Image PSNR of 31.12 dB.
The most significant advantage is observed in the Human and Vehicle categories. Dynamic objects are frequently captured under rapidly changing relative illumination as they move; our hierarchical module effectively stabilizes these fluctuations. For instance, on the Human category, our method reaches 28.15 dB, which is a significant improvement over OmniRe (24.10 dB) and Uni-BG (23.45 dB). This demonstrates that our hierarchical decoupling prevents the model from misinterpreting exposure shifts as geometric blur on dynamic subjects.
Furthermore, regarding geometric accuracy, high photometric scores do not always correlate with correct geometry. While Uni-BG and OmniRe achieve high full-image PSNR, they suffer from higher Depth RMSE (2.15 m and 2.05 m, respectively). Thanks to our SSIM-Gated Optimization, Lumina-4DGS achieves the lowest Depth RMSE (1.89 m) and Chamfer Distance (0.215 m), proving that our method improves visual quality through physically grounded exposure modeling rather than geometric deformation.
Qualitative Comparison on the Waymo Dataset. To explicitly verify the visual advancement of our proposed method on the public benchmark, we provide a qualitative comparison in Figure 4. As illustrated, state-of-the-art baselines like OmniRe and Uni-BG struggle to decouple illumination from geometry under varying exposure. This failure causes them to overfit to brightness changes, generating severe floating artifacts—clearly visible as noisy clutter in the depth maps—to minimize photometric error. In contrast, Lumina-4DGS leverages Object-Aware SSIM-gating to explicitly enforce structural integrity, successfully removing these artifacts and producing sharp, clean renderings with highly accurate depth boundaries.
Figure 4.
Qualitative comparison on the Waymo Open Dataset. In the depth visualizations (right column), color encodes the relative distance from the sensor (e.g., from purple for near to yellow for far). (a) Ground truth RGB frames and reference depth captured with independent AE/AWB. (b) OmniRe fails to decouple illumination from geometry. Lacking strict photometric constraints, it overfits to brightness changes by generating severe floating artifacts, visible as noisy clutter in the depth map. (c) Uni-BG applies multi-view consistency but still retains minor floating artifacts in the sky region and produces suboptimal depth boundaries. (d) Lumina-4DGS (Ours) effectively harmonizes exposure and enforces structural integrity via Object-Aware SSIM-gating, successfully removing these artifacts to produce sharp RGB renderings and highly accurate, clean depth maps.
4.2.2. Quantitative Evaluation on Our Self-Collected In-the-Wild Dataset
While public benchmarks provide a standardized training ground, they represent an “idealized” autonomous driving scenario. To rigorously evaluate robustness in unconstrained real-world environments, we conduct experiments on our Self-Collected Surround-View Dataset.
Unlike curated public datasets, our proprietary data were captured using a commercial sensor suite without lab-grade synchronization, introducing two distinct challenges:
- Photometric Inconsistency: independent auto-exposure (AE) and auto-white-balance (AWB) cause drastic brightness shifts.
- LiDAR-Vision FoV Mismatch: our setup exhibits a significant field-of-view (FoV) gap between the 360° cameras and the sparse LiDAR.
Quantitative Results. As shown in Table 2, this domain gap causes a “performance collapse” for state-of-the-art baselines. Even recent methods designed for complex driving scenes, such as OmniRe [1] and Uni-BG [22], suffer drastic drops in both photometric and geometric accuracy. This confirms that methods relying heavily on accurate geometric initialization or standard bilateral grids fail when LiDAR supervision is sparse and illumination fluctuates drastically.
Table 2.
Quantitative comparison on Our Self-Collected Dataset. This proprietary dataset features severe AE/AWB shifts and large LiDAR-Vision FoV gaps. Lumina-4DGS maintains superior structural and perceptual quality, particularly recovering high-fidelity geometry (RMSE/CD) and details for dynamic objects.
| Method | Full PSNR ↑ | Full SSIM ↑ | LPIPS ↓ | Human PSNR ↑ | Human SSIM ↑ | Vehicle PSNR ↑ | Vehicle SSIM ↑ | RMSE ↓ | CD (m) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 3DGS [12] | 23.15 | 0.765 | 0.385 | 16.86 | 0.425 | 16.08 | 0.426 | 3.25 | 0.652 |
| StreetGS [25] | 23.82 | 0.772 | 0.368 | 16.86 | 0.434 | 21.50 | 0.435 | 2.90 | 0.548 |
| OmniRe [1] | 24.90 | 0.796 | 0.344 | 23.11 | 0.690 | 23.23 | 0.780 | 2.65 | 0.456 |
| Uni-BG [22] | 25.10 | 0.801 | 0.330 | **23.95** | 0.610 | 23.85 | 0.805 | 2.78 | 0.420 |
| Lumina-4DGS (Ours) | **27.23** | **0.811** | **0.112** | 23.40 | **0.713** | **25.81** | **0.823** | **2.12** | **0.315** |
In contrast, Lumina-4DGS demonstrates remarkable robustness. By explicitly decoupling exposure from geometry and employing SSIM-Gating to handle the FoV mismatch, we achieve a Full Image PSNR of 27.23 dB (+2.13 dB over Uni-BG). Most importantly, our method significantly improves geometric fidelity in these unconstrained environments, reducing the Chamfer Distance (CD) to 0.315 m, compared to 0.456 m for OmniRe. This proves that our hierarchical approach and gated optimization effectively translate from idealized benchmarks to unconstrained, real-world fleet data by prioritizing structural integrity.
4.2.3. Qualitative Comparison
Figure 5 provides a detailed visual analysis of reconstruction quality under challenging independent auto-exposure conditions.
Figure 5.
Visualizing the impact of exposure inconsistency on geometric reconstruction. (a) Ground truth frames captured with independent AE/AWB, showing significant brightness shifts. (b) OmniRe fails to decouple illumination from geometry. Lacking LiDAR constraints in the sky, it overfits to brightness changes by generating severe floating artifacts to minimize photometric error. (c) Uni-BG applies multi-view consistency constraints but retains minor floating artifacts and slight blurring in the sky region. (d) Lumina-4DGS (Ours) effectively harmonizes exposure and enforces structural integrity via Object-Aware SSIM-gating, successfully removing these artifacts to produce sharp, clean renderings.
As evidenced in the second row, the baseline OmniRe exhibits severe geometry-texture ambiguity. In the absence of LiDAR supervision for the upper field of view (e.g., sky and distant buildings), the model misinterprets rapid photometric shifts as geometric density. This leads to the hallucination of floating artifacts—manifesting as volumetric fog or haze—and results in blurred, inconsistent textures.
In contrast, Lumina-4DGS (third row) employs our SSIM-Gated mechanism to explicitly disentangle sensor dynamics from scene geometry. By penalizing erroneous density growth in photometrically unstable regions, our method suppresses these artifacts, yielding clean, temporally stable renderings that preserve geometric integrity.
4.3. Geometric Consistency and Ablation Study
A key hypothesis of our work is that unconstrained exposure optimization leads to texture-geometry ambiguity, where the model generates geometric artifacts to explain away photometric differences. To verify the efficacy of our design choices and to examine how the optimization behaves at different rendering levels, we conduct an ablation study on the Waymo Open Dataset by incrementally enabling the Global Exposure Module (Global), the Multi-Scale Bilateral Grid (Grid), and the Object-Aware SSIM-Gated Optimization (Gate).
Table 3 summarizes the quantitative results across these progressive rendering levels. We report both photometric metrics (PSNR, SSIM) and comprehensive geometric metrics, including Depth RMSE and Chamfer Distance (CD), to strictly evaluate structural fidelity.
Table 3.
Ablation study of Lumina-4DGS evaluating the contribution of each component at different rendering levels. Global: Global Exposure Module. Grid: Multi-Scale Bilateral Grid. Gate: Object-Aware SSIM-Gating. Note the severe trade-off between photometric fitting and geometric integrity (RMSE and CD) at Level 2, which is successfully resolved by the gating mechanism at Level 3. Bold numbers indicate the best results. The arrows (↑ and ↓) indicate whether higher or lower values represent better performance.
| Rendering Level | Global | Grid | Gate | PSNR ↑ | SSIM ↑ | RMSE ↓ | CD ↓ |
|---|---|---|---|---|---|---|---|
| Level 0 (Baseline) | × | × | × | 28.15 | 0.852 | 2.80 | 0.415 |
| Level 1 (+Global) | ✓ | × | × | 30.50 | 0.880 | 2.75 | 0.392 |
| Level 2 (+Grid) | ✓ | ✓ | × | **32.78** | **0.920** | 2.95 | 0.438 |
| Level 3 (Full Model) | ✓ | ✓ | ✓ | 31.13 | 0.915 | **1.89** | **0.215** |
Analysis of Results at Different Rendering Levels:
Level 1 (Effect of Global Decoupling): Adding global compensation yields a significant boost in baseline photometric quality (PSNR increases from 28.15 dB to 30.50 dB). This confirms that large-scale, image-wide sensor shifts must be stabilized before localized refinement.
Level 2 (The Overfitting Trap): When the Multi-Scale Bilateral Grid is introduced without gating, the model achieves peak photometric scores (PSNR: 32.78 dB). However, the geometric integrity degrades significantly, with Depth RMSE increasing to 2.95 m and Chamfer Distance rising to 0.438 m. This quantitatively proves that an unconstrained appearance model will visually overfit by deforming scene geometry to minimize RGB loss, manifesting as floating artifacts.
Level 3 (Efficacy of SSIM-Gating): Enabling the Object-Aware SSIM-Gated Optimization successfully resolves this texture-geometry ambiguity. Although there is an expected PSNR drop (−1.65 dB relative to the Level 2 peak) due to restricted photometric freedom, the geometric accuracy improves drastically: RMSE drops to 1.89 m and CD reduces to an optimal 0.215 m. This demonstrates that our highest level of rendering effectively prioritizes correct physical structure over photometric overfitting.
Qualitative Evaluation of Ablation Modules. To explicitly verify the functions of our different modules and visually demonstrate the “overfitting trap,” we provide a qualitative ablation comparison in Figure 6. As rendering complexity increases to Level 2 (unconstrained Grid), the model forcibly alters the scene’s geometric density to fit complex lighting variations, resulting in noticeable depth degradation. Level 3 (Full Model) successfully cures this by gating the gradients, ensuring sharp boundaries and accurate depth reconstruction.
Figure 6.
Qualitative ablation study on rendering levels. (a) Ground truth reference. (b) Level 0 (Baseline) fails to reconstruct proper exposure. (c) Level 2 (Unconstrained Grid) artificially inflates geometric density to explain away complex photometric shifts, causing extreme noise and floating artifacts in the depth map (texture-geometry ambiguity). (d) Level 3 (Full Model) utilizes Object-Aware SSIM-Gating to protect structural integrity, yielding sharp RGB details while fully recovering the clean, accurate depth boundaries.
Analysis of Gating Dynamics
To explicitly demonstrate the operational mechanism of our Object-Aware SSIM-Gating, we visualize the optimization dynamics in Figure 7. During the early geometric warm-up phase (grey-shaded region), the object-level SSIM remains below the dynamically annealed threshold (linearly scaled from 0.2 to 0.7). Consequently, the appearance gradients (orange line) are strictly masked to zero.
Figure 7.
Dynamics of Object-Aware SSIM-Gating. The grey region indicates the geometric warm-up phase, where appearance gradients are masked to zero. The green region denotes the photometric refinement phase, which activates only after the object-level SSIM surpasses the progressive threshold.
This masking forces the 3D Gaussians to align with true 2D object boundaries solely via geometric adjustments, effectively correcting the coarse boundaries inherent to LiDAR voxelization. Once the structural fidelity exceeds the safety threshold (green-shaded region), the gate activates. This unfreezes the Bilateral Grid, allowing it to learn fine-grained photometric details while the underlying geometric integrity is securely preserved.
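The gating behaviour described above, a hard mask on appearance gradients computed outside the autograd graph, can be sketched as follows. The function name is hypothetical, and this scalar version is a minimal stand-in for the paper's full per-object machinery:

```python
import torch

def gated_appearance_loss(appearance_loss: torch.Tensor,
                          object_ssim: torch.Tensor,
                          tau: float) -> torch.Tensor:
    """Block gradient flow to the exposure modules while geometry is unreliable.

    The 0/1 gate is computed under no_grad, so when the object-level SSIM is
    below the annealed threshold tau, the appearance loss (and hence every
    gradient reaching the appearance parameters) is masked to zero; once the
    SSIM crosses tau, gradients pass through unchanged.
    """
    with torch.no_grad():
        gate = (object_ssim >= tau).float()
    return gate * appearance_loss
```

Because the gate itself carries no gradient, the optimizer sees either the full appearance gradient or exactly zero; during warm-up only the geometric losses drive the 3D Gaussians.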
5. Discussion
The experimental results presented in Section 4 demonstrate that Lumina-4DGS effectively resolves the long-standing challenge of photorealistic reconstruction under unconstrained illumination conditions. In this section, we interpret these findings in the context of previous studies, analyze the underlying mechanisms of our success and discuss the broader implications for autonomous driving simulation.
5.1. Resolving the Texture-Geometry Ambiguity
A critical finding of our study is the confirmation of the “texture-geometry ambiguity” hypothesis. Previous state-of-the-art methods, such as OmniRe [1] and Street Gaussians [25], operate under the assumption that photometric consistency correlates directly with geometric accuracy. However, our ablation studies (Table 3) reveal that this assumption breaks down in “in-the-wild” scenarios with independent auto-exposure. When the rendering equation is forced to minimize the RGB loss against fluctuating brightness without explicit exposure decoupling, the optimizer resorts to “overfitting” by deforming the scene geometry, manifesting as the floating artifacts observed in Figure 5.
Our method fundamentally alters this optimization landscape. By introducing the Global Exposure Module, we mathematically disentangle sensor-induced sensitivity shifts from physical surface albedo. More importantly, SSIM-Gated Optimization serves as a structural regularizer. By rejecting gradient updates in regions where structural similarity is low (indicating a transient photometric error rather than a geometric misalignment), we force the Gaussian primitives to adhere to the physical scene geometry. This explains why Lumina-4DGS maintains the lowest Depth RMSE (1.89 m) even when photometrically outperforming baselines.
5.2. Bridging the Gap to Production Data
Most existing NeRF and 3DGS-based approaches are benchmarked on curated datasets like Waymo or NuScenes, which feature synchronized sensors and consistent exposure. Our experiments on the Self-Collected Surround-View Dataset highlight a significant “domain gap” between these idealized benchmarks and production-grade sensor data.
The performance collapse of baselines on our custom dataset (Table 2) underscores the fragility of current SOTA methods when facing LiDAR-Vision FoV mismatches and independent AE/AWB. Lumina-4DGS demonstrates that robust view synthesis in real-world applications requires modeling the sensor’s physical characteristics (e.g., ISO gain, vignetting) as part of the reconstruction pipeline. This capability is particularly valuable for building high-fidelity digital twins using low-cost, unsynchronized commercial fleets, significantly lowering the barrier to entry for large-scale data simulation.
5.3. Limitations and Future Directions
Despite these advancements, our current framework has limitations. First, while we effectively handle sensor-induced exposure changes, physical illumination changes caused by dynamic weather (e.g., moving cloud shadows, heavy rain, or snow) introduce complex light transport effects that our affine model cannot fully capture. Second, in regions with extreme motion blur or complete darkness, the SSIM gating mechanism may become overly conservative, potentially hindering geometry convergence.
Future research will focus on two directions: (1) integrating physics-based weather rendering models to separate environmental illumination from sensor exposure; and (2) extending our framework to an end-to-end neural sensor simulation pipeline, allowing for the synthesis of not just RGB images, but also raw sensor data with realistic noise profiles for downstream perception testing.
6. Conclusions
In this paper, we presented Lumina-4DGS, a novel framework designed to achieve illumination-robust 4D Gaussian Splatting for dynamic scene reconstruction. Addressing the limitations of existing methods in unconstrained environments, we identified the “texture-geometry ambiguity” as the primary obstacle where dynamic illumination shifts are often misinterpreted as geometric motion or structural noise.
To overcome this, we introduced a hierarchical exposure compensation pipeline integrated with an Object-Aware SSIM-Gated Optimization strategy. This approach effectively decouples sensor-induced photometric variations from the true temporal dynamics of the 4D scene. Extensive experiments on the Waymo Open Dataset and our challenging self-collected fleet dataset demonstrate that Lumina-4DGS not only achieves state-of-the-art rendering quality under rapid exposure changes but also recovers geometrically consistent structures, as validated against LiDAR ground truth.
By enabling robust reconstruction in the presence of independent auto-exposure and varying lighting conditions, Lumina-4DGS significantly closes the gap between idealized benchmarks and real-world autonomous driving data.
Moving forward, our future research will focus on two primary directions. First, leveraging our explicitly decoupled geometry and illumination representations, we plan to extend Lumina-4DGS to support controllable scene relighting and dynamic material editing. Second, we aim to investigate the framework’s robustness against more complex, physics-based environmental disturbances, such as severe weather conditions (e.g., rain, snow) and dynamic surface reflections. We believe these advancements will further solidify the foundation for creating high-fidelity, physically consistent digital twins for dynamic urban environments.
Acknowledgments
We thank the Waymo Open Dataset team for providing the high-quality autonomous driving data that made this research possible. We also extend our gratitude to the authors of OmniRe for open-sourcing their codebase, which served as a valuable baseline for our comparative analysis. Finally, we acknowledge the technical support provided by the DriveStudio team.
Author Contributions
Conceptualization, X.W.; methodology, X.W.; validation, X.W. and Y.S.; formal analysis, X.W. and S.L.; data curation, S.L.; writing—original draft preparation, X.W., Y.S. and S.L.; writing—review and editing, X.W. and Q.W.; supervision, Q.W.; project administration, Q.W.; funding acquisition, Q.W. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Publicly available datasets were analyzed in this study. The Waymo Open Dataset (specifically sequences 000–005 and 010–015) can be found at https://waymo.com/open/ (accessed on 1 December 2025).
Conflicts of Interest
The authors declare no conflicts of interest.
Funding Statement
This research was funded by the National Natural Science Foundation of China, grant number 42374029.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
- 1. Chen Z., Yang J., Huang J., de Lutio R., Martinez Esturo J., Ivanovic B., Litany O., Gojcic Z., Fidler S., Pavone M., et al. OmniRe: Omni Urban Scene Reconstruction. arXiv 2024, arXiv:2408.16760.
- 2. Kulhanek J., Peng S., Kukelova Z., Pollefeys M., Sattler T. WildGaussians: 3D Gaussian Splatting in the Wild. arXiv 2024, arXiv:2407.08447.
- 3. Chen X., Xiong Z., Chen Y., Li G., Wang N., Luo H., Chen L., Sun H., Wang B., Chen G., et al. DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images. arXiv 2025, arXiv:2512.03004.
- 4. Tonderski A., Lindström C., Hess G., Ljungbergh W., Svensson L., Petersson C. NeuRAD: Neural Rendering for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
- 5. Huang N., Wei X., Zheng W., An P., Lu M., Zhan W., Tomizuka M., Keutzer K., Zhang S. S3Gaussian: Self-Supervised Street Gaussians for Autonomous Driving. arXiv 2024, arXiv:2405.20323.
- 6. Zhou X., Lin Z., Shan X., Wang Y., Sun D., Yang M.-H. DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 21634–21643.
- 7. Wang J., Che H., Chen Y., Yang Z., Goli L., Manivasagam S., Urtasun R. Flux4D: Flow-based Unsupervised 4D Reconstruction. arXiv 2025, arXiv:2512.03210.
- 8. Mildenhall B., Srinivasan P.P., Tancik M., Barron J.T., Ramamoorthi R., Ng R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Commun. ACM 2021, 65, 99–106. doi:10.1145/3503250.
- 9. Wang P., Liu L., Liu Y., Theobalt C., Komura T., Wang W. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. arXiv 2021, arXiv:2106.10689.
- 10. Yuan S., Zhao H. SlimmeRF: Slimmable Radiance Fields. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; pp. 64–74.
- 11. He L., Li L., Sun W., Han Z., Liu Y., Zheng S., Wang J., Li K. Neural Radiance Field in Autonomous Driving: A Survey. arXiv 2024, arXiv:2404.13816.
- 12. Kerbl B., Kopanas G., Leimkühler T., Drettakis G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 139. doi:10.1145/3592433.
- 13. Ye S., Dong Z.-H., Hu Y., Wen Y.-H., Liu Y.-J. Gaussian in the Dark: Real-Time View Synthesis from Inconsistent Dark Images Using Gaussian Splatting. In Proceedings of Pacific Graphics 2024 (PG 2024), Huangshan, China, 13–16 October 2024.
- 14. Zhang D., Wang C., Wang W., Li P., Qin M., Wang H. Gaussian in the Wild: 3D Gaussian Splatting for Unconstrained Image Collections. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 341–359.
- 15. Liu H., Jiang P., Huang J., Lu M. Lumos3D: A Single-Forward Framework for Low-Light 3D Scene Restoration. arXiv 2025, arXiv:2511.09818.
- 16. Liu M., Liu J., Zhang Y., Li J., Yang M.Y., Nex F., Cheng H. 4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025.
- 17. Zhang S., Ye B., Chen X., Chen Y., Zhang Z., Peng C., Shi Y., Zhao H. Drone-assisted Road Gaussian Splatting with Cross-view Uncertainty. In Proceedings of the British Machine Vision Conference (BMVC), Scotland, UK, 25–28 November 2024.
- 18. Wang Y., Wang C., Gong B., Xue T. Bilateral Guided Radiance Field Processing. ACM Trans. Graph. 2024, 43, 148. doi:10.1145/3658148.
- 19. Martin-Brualla R., Radwan N., Sajjadi M.S.M., Barron J.T., Dosovitskiy A., Duckworth D. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2–8.
- 20. Tancik M., Weber E., Ng E., Li R., Yi B., Wang T., Kristoffersen A., Austin J., Salahi K., Ahuja A., et al. Nerfstudio: A Modular Framework for Neural Radiance Field Development. In Proceedings of ACM SIGGRAPH 2023, Los Angeles, CA, USA, 6–10 August 2023.
- 21. Fridovich-Keil S., Meanti G., Warburg F.R., Recht B., Kanazawa A. K-Planes: Explicit Radiance Fields in Space, Time, and Appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
- 22. Wang N., Chen Y., Xiao L., Xiao W., Li B., Chen Z., Ye C., Xu S., Zhang S., Yan Z., et al. Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting. arXiv 2025, arXiv:2506.05280.
- 23. Sun P., Kretzschmar H., Dotiwalla X., Chouard A., Patnaik V., Tsui P., Guo J., Zhou Y., Chai Y., Caine B., et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454.
- 24. Wu Z., Liu T., Luo L., Zhong Z., Chen J., Xiao H., Hou C., Lou H., Chen Y., Yang R., et al. MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving. In Proceedings of the CAAI International Conference on Artificial Intelligence, Fuzhou, China, 22–23 July 2023; pp. 3–15.
- 25. Yan Y., Lin H., Zhou C., Wang W., Sun H., Zhan K., Lang X., Zhou X., Peng S. Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 156–173.
- 26. Wei X., Ye Z., Gu Y., Zhu Z., Guo Y., Shen Y., Zhao S., Lu M., Sun H., Wang B., et al. ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking. arXiv 2026, arXiv:2601.01386.
- 27. Huang X., Li J., Wu T., Zhou X., Han Z., Gao F. Flying in Clutter on Monocular RGB by Learning in 3D Radiance Fields with Domain Adaptation. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 19–23 May 2025.
- 28. Afifi M., Zhao L., Punnappurath A., Abdelsalam M.A., Zhang R., Brown M.S. Time-Aware Auto White Balance in Mobile Photography. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–23 October 2025; pp. 64–74.
- 29. Dahmani H., Bennehar M., Piasco N., Roldao L., Tsishkou D. SWAG: Splatting in the Wild Images with Appearance-conditioned Gaussians. arXiv 2024, arXiv:2403.10427.
- 30. Fu C., Chen G., Zhang Y., Yao K., Xiong Y., Huang C., Cui S., Matsushita Y., Cao X. RobustSplat++: Decoupling Densification, Dynamics, and Illumination for In-the-Wild 3DGS. arXiv 2025, arXiv:2512.04815.
- 31. Huang Y., Bai L., Cui B., Li Y., Chen T., Wang J., Wu J., Lei Z., Liu H., Ren H. Endo-4DGX: Robust Endoscopic Scene Reconstruction and Illumination Correction with Gaussian Splatting. In Proceedings of the 2025 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), Daejeon, Republic of Korea, 23–27 September 2025.
- 32. Du Y., Zhang Y., Yu H.-X., Tenenbaum J.B., Wu J. Neural Radiance Flow for 4D View Synthesis and Video Processing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
- 33. Xu B., Xu Y., Yang X., Jia W., Guo Y. Bilateral Grid Learning for Stereo Matching Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12497–12506.
- 34. Guédon A., Lepetit V. SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering. arXiv 2023, arXiv:2311.12775.
- 35. Jiang Y., Tu J., Liu Y., Gao X., Long X., Wang W., Ma Y. GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces. arXiv 2023, arXiv:2311.17977.
- 36. Huang B., Yu Z., Chen A., Geiger A., Gao S. 2D Gaussian Splatting for Geometrically Accurate Radiance Fields. In Proceedings of ACM SIGGRAPH 2024 Conference Papers, Denver, CO, USA, 27 July–1 August 2024; pp. 1–11.