Abstract
Light field rendering is widely applied in virtual reality (VR), augmented reality (AR), mixed reality (MR) and extended reality (XR). Photorealistic light field displays require a dense view sampling of the scene. However, in dynamic immersive interactions, the available observations are often too sparse to synthesize the complete light field required for a high-fidelity display. This poses a major challenge for generating photometrically consistent views between the virtual and real worlds. Here, we introduce a neural illumination estimation and editing framework for adaptive light field synthesis. The proposed method explicitly encodes intrinsic illumination parameters from a single sampling view; these parameters drive a hybrid-guided generative network that synthesizes photometrically plausible dense views of the scene under the guidance of a complete rendering model. It deconstructs the baked-in lighting to enable consistent and high-fidelity relighting from any viewpoint. Our method estimates and edits illumination with only 0.2397 W m−2 irradiance error and 7.02° angular deviation, yielding synthesized images with an average 17.0% improvement in PSNR and a 51.2% reduction in LPIPS. This work presents a practical pathway towards truly interactive and adaptive digital light fields, enabling photorealistic content generation for the next generation of near-eye displays and computational imaging systems.
Subject terms: Displays, Imaging and sensing
Single-view neural illumination estimation enables interactive lighting editing and perceptual consistency for immersive dynamic light field displays.

Introduction
The light field is an intrinsic description of light propagation in space, encoding the radiance of light rays at every position and direction1–3. Light field rendering therefore generates true-to-nature visual scenes. This characteristic makes light field rendering a foundational technology for advanced display systems, including light field and holographic displays, which aim to deliver true-to-nature visual experiences for the observer. Traditional light field rendering methods model the complex interactions between the ambient light field and the geometry and materials of scenes to control their appearance. With the application of deep learning in optics, computational imaging has broken through long-standing limits in statistical inference, computation, and inverse design for imaging systems4–7. Indeed, some impressive applications have demonstrated its effectiveness, such as the real-time generation of photorealistic 3D holograms8 and the reconstruction of information from sparse samples via compressive sensing9. For light field displays, neural illumination unifies the sensing, processing, and synthesis of light fields to bridge sparse real-world measurements with the dense data required for their computational synthesis10,11.
Immersive interaction aims to provide a sense of immersion and presence for users, and thus requires a perceptually seamless, highly realistic fusion between virtual content and a user’s real-world environment12,13. The fidelity of this integration critically depends on photometric consistency14,15, which is achieved through a high-fidelity light transport simulation based on the rendering equation16. This simulation requires both accurate physical modeling17 and real-time rendering. This goal has motivated extensive research on sensing hardware, and significant advancements have recently been made in these areas, such as illumination field reconstruction18 and transient light transport analysis19. For example, advanced optical see-through (OST) systems based on waveguides and diffractive optical elements (DOEs)20–22 continually improve display efficiency and field of view, while sophisticated liquid crystal optics23–26 enhance visual immersion in immersive interaction. However, these advanced display systems face a critical content bottleneck: the on-board sensors are observation-limited and cannot capture sufficient environmental light information to dynamically render a photometrically consistent virtual light field. To overcome this problem, light field rendering requires a dense view sampling of the scene to sufficiently capture its spatial and angular radiance information3. These methods capture the specific environmental illumination conditions, which are jointly encoded with geometry and material properties within the asset’s parameters. This illumination coding, termed “baked-in illumination”27–29, becomes a significant limitation for interactive displays, as the rendered assets cannot photometrically adapt to the user’s real-time environment. For instance, neural radiance fields (NeRFs) have been hugely successful in novel view synthesis and serve as powerful core assets within rendering engines by learning a continuous mapping from a 5D coordinate to color and density30. Neural illumination can also be supported by explicit coding, such as 3D Gaussian splatting31, which allows for high-quality novel-view synthesis at competitive speeds by fitting the Gaussian spherical harmonic function of each point from all views.
Although neural illumination enhances the sense of reality in the visual experience, it relies heavily on dense views of the scene. In immersive interaction, however, observation views are limited and the incident light ray information changes dynamically. In this context, “dynamic” refers specifically to the variation of incident environmental illumination, such as sudden changes across scene transitions and continuous temporal changes in intensity and direction; this is distinct from viewpoint changes caused by user interaction. Consequently, new target illumination cannot be estimated directly, and “baked-in illumination” cannot be edited directly. This problem is especially serious in mixed reality (MR). Since light ray information is not stable in the real world, virtual content needs to retain dynamic photometric coherence with environmental illumination. This is not only an aesthetic concern but also ensures the perceptual coherence and interaction fidelity of the entire MR experience: correct illumination controls the visual realism of synthetic content, and illumination perception influences user perception of depth and materials. In MR, this limitation manifests as a fundamental barrier to achieving photometric coherence under dynamically changing real-world conditions. Moreover, the deep entanglement of scene properties results in a visually rich but optically static asset, unable to react to dynamic changes in real-world lighting or to facilitate intuitive artistic edits. Adapting such baked-in representations to dynamic lighting would typically require prohibitive retraining or dense multi-view inputs, which is impractical for immersive interaction. To estimate and edit illumination from limited views of observation, many approaches have been explored, ranging from intrinsic image analysis32 to solving complex inverse problems in scattering media10 or with single-pixel detectors33. Nevertheless, they have their own limitations. Many 3D relighting pipelines assume that the target illumination is provided as dense global measurements such as high-dynamic-range (HDR) environment maps, which are difficult to acquire from a single sparse observation, and reliance on per-scene optimization often limits their generalization to the open world. For instance, traditional heuristic approaches analyze cues like shadows and highlights but lack robustness34,35. Learning-based methods have shown promise in predicting indoor illumination distributions36,37 or estimating lighting for specific tasks like face relighting38, but they exhibit limited generalization. Specifically, some are designed for view-misaligned scenarios39, some leverage textual prompts for control40, and some achieve high quality but require training a separate model for each object41, which is impractical and inflexible for real-time applications. On the whole, a critical gap remains for a solution that can generate generalizable and physically plausible light fields for dynamic display under the real-world constraint of sparse observation views.
In this work, we propose a novel neural illumination framework that requires only a single observation view, explicitly encoding the intrinsic parameters of illumination. These parameters can drive the synthesis of dense view sampling with any light field rendering model. Inspired by the computational imaging paradigm, we avoid directly solving the intractable inverse problem of full light field recovery. We first design a computational optical perception (COP) module to estimate a compact, explicit and parametric representation of the dominant illumination from only one observation view. While MR devices often provide stereo cameras, our method is defined as single-view because it requires only one input image stream as the lower-bound sensing condition for illumination estimation; stereo can be used as an optional enhancement but is not a dependency. Since global illumination cues are approximately consistent across the small stereo baseline, this design avoids redundant computation and preserves compatibility with monocular devices. Second, these explicit light codes guide a generative light transport synthesis (GLTS) module to computationally solve the forward problem, rendering a new, photometrically consistent 2D view of the scene under the target illumination condition. Moreover, a complete light field rendering of the scene is modeled under a fixed preset illumination condition. Third, the set of synthesized views serves as supervision for a joint optimization over both the appearance and camera geometry of the scene, globally reconstructing a new coherent 3D representation with the help of the light field rendering model. This entirely deconstructs the baked-in lighting, enabling consistent, high-fidelity neural illumination editing from any viewpoint. Our method does not need panoramic environment maps, dense multi-view supervision, or explicit material decomposition. It embeds the virtual sensing of the human visual system into the real environment. The proposed method is expected to provide a practical and scalable visual perception experience toward the next generation of truly immersive display systems, especially mixed reality.
Results
From a first-principles optics perspective, any visual scene is comprehensively described by its light field, a fundamental concept that characterizes the flow of light in free space. In its complete form, the light field is represented by the 5D Plenoptic Function, which defines the radiance of every light ray at every point in space and in every direction:
$$L = L(x, y, z, \theta, \phi) \qquad (1)$$
where (x, y, z) are spatial coordinates, (θ, ϕ) define the ray’s direction, and radiance L is the physical quantity of light intensity. While the light field describes light in space, the appearance of objects is determined by the interaction between an incident light field and the scene’s properties. This interaction is physically governed by the Rendering Equation:
$$L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (\omega_i \cdot n)\, \mathrm{d}\omega_i \qquad (2)$$
where Le(x, ωo) is the emitted radiance, the outgoing light field Lo (what is ultimately captured) is the integral of the incident field Li over the hemisphere Ω, modulated by the material’s Bidirectional Reflectance Distribution Function (fr) and the surface geometry (n). Computational methods for light field display are focused on sensing, processing, and synthesizing light fields to drive these advanced visual systems, often by solving an inverse problem: inferring properties of the scene (fr, n) or the illumination (Li) from measurements of the outgoing field (Lo).
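As a concrete illustration of Eq. (2), the following minimal Python sketch evaluates the outgoing radiance of a purely diffuse (Lambertian) surface under a single directional light, for which the hemisphere integral collapses to a single term. The function name and all numerical values are illustrative assumptions, not quantities from this work.

```python
import numpy as np

def lambertian_outgoing_radiance(albedo, normal, light_dir, light_radiance, emitted=0.0):
    """Evaluate Eq. (2) for a Lambertian BRDF (f_r = albedo / pi) under one
    directional light, so the hemisphere integral reduces to a single term:
    L_o = L_e + (albedo / pi) * L_i * max(0, n . omega_i)."""
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)
    w_i = np.asarray(light_dir, dtype=float)
    w_i /= np.linalg.norm(w_i)
    cos_theta = max(0.0, float(np.dot(n, w_i)))
    return emitted + (albedo / np.pi) * light_radiance * cos_theta

# Illustrative values only: a gray surface lit from 45 degrees above.
L_o = lambertian_outgoing_radiance(
    albedo=0.5,
    normal=[0.0, 0.0, 1.0],
    light_dir=[0.0, 1.0, 1.0],
    light_radiance=10.0,  # arbitrary incident radiance
)
print(f"Outgoing radiance: {L_o:.3f}")
```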
A primary frontier for computational imaging is immersive interaction, often facilitated by modern head-mounted displays, where the objective is to project a virtual light field that creates a perceptually seamless superposition with the observer’s real-world view. The final image perceived by the observer, IMR, is an integration of this composite light field:
$$I_{\mathrm{MR}} = \int_{\Omega} \big[ L_{o,\mathrm{real}}(x, \omega) + L_{o,\mathrm{virtual}}(x, \omega) \big]\, \mathrm{d}\omega \qquad (3)$$
For this fusion to be realistic, as shown by Eq. (2), the virtual light field must be synthesized using the real environment’s incident illumination, Li,real:
$$L_{o,\mathrm{virtual}}(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_{i,\mathrm{real}}(x, \omega_i)\, (\omega_i \cdot n)\, \mathrm{d}\omega_i \qquad (4)$$
This reveals the fundamental optical challenge for realistic mixed reality displays: to generate the required virtual content, one must first solve the inverse problem of sensing and modeling the real-world illumination field, Li,real, from limited observations.
This challenge is particularly acute for modern 3D scene representations like implicit neural fields. These models learn a function Fθ, parameterized by network weights θ, which maps a 5D coordinate directly to an outgoing radiance value. During their training, they jointly optimize for geometry, materials, and lighting, effectively “baking-in” the specific incident illumination field from the training data, Li,training, into their weights:
$$F_{\theta}(x, y, z, \theta_v, \phi_v) = L_o\big(x, \omega_o \mid L_{i,\mathrm{training}}\big) \qquad (5)$$
The task of estimating and editing illumination for an interactive display is to modify this representation to generate a new outgoing light field, Lo,target, that corresponds to a new target illumination, Li,target. This can be formulated as finding a new set of parameters θ′ such that:
$$F_{\theta'}(x, y, z, \theta_v, \phi_v) = L_{o,\mathrm{target}} = L_e + \int_{\Omega} f_{r,\mathrm{implicit}}\, L_{i,\mathrm{target}}\, (\omega_i \cdot n)\, \mathrm{d}\omega_i \qquad (6)$$
Directly solving this integral is intractable, as the scene’s implicit bidirectional reflectance distribution function (BRDF) (fr,implicit) is non-trivially entangled within the thousands of parameters of the neural network Fθ. A conventional approach might attempt to first reconstruct a full panoramic environment map from the sparse inputs and then use it to solve the integral, which is a severely ill-posed problem. To address this issue, our framework proposes a more direct and robust pathway. Instead of attempting this intractable intermediate reconstruction, our approach bypasses the need for an environment map altogether. We first employ the COP module to directly infer a compact, parametric representation of the dominant illumination from the single image, and then employ a generative process to synthesize the resulting outgoing light field, which in turn guides the update of the neural representation from θ to θ′. Our entire process is shown in Fig. 1.
Fig. 1. Overview of our proposed neural illumination estimation and editing framework, illustrating the end-to-end information flow from real-world perception to display on a wearable device.
The process begins as the COP module captures sparse optical cues from the ambient Environmental Lighting to infer a parametric model of the scene’s illumination. Guided by these parameters, the GLTS module generates a photorealistic Synthesized Light Field. This virtual content is processed and transmitted via the Edge Computing Module to the display terminal. Finally, as illustrated in the XR Device Diagram, the light is modulated and coupled for projection to the user, ensuring the resulting virtual image is photometrically consistent with the real world
Illumination field inference from one single view of optical cue
The foundational stage of our framework is to solve the optical inverse problem of characterizing the incident illumination field from a sparse set of 2D projections, here a single observation view. To this end, we developed the COP module. Instead of attempting an intractable full reconstruction of the light field, the COP module employs a two-stage process to infer a compact and effective parametric representation of the dominant illumination.
The first stage is a multi-scale inference engine, Finfer, which is responsible for the primary numerical estimation. The input image Iin is processed using a feature extractor backbone that produces multi-scale feature maps. To precisely identify the most informative photometric features, we employ a bespoke attention mechanism at each scale. For a feature map X, this process is defined as:
$$X' = M_C(X) \otimes X, \qquad X'' = M_S(X') \otimes X' \qquad (7)$$
where MC and MS are the channel and spatial attention maps, respectively, and ⊗ denotes element-wise multiplication. This allows the network to focus on critical physical cues like specular highlights. The attention-weighted multi-scale features are then aggregated and fed into two parallel Multi-Layer Perceptron (MLP) heads to produce structured latent codes, which we denote as the implicit illumination parameters: the effective irradiance E and a 3D directional vector D. The irradiance is computed via our Non-linear Irradiance Manifold Interpolation (NIMI) technique. Standard imaging pipelines (e.g., physical cameras or rendering engines) apply non-linear tone mapping curves such as Filmic or Gamma compression to map high dynamic range radiance to low dynamic range pixel values. Consequently, direct linear interpolation in the latent space would result in photometric artifacts due to this non-linearity. To address the non-linear response of modern imaging pipelines42, NIMI performs an inverse mapping to project the predicted continuous irradiance values onto an approximately linearized radiometric manifold43. This manifold is spanned by the nearest learned discrete intensity anchors that have been mapped into the linear domain. By interpolating within this linearized space, the process ensures that illumination states evolve in a physically linear manner44, effectively avoiding photometric artifacts caused by non-proportional scaling in tone-mapped space45. This ensures that the synthesized intermediate illumination states remain physically linear and continuous, effectively decoupling the light transport simulation from the non-linear response of the imaging system. These parameters represent an intermediate, unrefined encoding of the dominant light’s properties, learned directly by the network. The output of this first stage is thus:
$$(E, D) = F_{\mathrm{infer}}(I_{\mathrm{in}}) \qquad (8)$$
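To make the attention weighting of Eq. (7) concrete, the following PyTorch sketch implements a CBAM-style channel/spatial attention block of the kind described above. The reduction ratio, kernel size and tensor shapes are illustrative assumptions rather than the exact configuration used in the COP module.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style attention: X' = M_C(X) ⊗ X, then X'' = M_S(X') ⊗ X'."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.GELU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention map M_C from average- and max-pooled descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * m_c                      # X' = M_C(X) ⊗ X
        # Spatial attention map M_S from channel-pooled maps.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        m_s = torch.sigmoid(self.spatial_conv(pooled))
        return x * m_s                   # X'' = M_S(X') ⊗ X'

# Example: refine a 48-channel feature map.
feat = torch.randn(2, 48, 64, 64)
print(ChannelSpatialAttention(48)(feat).shape)  # torch.Size([2, 48, 64, 64])
```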
The second stage is a semantic interpreter, Fsem, designed to enhance the system’s robustness and provide an intuitive, high-level control signal. This stage is crucial because the initial latent parameters, while quantitatively useful, can be unstable in ambiguous lighting scenarios. To address this issue, the module takes the implicit parameters E and D from the first stage as input. Built upon a vision transformer (ViT) and a generative pre-trained transformer (GPT) decoder, it leverages the powerful generative prior of the language model to perform two concurrent tasks: it refines the latent inputs into final, physically plausible explicit parameters (E, D), and simultaneously translates them into an interpretable textual description of the lighting geometry, Dtext (e.g., “The light comes from the upper right, and the shadow appears on the left side.”). This estimate-and-refine strategy is highly effective, as the interpreter acts as a powerful prior, correcting potential instabilities from the initial direct estimation and ensuring a consistent, multi-modal output for guiding the synthesis stage.
$$(E, D, D_{\mathrm{text}}) = F_{\mathrm{sem}}\big(E, D, I_{\mathrm{in}}\big) \qquad (9)$$
This text-based representation offers superior robustness against noise and serves as a powerful and intuitive global prior for the subsequent generative synthesis stage.
The performance of the COP module is detailed in Table 1. The inference engine achieves a mean absolute error below 0.3 W m−2 for irradiance and a mean angular error of 7.02 degrees for the direction vector. The subsequent semantic interpreter successfully refines the initial latent estimates and translates them into contextually appropriate textual descriptions. This two-stage design, as a key feature of our framework, can provide both precise numerical predictions and a robust semantic descriptor, enabling high-fidelity synthesis and intuitive illumination control.
Table 1.
Quantitative evaluation of our illumination perception module (COP)
| Scene ID | Intensity GT (W m−2) | Intensity Pred. (W m−2) | Abs. Error (W m−2) | Direction GT (Azimuth, Elevation) | Direction Pred. (Azimuth, Elevation) | Angular Error |
|---|---|---|---|---|---|---|
| Signal | 8.0 | 8.21 | 0.21 | (45°, 30°) | (51°, 33°) | 6.78° |
| Amongus | 4.5 | 4.75 | 0.25 | (20°, 25°) | (27°, 29°) | 7.30° |
| Hotdogs | 12.2 | 12.5 | 0.30 | (30°, 70°) | (35°, 77°) | 8.00° |
| Space | 18.0 | 17.84 | 0.16 | (70°, 15°) | (64°, 13°) | 6.00° |
| Template | 6.5 | 6.70 | 0.20 | (50°, 50°) | (57°, 55°) | 6.90° |
| Room | 9.0 | 8.74 | 0.26 | (80°, 35°) | (88°, 38°) | 7.14° |
| Mean Error | – | – | 0.23 | – | – | 7.02° |
The numerical intensity data corresponds to the preset solar irradiance parameters defined during the data generation process. These values serve as the ground truth inputs for the rendering model, thereby eliminating the need for post-hoc radiometric calibration or geometric registration. The module achieves high accuracy in estimating both the intensity and direction of the dominant light source across various scenes, with light sources positioned in the top-right quadrant relative to the scene origin. The final row displays the mean absolute error (MAE) averaged across all tested scenes. All angular measurements are reported in the Horizontal Coordinate System, with azimuth and elevation angles defining the light source direction
Generative synthesis of a 2D light field slice
With the illumination parameters (E, D, Dtext) computationally perceived, the second stage of our framework addresses the forward problem: synthesizing a new, physically plausible viewpoint, i.e., a 2D slice of the outgoing light field Lo. To achieve this, we developed the GLTS, whose generator we denote G. Our design is built upon the principles of multi-domain image-to-image translation, which utilizes a single, versatile generator to learn mappings between multiple appearance domains. However, a key challenge is that conventional implementations of such models often require predefined discrete domains, which is unsuitable for the continuous and unpredictable nature of real-world illumination. Our GLTS overcomes this by dynamically conditioning the synthesis process based on a novel hybrid guidance mechanism.
The synthesis process transforms an initial rendered view from a 3D neural representation, Render(θ), into the slice:
$$\hat{I} = G\big(\mathrm{Render}(\theta);\, E, D, D_{\mathrm{text}}\big) \qquad (10)$$
where θ denotes the parameters of the original neural scene. Our hybrid guidance mechanism combines high-level parametric control with fine-grained visual calibration. First, the inferred illumination parameters provide the macroscopic guidance. The semantic descriptor Dtext and the vector D are encoded to configure a mapping network Fmap, which computes a latent style code slatent that sets the global properties of the light transport.
$$s_{\mathrm{latent}} = F_{\mathrm{map}}\big(D_{\mathrm{text}}, D\big) \qquad (11)$$
While this defines the global behavior, the single observation contains invaluable high-frequency optical details. To incorporate these, our system performs a microscopic calibration. An encoder network analyzes an observed image to extract a visual style code, svisual, which captures subtle, scene-specific phenomena. The final calibrated style code, sfinal, is a fusion of these two components modulated by the incident irradiance E:
$$s_{\mathrm{final}} = E \cdot \big(\gamma\, s_{\mathrm{latent}} + \beta\, s_{\mathrm{visual}}\big) \qquad (12)$$
where γ and β are learnable scaling and blending factors. This hybrid approach ensures the synthesized light field slice is not only globally consistent with the target illumination but also locally faithful to observed optical phenomena. This design provides crucial flexibility: in scenarios where an input sparse illumination image is unavailable, the system can operate in a text-only mode, relying solely on the macroscopic guidance slatent for illumination editing and generation. When an input image is provided, its inclusion via svisual serves to optimize and refine the synthesis with fine-grained, scene-specific optical details, thus enhancing the final generated effect.
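A minimal PyTorch sketch of this fusion step (Eqs. (11)–(12)) is given below, assuming 512-dimensional style codes and a scalar, pre-normalized irradiance; the exact blending form, dimensionalities and module names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HybridStyleFusion(nn.Module):
    """Sketch of Eq. (12): fuse s_latent (text/direction guidance) with
    s_visual (single-image encoder output), modulated by the irradiance E."""
    def __init__(self, style_dim=512):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(style_dim))  # learnable scaling
        self.beta = nn.Parameter(torch.ones(style_dim))   # learnable blending

    def forward(self, s_latent, s_visual, irradiance):
        # Blend the two codes, then scale by the (normalized) irradiance E.
        return irradiance * (self.gamma * s_latent + self.beta * s_visual)

fusion = HybridStyleFusion()
s_latent = torch.randn(1, 512)   # from the mapping network F_map(D_text, D)
s_visual = torch.randn(1, 512)   # from the encoder of the observed image
s_final = fusion(s_latent, s_visual, irradiance=torch.tensor(0.8))
print(s_final.shape)             # torch.Size([1, 512])
```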
Furthermore, we specialize the training objective to maximize its efficacy for the relighting task. While such versatile generative architectures are designed to learn transformations between different object identities, this capability is unnecessary for our purpose and would divert the model’s learning capacity. We therefore tailor the training process to focus the model’s entire capacity on a single critical task, modeling the appearance of the same object as it responds to a continuous spectrum of illumination states. This specialized training scheme allows the GLTS to learn a more accurate and disentangled representation of the object’s intrinsic light transport properties (its implicit BRDF), without being confounded by irrelevant variations in geometry or texture. To implement this, the Generator and a corresponding Discriminator are trained adversarially, forming a conditional Generative Adversarial Network (GAN) as illustrated in our framework architecture (Fig. 4). During the training process, the Discriminator is tasked with distinguishing real images from the edited images produced by the Generator. Crucially, the Discriminator is conditioned on both the target single illumination image and its textual description. This dual conditioning enables it to learn the nuanced stylistic features of the target light domain, providing a powerful training signal that compels the Generator to produce images that are not only realistic but also precisely aligned with the desired illumination characteristics. In the inference process, this allows the updated 3D representation to be edited based on the output from the Generator.
Fig. 4.
The architecture of our computational framework, composed of the COP and GLTS modules, for single-view illumination inference and generative synthesis
To show the core capability of this synthesis engine, we evaluated the GLTS under precisely controlled, programmatic illumination. As shown in Fig. 2, as the dominant light source is programmatically shifted, our synthesized result accurately reproduces the corresponding migration of specular highlights and the geometric transformation of cast shadows, closely matching the ground-truth. This demonstrates that our GLTS, through its specialized training and hybrid guidance, has learned a physically plausible and controllable light transport model. Furthermore, to showcase the model’s ability to handle dynamic changes, Fig. 3 visualizes the continuity and consistency of relighting under varying illumination. Our method successfully generates a seamless and physically plausible transition of highlights and shadows, a critical capability for creating truly interactive experiences. In contrast, while a method like NRHints (Fig. 3, 4th row) can reproduce the intensity and position of specular highlights reasonably well, its cast shadows are often not photometrically plausible. They tend to appear overly sharp and disconnected from the scene geometry, failing to form the soft, physically-correct penumbras that our method achieves. This highlights the advantage of our generative approach in learning a more complete and realistic light transport model.
Fig. 2. Qualitative comparison with illumination estimation and editing methods.
From top to bottom, the rows display results for the following scenes: Signal, Amongus, Hotdogs, Space, Template, and Room
Fig. 3. Visualizing the continuity and consistency of relighting under smoothly varying illumination.
a Shows the full image sequences as the light source moves, and (b) displays the corresponding zoomed-in regions (ROI). In both panels, the rows correspond to Ground Truth, Intrinsic, IC-Light, NRHints, and Ours (from top to bottom). Our method (bottom row) demonstrates a smooth and physically plausible transition of highlights and shadows, achieving superior realism. Results from other learning-based methods, IC-Light (3rd row) and NRHints (4th row), are also shown for comparison. While both are designed for automatic relighting, methods like NRHints face challenges in generating such continuous sequences smoothly, as they often require per-scene optimization or manual parameter specification for distinct conditions
Experimental setup
The proposed framework is designed to relight current 3D neural representations with minimal on-the-fly environmental observations. Specifically, the method requires only a single view of the target environment that differs from the baked-in illumination of the 3D representation. From this single view, the COP module first infers the dominant light properties (intensity and direction) to discriminate a target light domain. A rendered image from the original 3D representation is then fed into the generative model to guide the synthesis of a photorealistic image under the new lighting. This synthesized view subsequently updates the neural representation to be consistent with the target illumination. To rigorously evaluate this method, we compare it against several state-of-the-art methods, although their foundational principles reveal their inherent limitations for our specific task.
Intrinsic image decomposition methods32 operate by simplifying the rendering process into a 2D image-space model. This approach fundamentally diverges from our goal, as it does not operate on or update a 3D scene representation (Eq. (5)). Instead of solving for an updated representation θ′, it attempts to decompose a single rendered view. Moreover, to relight the scene, this approach requires a complete, geometrically-aligned target shading map (Starget), but it offers no mechanism to infer this map, a proxy for the full incident light field Li,target in Eq. (4), from a novel observation. To fairly evaluate its best-case performance, our experiments therefore provide this method with the ground-truth shading map corresponding to the target light domain.
PNRNet39 tackles any-to-any relighting as an image-to-image translation task, rather than a 3D representation update problem. Its formulation explicitly requires comprehensive geometric information, such as the surface normal nx, to be provided as external inputs. This requirement of pre-existing geometry, which corresponds to the term n in the Rendering Equation (Eq. (2)), means it cannot operate directly on an implicit neural field where geometry is part of the learned representation θ. Consequently, it cannot solve for the updated parameters θ′ as defined in our objective (Eq. (6)). To ensure a fair comparison, we provided PNRNet with the ground-truth illumination image and depth map in our experiments.
IC-Light40 introduces a powerful diffusion-based approach grounded in the principle of light transport linearity, which follows from the Rendering Equation (Eq. (2)). While physically sound, this approach operates fundamentally in 2D image space. Its mechanism constrains the latent representation of individual 2D images, or slices of the outgoing light field Lo. This differs from our goal of updating a complete 3D scene representation (Eq. (6)), which is essential for generating a new, view-consistent 4D light field. It is not designed to maintain the multi-view geometric and photometric consistency that is the hallmark of a true light field representation as conceptualized in Eq. (1). For our comparison, we provided IC-Light with the target illumination and textual description.
NRHints41 conditions a neural radiance field on the light position l. While it operates on a 3D representation, its per-scene optimization approach presents a key limitation. The network weights are trained to reproduce a specific set of images, effectively entangling the scene’s implicit BRDF fr,implicit with the distribution of the training illumination Li,training. This lack of disentanglement prevents the model from generalizing to an arbitrary target illumination, Li,target, especially one with properties like intensity outside its training distribution. It therefore cannot reliably solve for a general-purpose θ′ as required by our core task in Eq. (6). Therefore, to conduct a fair comparison, we trained multiple NRHints models, one for each distinct target light intensity, providing ground-truth camera poses and light parameters for inference. Note that we adopted a per-intensity training protocol for NRHints to evaluate its upper-bound performance. Although NRHints suggests that illumination intensity can be handled via linear scaling, its per-scene optimization tends to entangle the implicit BRDF with the illumination distribution encountered during training. In our experiments, forcing a single NRHints model to cover a wide dynamic range of intensities often results in unstable optimization and suboptimal convergence, which is consistent with the difficulty of fitting one network under drastically varying gradients. Accordingly, training specialist models for each target intensity gives this baseline an advantage and reflects its best-case performance under its most favorable conditions.
In summary, to create a fair and rigorous benchmark, we tailored the inputs for each competing method to best suit its architecture and often provided ground-truth information to evaluate their optimal performance. It is important to note that this protocol establishes an upper-bound performance baseline for methods that rely on privileged auxiliary priors (e.g., ground-truth target shading for Intrinsic and ground-truth depth for PNRNet). In practical single-view applications where such priors are unavailable, their performance degrades substantially (see Table S1 in the Supplementary information). To comprehensively evaluate the synthesis quality, we employ three standard metrics: Peak Signal-to-Noise Ratio (PSNR) to measure pixel-level signal fidelity, Structural Similarity Index (SSIM) to assess structural preservation, and Learned Perceptual Image Patch Similarity (LPIPS)46 to quantify human perceptual realism. The following quantitative and qualitative analyses will demonstrate our framework’s superior performance and flexibility, especially in the practical and challenging context of single-view, uncalibrated relighting.
Quantitative analysis in Table 2 details our synthesizer’s performance against state-of-the-art methods, demonstrating superiority across key metrics of fidelity and perceptual realism, which are crucial for viewer immersion on a high-fidelity display.
Table 2.
Quantitative comparison of generative synthesis methods
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Runtime ↓ |
|---|---|---|---|---|
| Intrinsic32 | 20.78 | 0.8334 | 0.0971 | 21.12 |
| PNRNet39 | 21.85 | 0.8056 | 0.0984 | 1.28 |
| IC-Light40 | 18.90 | 0.7981 | 0.1283 | 7.26 |
| NRHints41 | 22.56 | 0.8230 | 0.1805 | 59.83 |
| Ours (w/o Dtext) | 18.77 | 0.5409 | 0.2409 | / |
| Ours (w/o svisual) | 23.16 | 0.6950 | 0.1030 | / |
| Ours | 24.59 | 0.8205 | 0.0616 | 0.15 |
Our method is benchmarked against several imaging and relighting methods. It achieves competitive performance, notably attaining the best learned perceptual image patch similarity (LPIPS)46 score, which indicates superior perceptual quality. PSNR is measured in decibels (dB) and Runtime in seconds (s). The semantic interpreter takes about 0.5 s per call. It is triggered only by detecting significant illumination changes, thus not affecting the per-frame rendering rate. The best result in each column is highlighted in bold. Arrows indicate whether higher (↑) or lower (↓) values are better
Our primary result is the direct, view-by-view synthesis quality, where we achieve the highest average PSNR of 24.59 dB. Crucially, our method excels in perceptual realism, achieving the best LPIPS score of 0.0616 by a substantial margin. For perceptual performance evaluation, this metric is arguably the most critical, as it quantifies alignment with human perception and the generation of physically plausible optical phenomena, such as soft penumbras and accurate specular highlights, which are paramount for immersive experiences.
In the domain of structural similarity, while a decomposition-based method like Intrinsic attains the highest SSIM, this performance relies on the idealized assumption of perfect view alignment discussed previously. In contrast, our method’s highly competitive SSIM of 0.8205 demonstrates its robustness in more realistic, view-misaligned scenarios. Furthermore, our method consistently outperforms or remains highly competitive against other specialized models like PNRNet, IC-Light, and NRHints across all metrics. Under this strengthened upper-bound setting, Table 2 shows that our single unified model still achieves superior overall fidelity and perceptual realism (PSNR 24.59 dB vs. 22.56 dB, and LPIPS 0.0616 vs. 0.1805), demonstrating the practical advantage of our framework for immersive interaction where maintaining and switching among multiple per-condition optimized models is infeasible.
In summary, the synergistic achievement of leading in both pixel-level fidelity (PSNR) and perceptual-physical realism (LPIPS), coupled with demonstrated robustness (SSIM), shows our generative model as a robust engine for high-fidelity, view-specific radiance field computation. This sets a solid foundation for the final reconstruction of the full 4D light field.
Synthesis of the full 4D adaptive light field
The final and definitive stage of our framework is to elevate the synthesized 2D views into a complete and globally coherent 4D light field, implicitly encoded in a new 3D neural scene representation θ′. This process transforms the collection of individual, view-specific syntheses into a unified, continuous model that fully embodies the target illumination from any viewpoint.
The process begins by leveraging our validated modules. The COP module infers the real-world illumination, which then guides the GLTS to synthesize a complete set of photometrically consistent target images {Îk}. A critical challenge arises at this step: the generative synthesis, while achieving high photometric realism, does not guarantee that the synthesized images are perfectly aligned with the original geometric camera parameters {pk}. Subtle, non-linear transformations inherent in the generative process can disrupt the strict spatio-photometric consistency required for high-fidelity 3D reconstruction.
To solve this problem, we formulate the final reconstruction not as a simple retraining, but as a joint optimization problem over both the scene representation and its corresponding camera geometry. Our goal is to find a new, self-consistent pair of a scene representation θ′ and a corresponding set of camera poses {p′k} that best explains the target appearance of the synthesized images {Îk}. This optimization seeks to minimize the discrepancy between renderings from the new model and our synthesized targets:
$$\big(\theta', \{p'_k\}\big) = \arg\min_{\theta,\, \{p_k\}} \sum_{k} \mathcal{L}\Big(\mathrm{Render}(\theta, p_k),\, \hat{I}_k\Big) \qquad (13)$$
where ℒ is a photometric loss function and Îk is the k-th synthesized target view. This formulation implicitly solves for the optimal camera geometry that aligns with the photometric reality of the synthesized images, rather than relying on the potentially misaligned original poses. In practice, this joint optimization can be realized by leveraging modern structure-from-motion and neural rendering pipelines that co-optimize scene parameters and camera extrinsics. This ensures the resulting 3D model is not just a collection of stylized images, but a true, continuous, and geometrically sound representation of the scene’s appearance under the new illumination.
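The following PyTorch sketch illustrates, in toy form, how Eq. (13) can be minimized by co-optimizing a scene model and per-view pose corrections under a photometric L1 loss. The `render` stand-in, the se(3)-style pose parameterization and all sizes are assumptions; the actual pipeline uses structure-from-motion and an Instant-NGP-style trainer as described in Materials and methods.

```python
import torch
import torch.nn as nn

def joint_optimize(scene, pose_deltas, render, target_images, steps=1000, lr=1e-3):
    """Minimize sum_k L_photo(render(scene, p_k + dp_k), I_k_target) over both
    the scene parameters and small pose corrections (cf. Eq. (13))."""
    params = list(scene.parameters()) + [pose_deltas]
    optim = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        optim.zero_grad()
        loss = 0.0
        for k, target in enumerate(target_images):
            pred = render(scene, pose_deltas[k])           # warm-started pose + correction
            loss = loss + torch.abs(pred - target).mean()  # photometric L1 loss
        loss.backward()
        optim.step()
    return scene, pose_deltas

# Toy stand-ins: a tiny "scene" MLP and a renderer that ignores real geometry.
scene = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
pose_deltas = nn.Parameter(torch.zeros(4, 6))              # 4 views, 6-DoF corrections
def render(model, dpose):                                  # placeholder renderer
    coords = torch.rand(128, 3) + 0.01 * dpose[:3].sum()
    return model(coords)
targets = [torch.rand(128, 3) for _ in range(4)]
joint_optimize(scene, pose_deltas, render, targets, steps=10)
```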
The interplay between these three stages and their underlying network architectures is visualized in Fig. 4. This illustrates the end-to-end data flow from initial perception to final reconstruction. The definitive result of this entire pipeline is a reconstructed 3D asset whose rendered appearance blends seamlessly into the real environment. This end-to-end result demonstrates that our framework successfully synthesizes a globally coherent 4D light field. It does not merely perform 2D image stylization but fundamentally reconstructs the optical and geometric properties of the scene under new illumination. The ability to create these high-fidelity, adaptive digital light fields from one single view of real-world observation is a prerequisite for truly immersive mixed reality experiences, marking a step forward in generating dynamic content for computational displays.
Extension to holographic display applications
While our primary focus is on high-fidelity 3D relighting, we further explored the potential of our framework for driving computational holographic displays. We utilized the synthesized images as inputs for the Tensor Holography pipeline5,8. The resulting holographic reconstructions achieved an average PSNR of 24.19 dB and SSIM of 0.4273. These metrics indicate that our method can generate high-quality source content suitable for calculating computer-generated holograms (CGH). Furthermore, as demonstrated in Fig. 3, our method maintains visual smoothness and plausible shadow transitions under continuously moving light sources. This temporal stability suggests that our framework holds promise for supporting dynamic and immersive holographic interactions, where consistent visual cues are essential for a comfortable viewing experience.
Discussion
In this work, we have presented a complete computational imaging framework for adaptive digital light field synthesis and interactive displays. Our results demonstrate that it is possible to solve the challenging inverse problem of inferring environmental illumination from a single low-dynamic-range view. Crucially, this inferred information can then drive a generative process to reconstruct a globally coherent, relit 3D neural scene representation. To assess the quality of this reconstruction, we analyzed the photometric consistency of the updated 3D representation against the original, static representation. Qualitatively, the original representation retains “baked-in” illumination, leading to static highlights and incorrect shadows when the environment changes. In contrast, our updated representation correctly localizes specularities and shadows according to the new light direction. Quantitatively, this update yields a significant performance boost, improving PSNR from 17.99 dB to 23.54 dB (see Table S5 in the Supplementary information), suggesting that the framework fundamentally upgrades the 3D asset to maintain global photometric consistency rather than performing superficial image-space filtering. The key significance of our approach lies in shifting the paradigm from direct, hardware-intensive light field capture (e.g., using HDR probes) to a more flexible and accessible computational one. By synergistically combining a multi-stage optical perception module, a hybrid-guided generative synthesizer, and a joint geometry-appearance optimization loop, our framework effectively “unfreezes” the static, baked-in illumination of neural representations, unlocking their potential for dynamic and interactive visual experiences on modern displays. Furthermore, the framework is architected to accommodate practical sensing limitations common in consumer hardware. In detail, the multi-scale feature encoder inherently suppresses high-frequency sensor noise through its downsampling and attention mechanisms. Simultaneously, the NIMI module is designed to mitigate dynamic range limitations by mapping compressed, tone-mapped LDR inputs back to a linearized radiometric manifold, allowing for the inference of plausible intensity values despite potential highlight clipping. This robustness is also supported by quantitative results, as shown in Table S3 in the Supplementary information. Specifically, even under severe additive noise (σ = 100), the degradation remains moderate (PSNR drops from 24.74 to 23.41 dB and LPIPS increases from 0.086 to 0.096), indicating stable performance under practical sensor noise. Moreover, under limited dynamic range caused by highlight clipping, NIMI provides a clear benefit: for clipping of the top 10% of intensities, using NIMI improves PSNR from 22.74 to 23.98 dB and reduces LPIPS from 0.097 to 0.091 compared to the variant without NIMI.
The success of our framework is rooted in three key design principles that address the core challenges of data-scarce relighting. First, our two-stage COP module robustly handles the ill-posed nature of light estimation. It first infers a structured latent representation of the lighting. Crucially, the subsequent semantic interpreter then acts as a powerful generative prior, refining these potentially unstable latent parameters into a physically-plausible numerical output while simultaneously translating them into a robust textual descriptor. This estimate-and-refine approach proved far more effective in guiding the generative process than relying on a potentially unstable raw vector. Second, GLTS moves beyond simple domain translation by adopting a specialized training objective. By focusing the model’s entire learning capacity on modeling the light transport of a single object, it learns a more accurate implicit BRDF, which can bridge high-level parametric control with fine-grained visual realism. Finally, by formulating the reconstruction as a joint optimization problem, we explicitly address the spatio-photometric inconsistencies introduced by the generative step, ensuring the final 3D model is not only photometrically plausible but also geometrically sound.
Our work is unique at this intersection. Traditional relighting techniques often depend on capturing full panoramic images; while they produce high-quality results, such capture is impractical for the casual, on-the-fly scenarios envisioned for consumer immersive interaction. Recent advances in intrinsic decomposition for neural fields have made significant progress, yet they typically require dense multi-view supervision under varying lighting conditions, a requirement our framework explicitly eliminates. Our approach, specially designed for the challenging data-scarce scenario, holds a critical advantage for real-world deployment. The final joint optimization stage ensures that the reconstructed light field remains consistent across views and that lighting effects are geometrically correct from any novel viewpoint, which is essential for immersive experiences.
Despite its successes, our framework has several limitations that present opportunities for future research. The first is that our COP module currently models the incident illumination as a single dominant directional source. This formulation relies on estimating an effective dominant light aligned with the dominant shadow-casting source, a choice necessitated by three fundamental obstacles in single-view inverse rendering. (i) A single image presents an entanglement of unknown geometry, material, and lighting47, making the decomposition of complex lighting theoretically ill-posed without multi-view constraints48. (ii) From a data perspective, explicitly modeling every permutation of multiple light positions and colors leads to a combinatorial explosion; acquiring a real-world dataset that covers this exponentially growing space with ground truth is practically intractable49. (iii) Regarding model stability, high-capacity representations (e.g., high-order spherical harmonics) significantly increase the parameter space; under sparse single-view cues, such models are prone to overfitting or unstable hallucinations50,51, whereas our compact directional model acts as a robust regularizer for stable synthesis. This simplification, while effective for many common scenes, struggles to represent complex environments with multiple significant light sources of varying colors. Additionally, in a single-view setting, the distance and intensity of a light source cannot be determined from appearance alone: a weaker nearby light and a stronger distant light can produce similar shadows on the object. Therefore, our method does not aim to recover the exact position of the light source. Instead, the COP module estimates an effective dominant illumination (direction and irradiance) from observable shadows and reflectance. We use a distant-light assumption so that this illumination is approximately consistent across the object. This design avoids the ambiguity between distance and intensity while providing stable lighting control for GLTS to synthesize photometrically consistent results. The second limitation is that, while our GLTS is powerful, its ability to represent highly complex, non-Lambertian materials with anisotropic properties or significant subsurface scattering has not been fully explored; the accuracy of relighting such materials may be limited by the expressive power of the generative network. Finally, the iterative nature of the joint optimization loop, while ensuring high quality, is not yet optimized for real-time performance on mobile devices, which remains a crucial challenge for live immersive interaction applications.
These limitations directly motivate several promising future directions. The dominant light model could be extended to a more expressive representation, such as a mixture of lights or a low-frequency spherical harmonics basis, by adapting the COP module to output multiple light parameter sets. For computational efficiency, future work could explore single-stage optimization techniques that jointly solve for the neural representation and its relighting in an end-to-end fashion, or leverage model distillation for deployment on resource-constrained hardware. Beyond near-eye displays, the principles of our adaptive light field synthesis framework have broad applicability in other cutting-edge display technologies. For instance, glasses-free 3D light field displays could leverage our method to render dynamic content from sparse inputs. In automotive head-up displays (HUDs), it could be used to generate virtual indicators that are correctly lit by the changing ambient conditions. Furthermore, the ability to computationally control illumination could be integrated with spatially adaptive liquid crystal optics to achieve novel visual effects and greater energy efficiency in future display systems.
In conclusion, this work presents a robust and practical computational framework that bridges the gap between one single view of optical sensing and the generation of high-fidelity content for 3D displays. By reformulating the relighting problem as a multi-stage process of perception, synthesis, and joint optimization, we have demonstrated a path towards achieving high-fidelity, editable control over the appearance of neural scene representations without the need for specialized hardware or extensive data capture. This advancement in adaptive digital light fields is poised to accelerate the development of the next generation of truly interactive and photorealistic computational display systems.
Materials and methods
Network architectures
Our framework is composed of three primary modules: the COP module, the GLTS module, and the final reconstruction pipeline.
COP. As described in our Results section, the COP module operates in two stages to infer the illumination parameters (E, D, Dtext).
Stage 1: Multi-scale Inference Engine (Finfer). The core of this engine is a custom CNN encoder designed to extract features at multiple spatial resolutions. The architecture is composed of five sequential convolutional blocks that begin with a base of 48 channels; the channel count doubles after each max-pooling layer, increasing progressively from 48 to 768 (48 → 96 → 192 → 384 → 768). Each block utilizes a 3 × 3 convolution, followed by a GELU activation function and instance normalization. Crucially, each block is enhanced with a standard channel and spatial attention mechanism to refine the feature maps by focusing on informative photometric cues. To leverage information from different resolutions, features from the intermediate blocks are extracted. These multi-scale features are then resized to a common spatial dimension and concatenated, forming a rich, aggregated feature representation. This aggregated tensor is passed through an adaptive average pooling layer and then fed to two parallel 3-layer MLP heads. The direction head outputs the 3D latent directional vector D. The intensity head outputs the latent parameters for irradiance. These outputs form the structured, implicit representation passed to the next stage. All MLPs use a 0.5 dropout rate for regularization.
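A condensed, illustrative PyTorch sketch of such an encoder is given below. The hidden MLP widths, the simplified channel gate (standing in for the full channel/spatial attention of Eq. (7)) and the per-scale pooling used for aggregation are assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """3x3 conv -> GELU -> instance norm, with a simple channel gate standing in
    for the channel/spatial attention of Eq. (7)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.act = nn.GELU()
        self.norm = nn.InstanceNorm2d(c_out)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(c_out, c_out, 1), nn.Sigmoid())
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.norm(self.act(self.conv(x)))
        x = x * self.gate(x)
        return x, self.pool(x)           # (features for aggregation, downsampled input)

def mlp_head(c_in, c_out):
    return nn.Sequential(nn.Linear(c_in, 256), nn.GELU(), nn.Dropout(0.5),
                         nn.Linear(256, 128), nn.GELU(), nn.Dropout(0.5),
                         nn.Linear(128, c_out))

class COPInferenceEngine(nn.Module):
    def __init__(self, widths=(48, 96, 192, 384, 768)):
        super().__init__()
        chans = (3,) + widths
        self.blocks = nn.ModuleList(ConvBlock(chans[i], chans[i + 1])
                                    for i in range(len(widths)))
        self.pool = nn.AdaptiveAvgPool2d(1)
        agg = sum(widths)
        self.direction_head = mlp_head(agg, 3)   # latent direction vector D
        self.intensity_head = mlp_head(agg, 1)   # latent irradiance parameters

    def forward(self, img):
        feats, x = [], img
        for block in self.blocks:
            f, x = block(x)
            feats.append(self.pool(f).flatten(1))  # pool each scale, then concatenate
        z = torch.cat(feats, dim=1)
        return self.intensity_head(z), self.direction_head(z)

E, D = COPInferenceEngine()(torch.randn(1, 3, 256, 256))
print(E.shape, D.shape)   # torch.Size([1, 1]) torch.Size([1, 3])
```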
Stage 2: Semantic Interpreter (Fsem). The semantic description Dtext is generated using a vision-encoder-decoder model. We employed a pre-trained Vision Transformer as the image encoder and a pre-trained DistilGPT-252 as the autoregressive text decoder. The tokenizer for the decoder was extended to include the unique textual labels for each lighting direction present in our dataset. The entire model is fine-tuned end-to-end to generate a textual description of the light’s direction based on the input image and the initially estimated vector D.
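A hedged sketch of how such a ViT encoder and DistilGPT-2 decoder can be wired together with the Hugging Face transformers library is shown below; the checkpoint names and direction labels are assumptions, not the paper's exact choices.

```python
from transformers import (AutoTokenizer, ViTImageProcessor,
                          VisionEncoderDecoderModel)

# Checkpoint names are illustrative assumptions.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Extend the tokenizer with the dataset's lighting-direction labels (illustrative).
direction_labels = ["<upper_right>", "<upper_left>", "<lower_right>",
                    "<lower_left>", "<front>", "<back>"]
tokenizer.add_tokens(direction_labels)
model.decoder.resize_token_embeddings(len(tokenizer))

# Minimal generation config needed before fine-tuning / inference.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Inference on an image `img` (PIL or array) would look like:
# pixel_values = processor(images=img, return_tensors="pt").pixel_values
# ids = model.generate(pixel_values, max_new_tokens=32)
# print(tokenizer.decode(ids[0], skip_special_tokens=True))
```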
GLTS. The architectural design of our GLTS is centered on the principle of style-based modulation. This is a powerful technique for controlling the appearance of synthesized images. The generator’s backbone is a U-Net-like structure with skip connections, which is effective at preserving spatial details from the input view. The core of its controllability, however, lies in the use of Adaptive Instance Normalization (AdaIN) at each convolutional layer. This mechanism allows a computed style code to directly modulate the statistical properties (mean and variance) of the feature maps throughout the generator, thereby effectively controlling the visual style of the output image.
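A minimal AdaIN layer of this kind is sketched below; the style and channel dimensions are illustrative, and the (1 + scale) parameterization is one common convention rather than the paper's stated choice.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: a style code predicts per-channel
    scale and bias that replace the feature map's own statistics."""
    def __init__(self, style_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(style_dim, channels * 2)

    def forward(self, x, style):
        scale, bias = self.affine(style).chunk(2, dim=1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)
        bias = bias.unsqueeze(-1).unsqueeze(-1)
        return (1 + scale) * self.norm(x) + bias

# The final style code s_final modulates every generator layer in this way.
layer = AdaIN(style_dim=512, channels=256)
out = layer(torch.randn(1, 256, 32, 32), torch.randn(1, 512))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```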
The key to our hybrid-guided relighting task is the sophisticated design of the mapping network that produces this style code. We engineered it as a multi-input module that processes all components of our perceived illumination: the textual descriptor Dtext, the numerical vector D, and the visual style code svisual (extracted from the single image via a separate lightweight CNN encoder). This network fuses these inputs to produce the final, comprehensive style code sfinal, which conditions the generator to synthesize an image with the desired light transport characteristics.
Training datasets and protocols
Synthetic Dataset Generation. All training data was synthetically generated using Blender. We created a dataset consisting of 12 distinct 3D objects with diverse materials, ranging from diffuse to specular. For each object, we rendered a set of 300 multi-view images by moving a virtual camera along a fixed Bézier curve trajectory that encircled the object. This entire rendering process was systematically repeated for a matrix of controlled lighting conditions: 6 distinct directional light sources and 6 discrete intensity levels, resulting in a comprehensive dataset for training our perception and synthesis modules.
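A schematic Blender Python (bpy) sketch of such a lighting sweep is shown below, assuming a pre-built scene with a sun lamp named "Sun" and a camera already animated along a Bézier-curve path. The direction and intensity values echo Table 1, while the sweep order, object names and output paths are assumptions, not the authors' exact script.

```python
import math
import bpy

scene = bpy.context.scene
sun = bpy.data.objects["Sun"]                      # assumed sun lamp name

directions = [(45, 30), (20, 25), (30, 70), (70, 15), (50, 50), (80, 35)]  # (azimuth, elevation) in degrees
intensities = [4.5, 6.5, 8.0, 9.0, 12.2, 18.0]     # illustrative intensity levels
n_views = 300

for d_idx, (azim, elev) in enumerate(directions):
    # Orient the sun lamp: Blender sun lights shine along their local -Z axis.
    sun.rotation_euler = (math.radians(90 - elev), 0.0, math.radians(azim))
    for i_idx, energy in enumerate(intensities):
        sun.data.energy = energy
        for view in range(n_views):
            scene.frame_set(view)  # camera advances along the Bezier path per frame
            scene.render.filepath = f"//renders/d{d_idx}_i{i_idx}_v{view:03d}.png"
            bpy.ops.render.render(write_still=True)
```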
COP Module Training. The inference engine was trained in a two-phase process to ensure stability and precision. In Phase 1 coarse pre-training, the model was trained for 100 epochs using the AdamW optimizer with a learning rate of 1e-4, a batch size of 32, and a Cosine Annealing learning rate scheduler. An early stopping mechanism with a patience of 10 epochs was employed to prevent overfitting. In Phase 2 fine-tuning, the best model from Phase 1 was fine-tuned for an additional 50 epochs with a smaller batch size of 16 and differential learning rates: 1e-6 for the encoder backbone and 5e-6 for the MLP heads. An L1 loss (mean absolute error) was used as the objective function for both phases. The semantic interpreter was trained separately for 10,000 steps with a learning rate of 1e-5 and a weight decay of 0.001, using a standard cross-entropy loss on the generated text tokens.
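A minimal PyTorch sketch of this two-phase optimizer setup is given below; the placeholder encoder and head modules, batch shapes and output dimensionality are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Placeholder submodules standing in for the multi-scale encoder and MLP heads.
encoder = nn.Sequential(nn.Conv2d(3, 48, 3, padding=1), nn.GELU())
heads = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(48, 4))
l1_loss = nn.L1Loss()  # objective used in both phases

# Phase 1: coarse pre-training, single learning rate with cosine annealing.
params = list(encoder.parameters()) + list(heads.parameters())
opt_phase1 = torch.optim.AdamW(params, lr=1e-4)
sched_phase1 = torch.optim.lr_scheduler.CosineAnnealingLR(opt_phase1, T_max=100)

# Phase 2: fine-tuning with differential learning rates per parameter group.
opt_phase2 = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 1e-6},  # encoder backbone
    {"params": heads.parameters(), "lr": 5e-6},    # MLP heads
])

# Skeleton of one phase-1 training step (data loading and the patience-10
# early-stopping loop are omitted):
images, targets = torch.randn(32, 3, 256, 256), torch.randn(32, 4)
pred = heads(encoder(images))
loss = l1_loss(pred, targets)
loss.backward()
opt_phase1.step()
opt_phase1.zero_grad()
sched_phase1.step()
```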
GLTS Module Training. The GLTS was trained following its specialized training objective. The optimization was driven by a combination of a pixel-wise L1 reconstruction loss, a perceptual LPIPS loss to encourage realistic details, and an adversarial loss with R1 regularization. This approach is similar to that used in powerful generative models, which have found diverse applications from general image synthesis53 to advanced computational microscopy and phase retrieval54. We used the Adam optimizer with β1 = 0.0, β2 = 0.99, and a learning rate of 1e-4. The training was conducted for 200k iterations with a batch size of 4.
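The following sketch illustrates how these loss terms can be assembled in PyTorch with the open-source lpips package; the loss weights, the non-saturating adversarial form and the R1 weight are assumptions, since the text specifies only the loss types and optimizer settings.

```python
import torch
import lpips  # third-party package: pip install lpips

lpips_loss = lpips.LPIPS(net="vgg")  # perceptual term; expects inputs in [-1, 1]

def generator_loss(fake, target, d_fake_logits, w_l1=1.0, w_lpips=1.0, w_adv=1.0):
    """L1 reconstruction + LPIPS perceptual + non-saturating adversarial loss."""
    l1 = torch.abs(fake - target).mean()
    perc = lpips_loss(fake, target).mean()
    adv = torch.nn.functional.softplus(-d_fake_logits).mean()
    return w_l1 * l1 + w_lpips * perc + w_adv * adv

def r1_penalty(discriminator, real, gamma=10.0):
    """R1 regularization: penalize the discriminator's gradient on real images."""
    real = real.detach().requires_grad_(True)
    logits = discriminator(real)
    grad, = torch.autograd.grad(outputs=logits.sum(), inputs=real, create_graph=True)
    return 0.5 * gamma * grad.pow(2).flatten(1).sum(1).mean()

# Optimizer settings reported in the text (generator/discriminator assumed to exist):
# g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.99))
# d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.0, 0.99))
```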
Experimental setup for validation
3D Scene Reconstruction Pipeline. The final reconstruction, formulated as the joint optimization problem in Eq. (13), was implemented using a standard and robust pipeline. The joint optimization of poses and scene parameters is a non-convex problem. To ensure stability and prevent local optima, we employ a warm-start strategy by initializing the system with the precisely calibrated poses from the original 3D representation as a strong geometric prior. Rather than a random initialization, the pipeline performs a constrained refinement of these poses. Furthermore, robust estimation strategies (e.g., RANSAC-based outlier rejection and reprojection error gating) are used to filter out inconsistent feature matches caused by illumination changes, ensuring that the pose updates are driven by geometric consistency. First, the set of synthesized views was processed by established open-source Structure-from-Motion software55 to solve for the self-consistent camera poses {p′k}. This approach, while rooted in computer vision, parallels advances in active optical 3D sensing that also leverage computational methods for high-fidelity reconstruction from captured sensor data56. Second, these newly registered images and poses were used as input to train a NeRF model via an Instant-NGP implementation57 to obtain the final continuous 3D representation.
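As a hedged sketch of this registration step, the following Python snippet drives the standard COLMAP command-line pipeline via subprocess; the file paths are placeholders and the downstream Instant-NGP training call is omitted.

```python
import os
import subprocess

db, images, sparse = "scene.db", "synthesized_views/", "sparse/"
os.makedirs(sparse, exist_ok=True)

# Standard COLMAP stages: feature extraction, matching, and sparse mapping.
for cmd in (
    ["colmap", "feature_extractor", "--database_path", db, "--image_path", images],
    ["colmap", "exhaustive_matcher", "--database_path", db],
    ["colmap", "mapper", "--database_path", db, "--image_path", images,
     "--output_path", sparse],
):
    subprocess.run(cmd, check=True)

# The recovered poses {p'_k} and the synthesized views are then passed to an
# Instant-NGP-style NeRF trainer to obtain the relit 3D representation.
```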
Hardware and Software. For our conceptual validation and as the target application platform, we consider a modern consumer mixed reality device, the Meta Quest 3. Its dual front-facing RGB cameras provide the real-time passthrough video stream from which the sparse input images are assumed to be captured. All model training and offline validation experiments were conducted on a platform equipped with a single NVIDIA GeForce RTX 3080 Ti GPU.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant numbers 62275186, 62301353, 62201374, 62575197, 62201372, 62535003, 62401383, 62575198, 62535017 and by the Suzhou Science and Technology Planning Project - Industrial Foresight and Key Core Technology Project under Grant number SYC2022140.
Author contributions
X.H., J.X. and C.W. conceived the idea. J.X. and J.S. designed the experiments. J.S. provided the wearable device and high-performance computing cluster. X.H., J.X. and J.S. collected the data and performed the experiments. X.H., J.X., J.S., K.W., F.X., M.C., C.C. and J.P. analyzed and interpreted the results. X.H., J.X., J.S., K.W. and F.X. wrote the original manuscript. J.S. and C.W. supervised the project.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflict of interest
The authors declare no competing interests.
Footnotes
These authors contributed equally: Xuyang Hong, Jie Xie, Jie Sheng
Supplementary information
The online version contains supplementary material available at 10.1038/s41377-026-02234-4.
References
- 1. Liu, H. S. et al. Learning-based real-time imaging through dynamic scattering media. Light Sci. Appl. 13, 194 (2024).
- 2. Adelson, E. H. & Bergen, J. R. The plenoptic function and the elements of early vision. In Landy, M. & Movshon, J. A. (eds.) Computational Models of Visual Processing, 3–20 (The MIT Press, Cambridge, 1991).
- 3. Levoy, M. & Hanrahan, P. Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, 31–42 (ACM, New Orleans, LA, USA, 1996).
- 4. Wang, B. W. et al. Single-shot super-resolved fringe projection profilometry (sssr-fpp): 100,000 frames-per-second 3d imaging with deep learning. Light Sci. Appl. 14, 70 (2025).
- 5. Shi, L., Li, B. C. & Matusik, W. End-to-end learning of 3d phase-only holograms for holographic display. Light Sci. Appl. 11, 247 (2022).
- 6. Wang, Z. Q. et al. Computational optical imaging: on the convergence of physical and digital layers. Optica 12, 113–130 (2025).
- 7. Mengu, D. et al. At the intersection of optics and deep learning: statistical inference, computing, and inverse design. Adv. Opt. Photonics 14, 209–290 (2022).
- 8. Shi, L. et al. Towards real-time photorealistic 3d holography with deep neural networks. Nature 591, 234–239 (2021).
- 9. Wu, D. X. et al. Imaging biological tissue with high-throughput single-pixel compressive holography. Nat. Commun. 12, 4712 (2021).
- 10. Barbastathis, G., Ozcan, A. & Situ, G. H. On the use of deep learning for computational imaging. Optica 6, 921–943 (2019).
- 11. Suo, J. L. et al. Computational imaging and artificial intelligence: the next revolution of mobile vision. Proc. IEEE 111, 1607–1639 (2023).
- 12. Xiong, J. H. et al. Augmented reality and virtual reality displays: emerging technologies and future perspectives. Light Sci. Appl. 10, 216 (2021).
- 13. Solomashenko, A. B. et al. Industrial applications of ar headsets: a review of the devices and experience. Light Adv. Manuf. 6, 358–387 (2025).
- 14. Karsch, K. et al. Rendering synthetic objects into legacy photographs. ACM Trans. Graph. 30, 1–12 (2011).
- 15. Mania, K. et al. The effect of visual and interaction fidelity on spatial cognition in immersive virtual environments. IEEE Trans. Vis. Comput. Graph. 12, 396–404 (2006).
- 16. Kajiya, J. T. The rendering equation. ACM SIGGRAPH Computer Graphics 20, 143–150 (1986).
- 17. Cook, R. L. & Torrance, K. E. A reflectance model for computer graphics. ACM Trans. Graph. 1, 7–24 (1982).
- 18. Zhang, J. Y. et al. Single image relighting based on illumination field reconstruction. Opt. Express 31, 29676–29694 (2023).
- 19. Hu, X. M. et al. Robust and accurate transient light transport decomposition via convolutional sparse coding. Opt. Lett. 39, 3177–3180 (2014).
- 20. Kim, T. S. et al. Future trends of display technology: micro-leds toward transparent, free-form, and near-eye displays. Light Sci. Appl. 14, 335 (2025).
- 21. Ding, Y. Q. et al. Breaking the in-coupling efficiency limit in waveguide-based ar displays with polarization volume gratings. Light Sci. Appl. 13, 185 (2024).
- 22. Tian, Z. T. et al. An achromatic metasurface waveguide for augmented reality displays. Light Sci. Appl. 14, 94 (2025).
- 23. Luo, Z. Y. et al. Achromatic diffractive liquid-crystal optics for virtual reality displays. Light Sci. Appl. 12, 230 (2023).
- 24. Yin, K. et al. Advanced liquid crystal devices for augmented reality and virtual reality displays: principles and applications. Light Sci. Appl. 11, 161 (2022).
- 25. Kress, B. C. & Pace, M. Holographic optics in planar optical systems for next generation small form factor mixed reality headsets. Light Adv. Manuf. 3, 771–801 (2022).
- 26. Park, J.-H. & Lee, B. Holographic techniques for augmented reality and virtual reality near-eye displays. Light Adv. Manuf. 3, 137–150 (2022).
- 27. Zhang, X. M. et al. Nerfactor: neural factorization of shape and reflectance under an unknown illumination. ACM Trans. Graph. 40, 237 (2021).
- 28. Srinivasan, P. P. et al. Nerv: neural reflectance and visibility fields for relighting and view synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7491–7500 (IEEE, Nashville, TN, USA, 2021).
- 29. Verbin, D. et al. Ref-nerf: structured view-dependent appearance for neural radiance fields. IEEE Trans. Pattern Anal. Mach. Intell. 47, 9426–9437 (2025).
- 30. Mildenhall, B. et al. Nerf: representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 99–106 (2022).
- 31. Kerbl, B. et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42, 139 (2023).
- 32. Bell, S., Bala, K. & Snavely, N. Intrinsic images in the wild. ACM Trans. Graph. 33, 159 (2014).
- 33. Wang, F. et al. Single-pixel imaging using physics enhanced deep learning. Photonics Res. 10, 104–110 (2021).
- 34. Lalonde, J.-F. et al. Photo clip art. ACM Trans. Graph. 26, 3 (2007).
- 35. Gardner, A. et al. Linear light source reflectometry. ACM Trans. Graph. 22, 749–758 (2003).
- 36. Song, S. R. & Funkhouser, T. Neural illumination: lighting prediction for indoor environments. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6911–6919 (IEEE, Long Beach, CA, USA, 2019).
- 37. Shen, S. Y. et al. Illumidiff: indoor illumination estimation from a single image with diffusion model. IEEE Trans. Vis. Comput. Graph. 31, 7752–7768 (2025).
- 38. Ponglertnapakorn, P., Tritrong, N. & Suwajanakorn, S. Difareli: diffusion face relighting. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 22589–22600 (IEEE, Paris, France, 2023).
- 39. Hu, Z. Y. et al. Pnrnet: physically-inspired neural rendering for any-to-any relighting. IEEE Trans. Image Process. 31, 3935–3948 (2022).
- 40. Zhang, L. M., Rao, A. Y. & Agrawala, M. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In Proc. 13th International Conference on Learning Representations (OpenReview.net, Singapore, 2025).
- 41. Zeng, C. et al. Relighting neural radiance fields with shadow and highlight hints. In ACM SIGGRAPH 2023 Conference Proceedings, 73 (ACM, Los Angeles, CA, USA, 2023).
- 42. Sharma, A., Tan, R. T. & Cheong, L.-F. Single-image camera response function using prediction consistency and gradual refinement. In Proc. 15th Asian Conference on Computer Vision, 19–35 (Springer, Kyoto, Japan, 2021).
- 43. Grossberg, M. D. & Nayar, S. K. Determining the camera response from images: what is knowable? IEEE Trans. Pattern Anal. Mach. Intell. 25, 1455–1467 (2003).
- 44. Banterle, F. et al. A framework for inverse tone mapping. Vis. Comput. 23, 467–478 (2007).
- 45. Reinhard, E. et al. Photographic tone reproduction for digital images. ACM Trans. Graph. 21, 267–276 (2002).
- 46. Zhang, R. et al. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 586–595 (IEEE, Salt Lake City, UT, USA, 2018).
- 47. Barron, J. T. & Malik, J. Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1670–1687 (2015).
- 48. Ramamoorthi, R. & Hanrahan, P. A signal-processing framework for inverse rendering. In Proc. 28th Annual Conference on Computer Graphics and Interactive Techniques, 117–128 (ACM, Los Angeles, CA, USA, 2001).
- 49. Debevec, P. The light stages and their applications to photoreal digital actors. In SIGGRAPH Asia (ACM, Singapore, 2012).
- 50. Basri, R. & Jacobs, D. W. Lambertian reflectance and linear subspaces. IEEE Trans. Pattern Anal. Mach. Intell. 25, 218–233 (2003).
- 51. Zhang, J. S. et al. All-weather deep outdoor lighting estimation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10150–10158 (IEEE, Long Beach, CA, USA, 2019).
- 52. Sanh, V. et al. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter (2019).
- 53. Karras, T. et al. Analyzing and improving the image quality of stylegan. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8107–8116 (IEEE, Seattle, WA, USA, 2020).
- 54. Rivenson, Y. et al. Deep learning microscopy. Optica 4, 1437–1443 (2017).
- 55. Schönberger, J. L. & Frahm, J.-M. Structure-from-motion revisited. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4104–4113 (IEEE, Las Vegas, NV, USA, 2016).
- 56. Chen, W. W. et al. Deep-learning-enabled temporally super-resolved multiplexed fringe projection profilometry: high-speed khz 3d imaging with low-speed camera. PhotoniX 5, 25 (2024).
- 57. Müller, T. et al. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41, 102 (2022).