Abstract
With the rapid advancement of Unmanned Aerial Vehicle (UAV) applications, vision-based 3D scene reconstruction has demonstrated significant value in fields such as remote sensing and target detection. However, scenes captured by UAVs are often large-scale, sparsely viewed, and complex. These characteristics pose significant challenges for neural radiance field (NeRF)-based reconstruction. Specifically, the reconstructed images may suffer from blurred edges and unclear textures. This is primarily due to the lack of edge information and the fact that certain objects appear in only a few images, leading to incomplete reconstructions. To address these issues, this paper proposes a hybrid image encoder that combines convolutional neural networks (CNNs) and Transformers to extract image features that assist NeRF in scene reconstruction and novel view synthesis. Furthermore, we extend the NeRF architecture by introducing an additional branch that estimates uncertainty values associated with transient regions in the scene, enabling the model to suppress dynamic content and focus on static structure reconstruction. To further improve synthesis quality, we also refine the loss function used during training to better guide network optimization. Experimental results on a custom UAV aerial imagery dataset demonstrate the effectiveness of our method in accurately reconstructing and rendering UAV-captured scenes.
Subject terms: Mathematics and computing, Computer science
Introduction
In recent years, with the rapid advancement and widespread adoption of unmanned aerial vehicles (UAVs), the 3D reconstruction of UAV-acquired scenes has emerged as a significant research focus. Multi-view image-based 3D reconstruction techniques are essential in a wide range of remote sensing applications, including urban planning and disaster assessment. By constructing detailed outdoor 3D scenes, these techniques not only improve analytical efficiency but also provide more reliable data support for related applications. In UAV aerial photography scenarios, traditional 3D reconstruction methods often fail to meet reconstruction requirements due to factors such as the UAV’s altitude and changes in viewing angle. Traditional 3D reconstruction methods based on SLAM or multi-view stereo, such as structure-from-motion1 (SfM) or image-based rendering2 (IBR), often rely on dense image acquisition and complex post-processing. The SfM method is prone to unstable feature matching when handling large-scale aerial images, which can reduce the accuracy of the resulting sparse point clouds. Similarly, the IBR method may struggle to maintain geometric consistency across different viewpoints, making it less effective for processing sparse and dynamic UAV data in complex environments.
In contrast, with the continuous development of neural rendering technology, Neural Radiance Fields3 (NeRF) have been proposed as a continuous implicit representation method for 3D scenes. NeRF does not rely on explicit geometric modeling, but instead learns the mapping from rays to colors in an end-to-end manner. Given these differences from conventional approaches, there is growing interest in applying NeRF to remote sensing tasks. When applied to UAV-captured scenes, NeRF can generate detailed and realistic images from novel viewpoints while capturing fine variations in lighting and texture. Furthermore, unlike traditional methods that require extensive sensor data and complex geometric computations, NeRF can achieve high-quality reconstruction with relatively sparse input images. This makes it more suitable for UAV scenarios, where viewpoints are dispersed and image data may be limited, offering greater efficiency and flexibility in large-scale aerial reconstruction.
When using NeRF for scene reconstruction from UAV-captured images, three main issues arise. First, UAVs often capture images from high altitudes with limited side views, leading to sparse viewpoints and insufficient geometric information, which in turn results in distortions in the reconstructed geometry. Second, due to the wide acquisition range during UAV photography, some areas in the captured images suffer from insufficient coverage. While objects in the center of the images are relatively clear, those at the edges often lack detail and, in some cases, may be partially missing or blurred. This leads to missing information in low-overlap regions and results in holes or blurring in the reconstruction. Finally, moving objects such as pedestrians or vehicles may appear in UAV-captured scenes, but these are not the targets of reconstruction. Their presence introduces a certain degree of randomness in the input data, which can lead to artifacts in the NeRF-generated results.
Our main contributions are as follows:
We propose a hybrid image encoder, HybridDroneEncoder, which integrates Convolutional Neural Networks (CNNs) and Transformer architectures. This encoder leverages the local feature extraction capability of CNNs and the global context modeling strength of Transformers to extract semantically rich features for scene representation. These features are used to enhance NeRF-based 3D reconstruction.
We extend the original NeRF architecture by introducing an additional MLP designed to predict uncertainty values for transient regions in UAV images. This allows the model to identify and suppress dynamic content, thereby enhancing the rendering quality of the static scene.
We optimize the network’s loss function to improve training effectiveness and reconstruction quality. Additionally, we introduce Random Fourier Feature Encoding (RFFE), which combines deterministic and stochastic components to enhance positional encoding for capturing high-frequency details.
Related work
Novel view synthesis
Novel View Synthesis (NVS) aims to generate photorealistic renderings of a 3D scene from arbitrary viewpoints, given a set of images or video frames of the same scene along with their corresponding camera poses. Traditional NVS approaches are geometry-based and typically employ Structure-from-Motion (SfM) and Bundle Adjustment (BA)1 to establish multi-view geometric consistency. These methods reconstruct a sparse point cloud via feature matching and triangulation, which is then converted into a 3D model using voxel grids or mesh surfaces. While such representations are geometrically interpretable, they are highly sensitive to matching quality and often fail in the presence of low-texture regions, repeated patterns, or occlusions. In addition, the quality of the rendered images is constrained by the sampling density and texture resolution. An alternative strategy involves Light Field methods, which capture scenes using densely sampled viewpoint grids and represent them using 4D or 6D light fields. These methods enable fast rendering and can accurately reproduce complex lighting effects. However, they demand densely sampled viewpoint data, leading to high acquisition and storage costs, which limits their scalability in practical applications. Inspired by Layered Depth Images (LDI), researchers have proposed discretized volumetric representations such as Multi-Plane Images4 (MPI) and Multi-Sphere Images5 (MSI), which synthesize novel views using alpha compositing or differentiable ray marching. Compared to these approaches, volumetric scene representation methods leverage end-to-end neural networks to model the light transport process, without relying on explicit 3D reconstruction or geometric inference. By implicitly learning the mapping from rays to colors through neural networks, these methods can synthesize novel views with realistic lighting and fine details from multi-view image sets. Among these, Neural Radiance Fields (NeRF)6 has emerged as a breakthrough, particularly excelling in weakly textured regions and on complex surface materials. NeRF preserves geometric coherence and reduces common artifacts seen in traditional methods, such as holes or ghosting, leading to synthesized images that closely resemble real photographs. This has ushered in a new paradigm for implicit scene reconstruction.
Neural rendering
Neural rendering7, as an innovative paradigm at the intersection of computer graphics and deep learning, enables high-fidelity and differentiable image synthesis by modeling the geometry and appearance of 3D scenes through neural networks. By deeply integrating traditional rendering techniques with neural representations, it overcomes the limitations of conventional methods that heavily rely on explicit geometry. Leveraging the powerful representational capacity of deep neural networks, this approach optimizes the color and density distributions of scene elements through backpropagation, enabling accurate and efficient 3D reconstruction. Furthermore, neural rendering architectures designed with geometric awareness can jointly infer scene geometry and material properties from multi-view images, allowing for detailed reconstruction of complex scenes. Currently, several works incorporate explicit geometric representations into the rendering process, such as meshes8, point clouds9, or Signed Distance Functions10 (SDFs), and subsequently optimize the scene through the integration of texture mapping networks. Such hybrid models can visually represent object shapes and details, but they depend heavily on prior information and computational resources. An alternative approach uses implicit representations, such as NeRF, which encode both geometry and appearance using continuous functions. Without the need for explicit geometric structures, these models offer compact and efficient 3D representations. Notably, NeRF can still generate high-quality novel views even under sparse viewpoints or incomplete data, making it suitable for challenging real-world scenarios. In recent years, researchers have optimized and extended NeRF along multiple dimensions. Model-based improvements11–15 address issues such as blurry or jagged images in the original NeRF by altering the generation method. For example, Mip-NeRF11,12 introduces conical frustum sampling and integrated positional encoding to achieve a multi-scale representation, improving rendering quality and anti-aliasing; Ref-NeRF13 explicitly models surface normals and reflective lighting, decomposing the radiance field into diffuse reflection, specular highlights, and ambient light components. In terms of accelerating training and rendering, explicit or hybrid methods are used to speed up training16–26. For instance, Instant-NGP18 proposed multi-resolution hash encoding, which combines a spatial hash table with a small MLP to replace the original large MLP; Plenoxels16 optimizes a sparse voxel grid in which each voxel stores the scene density and spherical harmonic coefficients for color representation, and the color and density at a given sampling point are obtained by trilinear interpolation of the coefficients of neighboring voxels. For scene modeling with sparse viewpoints27–31, PixelNeRF27 achieves scene reconstruction from one or several images by learning prior knowledge of the scene; MVSNeRF28 uses a pre-trained convolutional neural network to extract features, maps them into a voxel volume, and generates an implicit encoding for each point via interpolation. For unbounded scene modeling, NeRF++32 divides the scene space into an inner unit sphere and an outer region parameterized by an inverted sphere to better represent large-scale environments.
For outdoor real-photo reconstruction33–38, NeRF-W33 uses appearance encoding for implicit image representation, excluding external interference; HA-NeRF36 introduces an appearance hallucination module and an occlusion-robust rendering module, enabling free-viewpoint rendering and improved appearance synthesis, making NeRF perform better on real-world data; IE-NeRF35 integrates image inpainting and transient mask guidance to further enhance scene reconstruction.
Frequency in neural representations
One of the core challenges in neural rendering and implicit neural representations–such as Neural Radiance Fields (NeRF) and Signed Distance Functions (SDFs)–is efficiently encoding both high-frequency details and low-frequency structures. Neural networks are inherently biased toward learning low-frequency information, while real-world scenes typically involve complex geometries and abrupt color transitions. Using raw spatial coordinates and viewing directions as network inputs typically results in poor representation of high-frequency signals. To address this, the design of frequency encoding plays a critical role in determining the network’s expressive power, training efficiency, and reconstruction quality. NeRF mitigates this issue by employing positional encoding, which maps low-dimensional spatial inputs into a high-dimensional feature space using a series of sinusoidal functions across multiple frequency bands3,39,40. This explicit high-frequency injection alleviates the spectral bias of deep networks and improves their ability to capture fine-grained details. By nonlinearly expanding the distance between nearby coordinates in the feature space, positional encoding enhances the model’s capacity to represent high-gradient regions, such as image edges. However, deterministic sinusoidal encoding is limited by its fixed frequency intervals, which may result in incomplete spectral coverage and artifacts in regions with complex textures. In contrast, Random Fourier feature encoding (RFFE)39 constructs the frequency mapping by sampling from a Gaussian distribution. Compared to deterministic schemes, this stochastic spectral sampling strategy provides more uniform frequency coverage via Monte Carlo sampling, thereby improving the robustness and expressiveness of learned features.
Preliminary
Neural Radiance Fields (NeRF) represent 3D scenes implicitly with multilayer perceptrons (MLPs). The input to the network is a 5D vector, comprising the 3D position $\mathbf{x}$ of a point in space and the 2D viewing direction $\mathbf{d}$. The output is the color $\mathbf{c}$ and volume density $\sigma$ at that point. Specifically, NeRF consists of two MLPs. The first MLP estimates the volume density at the input spatial coordinate $\mathbf{x}$ and produces an implicit vector $\mathbf{h}$. The second MLP takes as input the implicit vector $\mathbf{h}$ and the view direction $\mathbf{d}$ to predict the color values:

$$(\mathbf{c}, \sigma) = F_{\Theta}(\mathbf{x}, \mathbf{d}) \tag{1}$$

where $\Theta$ denotes the learnable parameters of the network, $\mathbf{x} = (x, y, z)$ represents the 3D position, $\mathbf{d} = (\theta, \phi)$ is the view direction, and $\mathbf{c} = (r, g, b)$ is the color of the point. Since the color of a point is not necessarily the same at different viewing angles, the predicted color depends on both the viewing direction $\mathbf{d}$ and the spatial position $\mathbf{x}$. To improve the network’s ability to capture high-frequency details, NeRF applies a positional encoding function that maps the low-dimensional inputs to a high-dimensional space using sinusoidal functions:

$$\gamma(p) = \bigl(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\bigr) \tag{2}$$

where $L$ is the number of frequency bands used in the encoding. For the trained model, NeRF uses volume rendering to synthesize images from new viewpoints. To compute the color of each pixel in the target image, a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is emitted from that pixel, where $\mathbf{o}$ represents the camera center and $\mathbf{d}$ is the direction from the camera center to the pixel. This ray traverses the scene, and the color and density along its path are integrated to yield the final pixel color. Since continuous integration is difficult to compute, NeRF approximates it using discrete sampling:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_{i}\bigl(1 - \exp(-\sigma_{i}\delta_{i})\bigr)\mathbf{c}_{i}, \qquad T_{i} = \exp\!\Bigl(-\sum_{j=1}^{i-1}\sigma_{j}\delta_{j}\Bigr) \tag{3}$$

where $\mathbf{c}_{i}$ and $\sigma_{i}$ are the color and volume density values at the $i$-th sample point along the ray $\mathbf{r}$, and $\delta_{i}$ is the distance between two adjacent sample points on the same ray. The value $\alpha_{i} = 1 - \exp(-\sigma_{i}\delta_{i})$ gives the probability that the ray terminates at this point, and $T_{i}$ is the accumulated transmittance of the ray from the near plane to the current point. To train the MLPs to accurately reconstruct the target scene, NeRF minimizes the mean squared error (MSE) between the predicted and ground-truth pixel colors. Given a randomly sampled batch of rays $R$, the loss function is defined as:

$$\mathcal{L} = \sum_{\mathbf{r} \in R} \bigl\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \bigr\|_{2}^{2} \tag{4}$$

where $C(\mathbf{r})$ is the ground-truth color of the corresponding pixel in the reference image, and $\hat{C}(\mathbf{r})$ is the rendered color predicted by the network.
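To make the discrete rendering of Eq. (3) concrete, the following is a minimal PyTorch sketch of compositing sampled colors and densities along a batch of rays. It is an illustration rather than the implementation used in this paper, and the tensor shapes and variable names are assumptions.

```python
import torch

def composite_ray(colors, sigmas, t_vals):
    """Discrete volume rendering along a batch of rays (Eq. 3).

    colors:  (num_rays, num_samples, 3) RGB predicted at each sample
    sigmas:  (num_rays, num_samples)    volume density at each sample
    t_vals:  (num_rays, num_samples)    distances of samples along each ray
    """
    # delta_i: spacing between adjacent samples; pad the last interval.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # alpha_i = 1 - exp(-sigma_i * delta_i): probability the ray stops at sample i.
    alphas = 1.0 - torch.exp(-sigmas * deltas)

    # T_i: accumulated transmittance up to (but not including) sample i.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = trans * alphas                        # per-sample compositing weights
    rgb = (weights[..., None] * colors).sum(dim=1)  # expected color per ray
    depth = (weights * t_vals).sum(dim=1)           # expected depth per ray
    return rgb, depth
```

The same compositing weights also yield an expected depth per ray, which is the quantity supervised by the depth consistency loss introduced later.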
Method
HIF-NeRF network
For novel view synthesis in UAV-captured scenes, this work leverages Neural Radiance Fields (NeRF) to enable effective scene reconstruction under sparse-view conditions. Figure 1 shows the overall framework we use for training. The scene data is first processed and fed into the network for training. Once trained, novel-view images can be synthesized by inputting a new camera pose into the model.
To fully exploit information from input images, we propose a hybrid image encoder that integrates Convolutional Neural Networks (CNNs) with Transformer architectures, as shown in Fig. 2. This module combines the local feature extraction strengths of CNNs with the long-range dependency modeling capabilities of Transformers, thereby capturing richer and more expressive semantic features for scene representation. The encoder begins with a ResNet-style convolutional backbone that serves as the primary feature extractor. To mitigate boundary artifacts caused by zero-padding, reflection padding is applied before convolution, extending pixel values at image edges in a mirrored fashion. This helps preserve boundary continuity and improves feature quality. The convolutional backbone consists of two residual blocks, each containing two convolutional layers followed by normalization and ReLU activation. Shortcut connections are included to facilitate gradient flow and stabilize training. Spatial downsampling is achieved via strided convolutions, while the number of channels is progressively increased, allowing the network to extract mid- and high-level semantic features with strong representational capacity.
To prepare the feature maps for Transformer processing, adaptive average pooling is applied to resize them to a fixed spatial size. The pooled feature maps are then rearranged into fixed-length sequences, which are fed into a Transformer encoder that captures global contextual relationships across image patches. The Transformer follows a standard architecture composed of multi-head self-attention and feed-forward layers, enabling it to model large-scale structural information. Finally, the Transformer outputs are projected through a fully connected layer to produce a compact, fixed-length embedding vector. This embedding serves as a global semantic representation, which is subsequently used in the NeRF-based scene modeling pipeline.
$$\ell = E_{\Theta_{E}}(I) \tag{5}$$

where $E$ denotes the hybrid image encoder with parameters $\Theta_{E}$, $I$ is an input image, and $\ell$ is the resulting fixed-length image embedding.
Fig. 1.
HIF-NeRF framework: Given a set of input images, a feature extraction module generates image embeddings $\ell$. Based on the camera parameters and viewing directions, sampling is performed in 3D space to obtain spatial sample points $\mathbf{x}$ and ray directions $\mathbf{d}$, which are encoded into high-dimensional features. These are input to the first MLP, which predicts the volume density $\sigma$ at each sampled location. The encoded ray direction $\gamma(\mathbf{d})$ is concatenated with the image embedding and fed into the second MLP to predict the color. During training, a transient-aware visibility map is generated by feeding random encodings $z_{\tau}$ and image coordinates into an MLP. The model is optimized using a composite loss combining color, LPIPS, and depth terms.
Fig. 2.
The hybrid image encoder first applies reflection padding to preserve boundary continuity, then extracts multi-scale features through convolutional and residual blocks with downsampling. The resulting feature maps are adaptively pooled, reshaped into fixed-size sequences, and passed through a Transformer encoder to capture global context. A projection layer outputs a fixed-length embedding used for NeRF-based scene modeling.
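As a concrete illustration of the encoder described above, the sketch below shows how a HybridDroneEncoder-style module could be assembled in PyTorch. The channel widths, pooled size, number of attention heads, and embedding dimension are illustrative assumptions, not the exact values used in the paper.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: reflection padding + two convs, strided shortcut for downsampling."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(in_ch, out_ch, 3, stride=stride),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(out_ch, out_ch, 3),
            nn.BatchNorm2d(out_ch))
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride)  # shortcut for gradient flow
    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class HybridDroneEncoder(nn.Module):
    """CNN backbone + Transformer encoder -> fixed-length image embedding (sketch)."""
    def __init__(self, embed_dim=128, d_model=256, pooled=8):
        super().__init__()
        self.backbone = nn.Sequential(ResBlock(3, 64), ResBlock(64, d_model))
        self.pool = nn.AdaptiveAvgPool2d(pooled)              # fixed spatial size
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=3)
        self.proj = nn.Linear(d_model, embed_dim)             # projection to global embedding
    def forward(self, img):                                   # img: (B, 3, H, W)
        feats = self.pool(self.backbone(img))                 # (B, C, p, p)
        seq = feats.flatten(2).transpose(1, 2)                # (B, p*p, C) token sequence
        tokens = self.transformer(seq)
        return self.proj(tokens.mean(dim=1))                  # (B, embed_dim)
```

In the full pipeline, the resulting embedding is concatenated with the encoded ray direction before the color MLP, as described in the appearance branch below.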
In the radiance field modeling stage, we adopt a geometry–appearance decoupling strategy to separately model scene structure and view-dependent appearance. In the geometry branch, 3D sample coordinates are first encoded by positional encoding, then further enriched with Random Fourier Features. The combined representation is processed by an MLP to output a latent vector $\mathbf{h}$ and the corresponding volume density $\sigma$. This intermediate vector $\mathbf{h}$ is further combined with a random latent code $z_{\tau}$ and input into an auxiliary MLP to predict a transient object mask, which identifies occluded or dynamic regions in the scene. In the appearance branch, the latent vector $\mathbf{h}$ generated by the geometry branch is concatenated with the viewing direction $\mathbf{d}$ and the image embedding $\ell$. The combined features are then fed into a second MLP to predict the RGB color at each sampled point.

$$[\sigma, \mathbf{h}] = \mathrm{MLP}_{\theta_{1}}\bigl(\gamma(\mathbf{x}), \gamma_{\mathrm{RFF}}(\mathbf{x})\bigr) \tag{6}$$

$$\mathbf{c} = \mathrm{MLP}_{\theta_{2}}\bigl(\mathbf{h}, \gamma(\mathbf{d}), \ell\bigr) \tag{7}$$

where $\theta = [\theta_{1}, \theta_{2}]$ represents the learnable parameters of the MLP layers, and $\gamma_{\mathrm{RFF}}(\cdot)$ denotes the vector generated by mapping the low-dimensional input into a high-dimensional space through RFF encoding. Meanwhile, $\ell$ and $z_{\tau}$ correspond to the feature vector extracted from the image by the hybrid CNN encoder and a set of randomly sampled transient embeddings, respectively.
Uncertainty estimation for transients
When capturing scenes using UAVs, dynamic elements such as pedestrians and vehicles may be present. Although these are not the targets of reconstruction, they can interfere with the process and introduce artifacts such as blurring, ghosting, and geometric distortions. To reduce their impact, we introduce an auxiliary MLP that estimates the uncertainty associated with transient content, enabling the model to better focus on static scene structures. This uncertainty is modeled as an implicit continuous function that takes the transient embedding $z_{\tau}$ and the latent vector $\mathbf{h}$ as input, and returns a visibility value $\beta$:

$$\beta = \mathrm{MLP}_{\theta_{3}}\bigl(\mathbf{h}, z_{\tau}\bigr) \tag{8}$$
In this way, the model learns to identify unstable regions in the scene during training and adaptively attenuates their influence during rendering, thereby enhancing the quality and consistency of the final synthesized images.
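The following compact sketch shows one way the geometry branch, appearance branch, and transient-visibility head of Eqs. (6)-(8) could be wired together. The layer widths, the ReLU on the density output, and the sigmoid on the visibility output are assumptions rather than design choices reported in the paper.

```python
import torch
import torch.nn as nn

class HIFRadianceField(nn.Module):
    """Sketch of decoupled geometry/appearance MLPs with a transient head (Eqs. 6-8)."""
    def __init__(self, pos_dim, dir_dim, embed_dim, z_dim, width=256):
        super().__init__()
        # Geometry branch: encoded position -> (density sigma, latent h).
        self.geometry = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        self.sigma_head = nn.Linear(width, 1)
        self.latent_head = nn.Linear(width, width)
        # Appearance branch: (h, encoded direction, image embedding ell) -> RGB.
        self.appearance = nn.Sequential(
            nn.Linear(width + dir_dim + embed_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid())
        # Transient head: (h, random latent z_tau) -> visibility beta in [0, 1].
        self.transient = nn.Sequential(
            nn.Linear(width + z_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 1), nn.Sigmoid())

    def forward(self, x_enc, d_enc, ell, z_tau):
        feat = self.geometry(x_enc)
        sigma = torch.relu(self.sigma_head(feat))               # non-negative density
        h = self.latent_head(feat)
        rgb = self.appearance(torch.cat([h, d_enc, ell], -1))   # Eq. (7)
        beta = self.transient(torch.cat([h, z_tau], -1))        # Eq. (8)
        return rgb, sigma, beta
```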
Feature encoding
To enhance frequency representation under sparse views, we adopt positional encoding as the primary input encoding. Additionally, a set of Random Fourier Features (RFF), sampled from a Gaussian distribution, is injected into the MLP to supplement frequency diversity. This improves the network’s ability to model fine-grained details under sparse-view conditions. Figure 3 shows the visualization of positional encoding and random Fourier feature encoding.
Fig. 3.
Visualization of positional encoding and random Fourier feature encoding. Two-dimensional grid data are generated by uniform sampling in the interval [0, 1] and passed through both encodings to obtain the encoded values.
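A minimal sketch of the two encodings contrasted in Fig. 3, following their standard formulations; the number of frequency bands, the Gaussian scale, and the feature count are illustrative assumptions:

```python
import torch

def positional_encoding(x, num_freqs=10):
    """Deterministic sinusoidal encoding (Eq. 2): fixed, octave-spaced frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype) * torch.pi   # 2^k * pi
    angles = x[..., None] * freqs                                      # (..., dim, num_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten(-2)

def random_fourier_features(x, num_features=128, sigma=10.0, seed=0):
    """RFF encoding: project onto random Gaussian frequencies for denser spectral coverage."""
    gen = torch.Generator().manual_seed(seed)                # keep the projection fixed across calls
    B = torch.randn(x.shape[-1], num_features, generator=gen, dtype=x.dtype) * sigma
    proj = 2.0 * torch.pi * (x @ B)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# Example: encode a batch of 3D sample points.
pts = torch.rand(1024, 3)
pe = positional_encoding(pts)        # (1024, 60)
rff = random_fourier_features(pts)   # (1024, 256)
```

In practice the random projection matrix would be sampled once and stored with the model so that the encoding stays consistent between training and rendering.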
Loss functions and optimization
In order to enable the model to better reconstruct the scene, this paper improves the loss function used during training and constructs a joint multi-term supervision objective. The original NeRF loss3 adopts a hierarchical color reconstruction strategy, in which both coarse and fine networks are supervised by the mean squared error (MSE) between the predicted color along each ray $\mathbf{r}$ and the ground-truth pixel values, thereby constraining the global color distribution of the radiance field. In our approach, the visibility probability $\beta$ of transient objects is integrated into the loss function, along with an additional regularization term to balance this component, thereby enhancing the model’s robustness to dynamic scene elements:

$$\mathcal{L}_{\mathrm{color}} = \sum_{\mathbf{r} \in R} \beta(\mathbf{r})\Bigl(\bigl\|\hat{C}_{c}(\mathbf{r}) - C(\mathbf{r})\bigr\|_{2}^{2} + \bigl\|\hat{C}_{f}(\mathbf{r}) - C(\mathbf{r})\bigr\|_{2}^{2}\Bigr) + \lambda_{\beta}\sum_{\mathbf{r} \in R}\bigl(1 - \beta(\mathbf{r})\bigr)^{2} \tag{9}$$

where $\hat{C}_{c}(\mathbf{r})$ and $\hat{C}_{f}(\mathbf{r})$ are the pixel colors predicted by the coarse and fine networks, respectively, and $C(\mathbf{r})$ represents the corresponding ground-truth pixel color. In addition to color supervision, a depth consistency constraint is introduced. Leveraging volumetric rendering, the model simultaneously generates synthesized images and corresponding depth maps. The depth loss is defined as the L1 distance between the predicted depth and the ground-truth depth obtained from real images. This constraint helps reduce geometric blurring caused by sparse viewpoints and ensures structural consistency between the reconstructed and real-world scenes:

$$\mathcal{L}_{\mathrm{depth}} = \sum_{\mathbf{r} \in R} \bigl|\hat{D}(\mathbf{r}) - D(\mathbf{r})\bigr| \tag{10}$$

where $\hat{D}(\mathbf{r})$ is the depth predicted during rendering, and $D(\mathbf{r})$ is the ground-truth depth. To compensate for the limited ability of the color loss to capture high-frequency textures, and to provide an optimization signal more consistent with human visual perception in 3D scene rendering, this paper incorporates the Learned Perceptual Image Patch Similarity (LPIPS) loss41. LPIPS employs a pre-trained deep neural network to extract multi-level semantic features from both synthesized and ground-truth images, and computes their perceptual similarity by measuring the distance between corresponding feature layers. This guides the optimization to emphasize texture and detail reconstruction, thereby improving the visual realism of rendered images. Specifically, we adopt a pre-trained AlexNet to extract multi-scale feature maps and compute the LPIPS loss as the distance between matched layers:

$$\mathcal{L}_{\mathrm{LPIPS}} = \sum_{l} \frac{1}{H_{l}W_{l}} \sum_{h,w} \bigl\| w_{l} \odot \bigl(\phi_{l}(\hat{I})_{hw} - \phi_{l}(I)_{hw}\bigr) \bigr\|_{2}^{2} \tag{11}$$

where $\hat{I}$ is the predicted image output by the network, $I$ is the ground-truth image, $\phi_{l}$ denotes the feature map extracted at layer $l$ of the pre-trained network, and $w_{l}$ are the learned per-channel weights. Finally, the color reconstruction loss, depth consistency loss, and LPIPS loss are combined with scalar weights to form the joint optimization objective:

$$\mathcal{L} = \lambda_{1}\mathcal{L}_{\mathrm{color}} + \lambda_{2}\mathcal{L}_{\mathrm{depth}} + \lambda_{3}\mathcal{L}_{\mathrm{LPIPS}} \tag{12}$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are weight coefficients corresponding to each loss term, serving to balance their contributions during overall optimization. The weights are empirically selected based on preliminary experiments conducted on a validation set; once selected, they are kept fixed throughout training and are not learned jointly with the model parameters. This joint supervision strategy improves the reconstruction of high-frequency textures and fine details while preserving the geometric integrity of the scene, thereby producing final renderings that better align with human visual perception.
Results
Datasets
In this paper, we used a DJI Mini 3 drone to capture images across various campus environments. A total of three scenes with distinct geometric characteristics were recorded. In addition, we incorporated two public datasets from the OMMO dataset42 collection for training, specifically selecting scenes that feature camera trajectories and viewing angles similar to our own UAV-based captures. OMMO datasets, designed for outdoor 3D reconstruction, provide high-resolution images and calibrated camera poses, making them suitable for benchmarking under sparse-view and complex-geometry conditions. During the data collection process, the UAV maintained a constant altitude of approximately 120 meters and operated under overcast conditions. This setup effectively minimized the impact of lighting fluctuations on image quality while ensuring consistency and stability across different scene acquisitions. The UAV followed a predefined flight trajectory, moving slowly and smoothly to ensure sufficient multi-angle sampling of each region. After collection, we extracted frames from the video and selected images based on clarity, a low overlap rate, and diverse viewpoints, ensuring that all areas in the scene are fully sampled. Additionally, as the original image resolution was high, leading to long training times, we uniformly resized all images to 800 × 450 pixels for efficiency. Next, we used COLMAP43 for sparse reconstruction, extracting the camera poses of each image to facilitate subsequent network training. In the data preprocessing stage, we dynamically adjusted data loading parameters based on the characteristics of different datasets. By designing an adaptive data loader, we efficiently loaded and preprocessed image data along with their corresponding ray information. The data loader was configured to handle image resizing, batch size, and perturbation parameters, while also adjusting key factors such as scaling factors. This approach ensured that data preprocessing maintained both consistency and moderate diversity, providing high-quality input data for the subsequent network training.
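The preprocessing pipeline described above (frame resizing to 800 × 450 followed by COLMAP sparse reconstruction) can be scripted roughly as follows. The directory layout, file naming, and matcher choice are assumptions rather than the authors' exact configuration:

```python
import subprocess
from pathlib import Path
from PIL import Image

def resize_frames(src_dir, dst_dir, size=(800, 450)):
    """Downscale extracted video frames before pose estimation and training."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(src_dir).glob("*.jpg")):
        Image.open(img_path).resize(size, Image.LANCZOS).save(dst / img_path.name)

def run_colmap_sparse(image_dir, workspace):
    """Run COLMAP feature extraction, matching, and mapping to recover camera poses."""
    ws = Path(workspace)
    ws.mkdir(parents=True, exist_ok=True)
    db = ws / "database.db"
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", str(db), "--image_path", image_dir], check=True)
    subprocess.run(["colmap", "exhaustive_matcher",
                    "--database_path", str(db)], check=True)
    (ws / "sparse").mkdir(exist_ok=True)
    subprocess.run(["colmap", "mapper", "--database_path", str(db),
                    "--image_path", image_dir, "--output_path", str(ws / "sparse")], check=True)
```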
Implementation details
The proposed network architecture integrates hybrid image encoding with Neural Radiance Field (NeRF) modeling for high-quality novel view synthesis. Specifically, during training, the position $\mathbf{x}$ and the view direction $\mathbf{d} = (\theta, \phi)$ are independently encoded using Random Fourier Features (RFF), which map them from their original low-dimensional spaces into high-dimensional representations. The encoded features are then passed through fully connected layers for training. The scene reconstruction module consists of eight fully connected layers with ReLU activations. After the eighth layer, the network outputs the predicted volume density $\sigma$ and an intermediate feature vector. The intermediate vector, together with the image feature encoding and a random latent code, is passed through an additional four fully connected layers with ReLU activations. Finally, a sigmoid activation function is applied to predict the RGB color components of the synthesized image. The hybrid image encoder starts with a ResNet-style convolutional backbone as the feature extractor, performing convolution operations after reflection padding. The backbone includes two residual blocks, each consisting of two convolutional layers with normalization and ReLU activations, along with shortcut connections. Adaptive average pooling is then applied to resize the feature maps to a fixed size, making them suitable for conversion into sequential input. These feature maps are rearranged into patches of the same size and fed into a three-layer Transformer encoder. Ultimately, the output features from the Transformer are compressed through a fully connected layer into a fixed-length, low-dimensional embedding vector, which serves as the global semantic representation of the image for subsequent NeRF-based scene modeling. All experiments are conducted on two NVIDIA RTX 3090 GPUs. Training a single scene takes approximately 16 hours.
Quality assessment metrics
To evaluate the performance of different models in image reconstruction tasks, we employed several evaluation metrics. Peak Signal-to-Noise Ratio (PSNR) is one of the most widely used image quality assessment metrics. PSNR quantifies pixel-level differences between the reconstructed image and the ground truth by calculating the Mean Squared Error (MSE) and converting it to a logarithmic scale. A higher PSNR value indicates greater similarity between the reconstructed and original images, as well as less loss in image quality:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}_{I}^{2}}{\mathrm{MSE}}\right) \tag{13}$$

where $\mathrm{MAX}_{I}$ is the maximum possible pixel value. The Structural Similarity Index (SSIM) assesses the similarity between two images by comparing their luminance, contrast, and structural components. Specifically, SSIM compares luminance by calculating the mean intensity within local windows, evaluates contrast based on standard deviation, and analyzes structural similarity through covariance:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_{x}\mu_{y} + c_{1})(2\sigma_{xy} + c_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + c_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + c_{2})} \tag{14}$$

where $\mu$, $\sigma$, and $\sigma_{xy}$ denote the local means, standard deviations, and covariance, and $c_{1}$, $c_{2}$ are stabilizing constants. The Learned Perceptual Image Patch Similarity (LPIPS) metric uses deep learning to assess perceptual similarity from a human visual perspective. The two images are fed into a pre-trained AlexNet model to extract deep features from multiple layers, and the perceptual distance is computed as the weighted L2 distance between these feature representations:

$$\mathrm{LPIPS}(\hat{I}, I) = \sum_{l} \frac{1}{H_{l}W_{l}} \sum_{h,w} \bigl\| w_{l} \odot \bigl(\phi_{l}(\hat{I})_{hw} - \phi_{l}(I)_{hw}\bigr) \bigr\|_{2}^{2} \tag{15}$$
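For reference, these three metrics can be computed with commonly used libraries (scikit-image for PSNR and SSIM, the lpips package for perceptual distance); this sketch mirrors the evaluation protocol only approximately:

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_alex = lpips.LPIPS(net='alex')

def evaluate(pred, gt):
    """pred, gt: float numpy arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # lpips expects NCHW tensors scaled to [-1, 1]
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_alex(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```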
Comparisons
We evaluate our proposed method against several NeRF-based models designed for in-the-wild scenes, including NeRF3, NeRF-W33, and HA-NeRF36. We conducted experiments using all three models, training each network with the same set of preprocessed input images and corresponding camera poses. Due to hardware limitations, all input images were downsampled to half their original resolution prior to training. Specifically, NeRF, as the earliest and most classical neural radiance field model, learns the color and density at each point in the scene through a multi-layer perceptron (MLP) and generates novel-view images via volume rendering. This model performs well in uniformly lit, densely sampled static scenes. However, its performance tends to degrade when the input views are sparse, the scene is large-scale, or dynamic objects are present during capture. NeRF-W extends NeRF by introducing latent embeddings for both scene-level appearance and transient components, along with an uncertainty-aware rendering mechanism. These enhancements allow it to disentangle static and dynamic elements and better accommodate lighting changes and transient objects. HA-NeRF further builds upon NeRF-W by enabling the generation of novel background appearances from unobserved views. It incorporates image-aware 2D visibility prediction to reduce inconsistencies caused by dynamic content and supports free-viewpoint rendering of scenes with unseen appearance variations.
We evaluated our method by rendering images from the same camera poses as the input and comparing them with the corresponding ground truth images. As shown in Table 1, we report performance metrics including PSNR, SSIM, and LPIPS. Compared to the baselines, our method achieved better performance. Figure 4 shows visual comparisons between rendered static scenes and ground truth after training each model. Figure 5 shows the depth maps generated by each of the trained models. Figure 6 illustrates rendered outputs from each model for a fixed scene with known camera parameters.
Table 1.
Comparison of PSNR, SSIM, and LPIPS for HIF-NeRF and other models: NeRF, NeRF-W, Ha-NeRF on datasets across specific scenes (Playground, Grass, House, Sydney, and Court).
| Method | Playground PSNR | Playground SSIM | Playground LPIPS | Grass PSNR | Grass SSIM | Grass LPIPS | House PSNR | House SSIM | House LPIPS | Sydney PSNR | Sydney SSIM | Sydney LPIPS | Court PSNR | Court SSIM | Court LPIPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeRF | 20.18 | 0.62 | 0.35 | 19.42 | 0.54 | 0.43 | 21.17 | 0.50 | 0.44 | 21.10 | 0.72 | 0.37 | 22.36 | 0.72 | 0.28 |
| NeRF-W | 23.26 | 0.68 | 0.34 | 20.94 | 0.57 | 0.45 | 19.56 | 0.47 | 0.51 | 21.43 | 0.72 | 0.29 | 21.83 | 0.75 | 0.21 |
| HA-NeRF | 18.96 | 0.66 | 0.38 | 18.68 | 0.67 | 0.28 | 19.19 | 0.45 | 0.51 | 18.86 | 0.67 | 0.28 | 21.67 | 0.76 | 0.17 |
| Ours | 25.61 | 0.85 | 0.21 | 25.32 | 0.73 | 0.16 | 22.92 | 0.69 | 0.26 | 21.92 | 0.74 | 0.25 | 24.40 | 0.83 | 0.12 |
Fig. 4.
Visual comparison results on five representative scenes from the datasets indicate that HIF-NeRF tends to produce more consistent 3D scene geometry and finer structural details compared to existing baseline methods. While differences may vary across scenes, the results suggest that our method provides improved geometric coherence and visual quality in most cases.
Fig. 5.
Samples of predicted static depth images of our HIF-NeRF and other baselines on the scenes of the datasets.
Fig. 6.
By reconstructing scenes from novel viewpoints not included in the training data, visualization results indicate that HIF-NeRF can generate images of good quality from unseen views. Compared with other methods, this approach also demonstrates a certain level of reliability in these scenarios.
As shown in the results, NeRF’s rendering quality declines when the input images cover wide scenes with sparse viewpoints, limited or inconsistent edge information, or when moving objects are present. It fails to reconstruct regions with weak textures, such as the black lines along roof joints in the playground and the edge of the house in the field scenes. In high-gradient scenes such as sports fields, central areas may be reconstructed well, but artifacts and blurring frequently appear near the edges. For complex facades, the reconstructions appear blurry, capturing only edges with strong gradient variations. In the tower dataset, where textures are dense and gradients are weak, the generated textures often blend together. NeRF-W improves upon NeRF by introducing additional scene modeling capabilities. It can handle regions with limited information, better capture complex architectural textures, and separate static backgrounds from dynamic elements. However, in scenes with numerous objects, it may produce artifacts, and smaller objects are sometimes misclassified as dynamic or poorly reconstructed. For instance, in the playground dataset, a small patch of sand near the lights is rendered as white fog, indicating a failure in reconstruction. Blurring is also observed near image boundaries, particularly around distant buildings, where both blurring and artifacts are evident. HA-NeRF tends to generate brighter images than the other models, showing better reconstruction in central regions but degraded performance toward the edges, where textures appear blurry or indistinct. The color contrast between adjacent areas is also more pronounced. This may be due to the limited dataset size compared to HA-NeRF’s original training set (tourist photos), potentially causing overfitting and overly bright outputs. Another reason could be that its appearance sampling is progressively reduced during training, leading to better reconstruction in the center but poorer performance at the periphery. However, in regions with strong gradient changes, HA-NeRF still performs noticeably well.
Compared to other methods, our method performs adequately in these scenarios. Although some blurring occurs in regions lacking sufficient façade or edge information, the overall textures are preserved, with no excessive blending or noticeable ghosting. These results demonstrate the practical feasibility of our method. In detailed views, the textures of houses near the image boundaries in the grassland dataset remain visible; though not perfectly sharp, they are still distinguishable. In high-gradient regions of the stadium dataset, reconstruction is effective, and distant stadiums in the house dataset are also acceptably reconstructed. In the depth maps, NeRF performs well for regions close to the image plane but becomes blurry at greater distances. NeRF-W shows slight improvements in representing distant building edges. However, HA-NeRF produces poor and inconsistent depth results at image boundaries. In contrast, our method generates relatively sharp depth edges. Even for distant regions, parts of the faraway buildings are reconstructed. Although the corresponding images may appear slightly blurred, the depth maps still capture meaningful structural information.
Ablation studies
We conducted ablation experiments on the added modules using the same dataset, including the following cases: (1) HIF-NeRF (C) uses only CNN-based image encoding to generate images; (2) HIF-NeRF (C+T) combines CNN image encoding with transient features; (3) HIF-NeRF (C+F) generates images using hybrid image encoding; (4) HIF-NeRF (Ours) represents our proposed method. As shown in Table 2 and Fig. 7, in configuration (1), the central regions of the generated images are reasonably well reconstructed, but the reconstruction quality at the edges is poor. Due to the lack of randomness, the generated images appear blurry overall. In configuration (2), adding partial transient features to the CNN-based encoding improves the results to some extent, but blurring at the edges still remains. In configuration (3), the introduction of a Transformer enriches the image features, leading to clearer image generation. Configuration (4), our full method, produces the best reconstruction quality among all settings. The full model introduces a moderate increase in training time due to the added modules.
Table 2.
Quantitative results of the ablation study comparing different model variants. PSNR and SSIM measure reconstruction accuracy, while LPIPS assesses perceptual similarity.
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| HIF-NeRF (C) | 22.35 | 0.65 | 0.33 |
| HIF-NeRF (C+T) | 23.02 | 0.69 | 0.34 |
| HIF-NeRF (C+F) | 24.30 | 0.66 | 0.29 |
| HIF-NeRF (Ours) | 25.61 | 0.85 | 0.21 |
Fig. 7.
Visualization results of the ablation study, illustrating the effects of removing each component on reconstruction quality.
Conclusion
With the widespread adoption of NeRF in various applications and the increasing complexity and scale of target scenes, achieving high-quality 3D reconstruction under sparse data conditions has become a critical challenge. In this work, we integrate CNNs and Transformers for image feature extraction to assist NeRF in scene modeling, and extend the framework with an MLP to predict visibility, effectively separating static scenes from dynamic objects. Both qualitative and quantitative experimental results on our dataset demonstrate the effectiveness of our approach in reconstructing scenes from drone-captured images. However, there are still several important limitations. First, the model’s performance degrades in regions with sparse textures or repetitive low-frequency patterns. Increasing the number of input images or using higher-resolution images can partially mitigate this issue, but also leads to a significant increase in training time due to the larger data volume. Second, the model is affected by the content of the scene. When input images are captured from widely varying viewpoints, the reduced overlap in visual information can lead to degraded image synthesis quality under novel views. Third, our current image acquisition strategy involves capturing images along a circular trajectory around the target area. Although this ensures consistent multi-view coverage, it cannot effectively handle images captured from horizontal or lateral viewpoints.
In future work, we aim to address these challenges by incorporating additional prior knowledge, leveraging semantic or geometric constraints, exploring more diverse flight patterns, and adopting scene decomposition strategies for targeted reconstruction. These directions may further improve NeRF’s robustness and generalization capabilities, particularly in sparse-view and large-scale aerial scenarios.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant 62061002.
Author contributions
Conceptualization, Z.C. and X.C.; methodology, Z.C.; software, Z.C.; validation, Z.C. and C.Y.; formal analysis, Z.C.; investigation, Z.C., C.Y., S.W. and X.W.; data curation, Z.C., C.Y.and S.W.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C., C.Y. and S.W.; visualization, Z.C. and C.Y.; supervision, Z.C.; funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.
Data availability
The raw data supporting the findings of this study include: (1) the Sydney and Court scenes, obtained from the outdoor multi-modal dataset (OMMO), publicly available at DOI: https://doi.org/10.1109/ICCV51070.2023.00695; and (2) additional self-collected datasets, including the Playground, Grass, and House scenes, captured and annotated by the authors, which are available from the first author upon reasonable request via email: zehomchan@gmail.com.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Özyesil, O., Voroninski, V., Basri, R. & Singer, A. A survey of structure from motion. Acta Numer. 26, 305–364, 10.1017/S096249291700006X (2017).
- 2.Shum, H. & Kang, S. B. Review of image-based rendering techniques. In Visual Communications and Image Processing 2000, vol. 4067 of Proceedings of SPIE, 2–13 (SPIE, Perth, Australia, 2000).
- 3.Mildenhall, B. et al. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 405–421, 10.1007/978-3-030-58452-8_24 (Springer, Virtual Conference, 2020).
- 4.Zhou, T., Tucker, R., Flynn, J., Fyffe, G. & Snavely, N. Stereo magnification: Learning view synthesis using multiplane images. CoRRabs/1805.09817 (2018).
- 5.Habtegebrial, T., Gava, C., Rogge, M., Stricker, D. & Jampani, V. Somsi: Spherical novel view synthesis with soft occlusion multi-sphere images. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15704–15713 (New Orleans, LA, USA, 2022).
- 6.Xie, Y. et al. Neural fields in visual computing and beyond. Comput. Graph. Forum 41, 641–676 (2022).
- 7.Tewari, A. et al. Advances in neural rendering. Comput. Graph. Forum 41, 703–735 (2022).
- 8.Jack, D. et al. Learning free-form deformations for 3d object reconstruction. In Computer Vision - ACCV 2018, 317–333, 10.1007/978-3-030-20890-5_21 (Springer, Perth, Australia, 2019).
- 9.Dai, P., Zhang, Y., Li, Z., Liu, S. & Zeng, B. Neural point cloud rendering via multi-plane projection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7827–7836, 10.1109/CVPR42600.2020.00785 (Seattle, WA, USA, 2020).
- 10.Park, J. J., Florence, P., Straub, J., Newcombe, R. & Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 165–174, 10.1109/CVPR.2019.00025 (Long Beach, CA, USA, 2019).
- 11.Barron, J. T. et al. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 5835–5844, 10.1109/ICCV48922.2021.00580 (Montreal, QC, Canada, 2021).
- 12.Barron, J. T., Mildenhall, B., Verbin, D., Srinivasan, P. P. & Hedman, P. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5460–5469, 10.1109/CVPR52688.2022.00539 (New Orleans, LA, USA, 2022).
- 13.Verbin, D. et al. Ref-nerf: Structured view-dependent appearance for neural radiance fields. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5481–5490, 10.1109/CVPR52688.2022.00541 (New Orleans, LA, USA, 2022).
- 14.Wang, Z. et al. Neref: Neural refractive field for fluid surface reconstruction and rendering. In 2023 IEEE International Conference on Computational Photography (ICCP), 1–11, 10.1109/ICCP56744.2023.10233838 (Madison, WI, USA, 2023).
- 15.Fujitomi, T. et al. Lb-nerf: Light bending neural radiance fields for transparent medium. In 2022 IEEE International Conference on Image Processing (ICIP), 2142–2146, 10.1109/ICIP46576.2022.9897642 (Bordeaux, France, 2022).
- 16.Fridovich-Keil, S. et al. Plenoxels: Radiance fields without neural networks. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5491–5500, 10.1109/CVPR52688.2022.00542 (New Orleans, LA, USA, 2022).
- 17.Reiser, C., Peng, S., Liao, Y. & Geiger, A. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 14315–14325, 10.1109/ICCV48922.2021.01407 (Montreal, QC, Canada, 2021).
- 18.Müller, T., Evans, A., Schied, C. & Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41, 102:1–102:15, 10.1145/3528223.3530127 (2022).
- 19.Yu, A. et al. Plenoctrees for real-time rendering of neural radiance fields. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 5732–5741, 10.1109/ICCV48922.2021.00570 (Montreal, QC, Canada, 2021).
- 20.Garbin, S. J., Kowalski, M., Johnson, M., Shotton, J. & Valentin, J. Fastnerf: High-fidelity neural rendering at 200fps. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 14326–14335, 10.1109/ICCV48922.2021.01408 (Montreal, QC, Canada, 2021).
- 21.Deng, K., Liu, A., Zhu, J.-Y. & Ramanan, D. Depth-supervised nerf: Fewer views and faster training for free. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12872–12881, 10.1109/CVPR52688.2022.01254 (New Orleans, LA, USA, 2022).
- 22.Tewari, A. et al. Advances in neural rendering. Comput. Graph. Forum 41, 703–735 (2022).
- 23.Chan, E. R. et al. Efficient geometry-aware 3d generative adversarial networks. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16102–16112, 10.1109/CVPR52688.2022.01565 (New Orleans, LA, USA, 2022).
- 24.Hedman, P., Srinivasan, P. P., Mildenhall, B., Barron, J. T. & Debevec, P. Baking neural radiance fields for real-time view synthesis. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 5855–5864, 10.1109/ICCV48922.2021.00582 (Montreal, QC, Canada, 2021).
- 25.Fridovich-Keil, S., Meanti, G., Warburg, F. R., Recht, B. & Kanazawa, A. K-planes: Explicit radiance fields in space, time, and appearance. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12479–12488, 10.1109/CVPR52729.2023.01201 (2023).
- 26.Hu, T., Liu, S., Chen, Y., Shen, T. & Jia, J. Efficientnerf - efficient neural radiance fields. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12892–12901, 10.1109/CVPR52688.2022.01256 (New Orleans, LA, USA, 2022).
- 27.Yu, A., Ye, V., Tancik, M. & Kanazawa, A. pixelnerf: Neural radiance fields from one or few images. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4576–4585, 10.1109/CVPR46437.2021.00455 (virtual, 2021).
- 28.Chen, A. et al. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 14104–14113, 10.1109/ICCV48922.2021.01386 (Montreal, QC, Canada, 2021).
- 29.Xu, D. et al. Sinnerf: Training neural radiance fields on complex scenes from a single image. In Computer Vision - ECCV 2022 - 17th European Conference, 736–753, 10.1007/978-3-031-20047-2_42 (Springer Nature Switzerland, Tel Aviv, Israel, 2022).
- 30.Yen-Chen, L. et al. inerf: Inverting neural radiance fields for pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1323–1330, 10.1109/IROS51168.2021.9636708 (Prague, Czech Republic, 2021).
- 31.Wang, Z., Wu, S., Xie, W., Chen, M. & Prisacariu, V. A. Nerf-: Neural radiance fields without known camera parameters. arXiv abs/2102.07064 (2021).
- 32.Zhang, K., Riegler, G., Snavely, N. & Koltun, V. Nerf++: Analyzing and improving neural radiance fields. arXiv abs/2010.07492 (2020).
- 33.Martin-Brualla, R. et al. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7210–7219, 10.1109/CVPR46437.2021.00713 (virtual, 2021).
- 34.Mildenhall, B., Hedman, P., Martin-Brualla, R., Srinivasan, P. P. & Barron, J. T. Nerf in the dark: High dynamic range view synthesis from noisy raw images. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16169–16178, 10.1109/CVPR52688.2022.01571 (New Orleans, LA, USA, 2022).
- 35.Wang, S., Xu, H., Li, Y., Chen, J. & Tan, G. Ie-nerf: Inpainting enhanced neural radiance fields in the wild. arXiv:2407.10695 (2024).
- 36.Chen, X. et al. Hallucinated neural radiance fields in the wild. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12933–12942, 10.1109/CVPR52688.2022.01260 (IEEE, New Orleans, LA, USA, 2022).
- 37.Li, L. et al. Unmanned aerial vehicle-neural radiance field (uav-nerf): Learning multiview drone three-dimensional reconstruction with neural radiance field. Remote Sensing 16, 4168 (2024).
- 38.Marí, R., Facciolo, G. & Ehret, T. Sat-nerf: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using rpc cameras. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1310–1320, 10.1109/CVPRW56347.2022.00137 (New Orleans, LA, USA, 2022).
- 39.Tancik, M. et al. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems, vol. 33, 7537–7547 (Curran Associates, Inc., Virtual Conference, 2020).
- 40.Ramasinghe, S. & Lucey, S. Beyond periodicity: Towards a unifying framework for activations in coordinate-mlps. In Computer Vision - ECCV 2022 - 17th European Conference, vol. 13693 of Lecture Notes in Computer Science, 142–158 (Springer, Tel Aviv, Israel, 2022).
- 41.Snell, J. et al. Learning to generate images with perceptual similarity metrics. In 2017 IEEE International Conference on Image Processing (ICIP), 4277–4281, 10.1109/ICIP.2017.8297089 (Beijing, China, 2017).
- 42.Lu, C. et al. A large-scale outdoor multi-modal dataset and benchmark for novel view synthesis and implicit scene reconstruction. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 7523–7533, 10.1109/ICCV51070.2023.00695 (Paris, France, 2023).
- 43.Schönberger, J. L. & Frahm, J.-M. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4104–4113, 10.1109/CVPR.2016.445 (Las Vegas, NV, USA, 2016).