Scientific Reports. 2026 Feb 28;16:9579. doi: 10.1038/s41598-025-25974-6

Image prediction algorithm for foggy road scenes based on improved transformer

Bo-Tao Zhang 1, Ai-Ying Zhao 1, Pei Xiong 2
PMCID: PMC13009503  PMID: 41764260

Abstract

In severe foggy weather, the visibility of the driving environment is extremely low, which seriously affects the driver's vision and safety. To address the challenges of manual driving in severe fog, this paper proposes a foggy road scene image prediction algorithm based on an improved Transformer, with the aim of enhancing the visual perception and prediction capabilities of autonomous driving systems under adverse weather conditions. Leveraging the long-range dependency modeling capability of the Transformer, we adopt a Transformer improved with Taylor-expanded multi-head self-attention, in which a Taylor series expansion of the softmax function significantly reduces computational cost. Additionally, a multi-branch architecture with multi-scale patch embedding is introduced, which embeds features through overlapping deformable convolutions at different scales. These improvements enable the algorithm to achieve good image prediction results with relatively low computational requirements. The proposed method was tested on three custom haze road scene image datasets, with experimental results showing a PSNR of 12.9836 and an SSIM of 0.6278. The results indicate that our method can effectively predict clear images under hazy weather, improving visibility in haze conditions. This addresses serious driving safety issues during heavy haze and contributes to the development of autonomous driving technology.

Keywords: Transformer, Foggy road scenes, Image prediction, Multi-head self-attention, Autonomous driving

Subject terms: Energy science and technology, Engineering, Mathematics and computing

Introduction

Fog significantly degrades the performance of outdoor vision tasks such as object detection1–3, object tracking4, and object segmentation5,6, greatly reducing the effectiveness of these algorithms. It is therefore essential to preprocess fog-degraded images to restore features such as color, detail, and texture, and to predict clear, fog-free images. Improving the quality of foggy images has become an urgent problem, and researchers worldwide have conducted extensive studies on foggy image prediction with numerous fruitful results. The field has evolved through several stages. Traditional prior-based methods (1987–2015) centered on the atmospheric scattering model (ASM), including the dark channel prior (DCP) and the color attenuation prior (CAP); these methods rely on handcrafted priors but suffer from parameter estimation errors in complex scenes. CNN-driven parameter estimation (2016–2018), with methods such as DehazeNet and AOD-Net, used CNNs to optimize ASM parameters, improving accuracy but remaining dependent on physical model assumptions. End-to-end CNN/Transformer fusion (2019–2021) followed: GridDehazeNet introduced multi-scale attention, while SwinIR combined convolution with the Transformer to balance local details and global context; however, CNNs still limited the capture of long-range dependencies in foggy images. Recent Transformer-specialized dehazing (2022–2024) includes DehazeViT (2022), the first work to apply the Vision Transformer to single-image dehazing by optimizing patch embedding for fog-specific features, but with high computational complexity (12.5 M parameters); TransHazeNet (2023), which introduced cross-scale attention to fuse global context and local details but ignored the heterogeneity of fog density in road scenes; and HazeFormerV2 (2024), which integrated physical priors into Transformer blocks to reduce color distortion but required large-scale labeled datasets (≥ 5 k images) for training. In summary, most recent Transformer-based methods either have high computational costs or adapt poorly to diverse road fog scenarios. This highlights the research gap our work addresses: designing a lightweight Transformer with strong adaptability to road fog.

In 1987, the atmospheric scattering model (ASM)7 was proposed to explain the imaging process of foggy images. Sunlight scatters when it hits particulate matter suspended in the air, causing the light reaching objects to attenuate and resulting in blurry images captured by imaging devices. The ASM aims to estimate the global atmospheric light $A$ and the transmission map $t(x)$ using prior knowledge, and to substitute them into the ASM formula, as shown in Eq. (1), to obtain a fog-free image.

$$I(x) = J(x)\,t(x) + A\bigl(1 - t(x)\bigr) \tag{1}$$

where $I(x)$ and $J(x)$ represent the foggy image and the dehazed image, respectively, $t(x)$ is the atmospheric transmission, and $A$ is the global atmospheric light. The transmission $t(x)$ is given by:

$$t(x) = e^{-\beta d(x)} \tag{2}$$

where $\beta$ is the atmospheric scattering coefficient and $d(x)$ is the scene depth. By estimating these parameters, a clear image $J(x)$ can be recovered from the foggy image $I(x)$.
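As a concrete illustration of Eqs. (1) and (2), the following minimal NumPy sketch synthesizes a hazy image from a clear image and a depth map (the same procedure later used to build the synthetic dataset); the variable names and parameter values are illustrative assumptions, not taken from the authors' code.

```python
# Minimal sketch of fog synthesis with the atmospheric scattering model.
import numpy as np

def add_fog(J, depth, beta=0.01, A=0.9):
    """J: clear image in [0, 1], shape (H, W, 3); depth: scene depth in metres, shape (H, W)."""
    t = np.exp(-beta * depth)              # Eq. (2): t(x) = exp(-beta * d(x))
    t = t[..., None]                       # broadcast transmission over the colour channels
    I = J * t + A * (1.0 - t)              # Eq. (1): I(x) = J(x) t(x) + A (1 - t(x))
    return np.clip(I, 0.0, 1.0)

# Example: a random 256 x 256 "clear" image with depth growing towards the top of the frame.
J = np.random.rand(256, 256, 3)
depth = np.linspace(5.0, 300.0, 256)[:, None].repeat(256, axis=1)
hazy = add_fog(J, depth, beta=0.01, A=0.9)
```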

Foggy image prediction algorithms can be broadly classified into three categories: image enhancement, image restoration, and deep learning-based methods, each building on the previous to address its limitations. Image enhancement-based algorithms aim to improve the visual effect of images by denoising and enhancing color contrast to restore clear, fog-free images; representative methods include histogram equalization8,9 and the Retinex algorithm10. Although widely applicable, these algorithms do not consider the causes of image degradation, which may result in information loss or over-enhancement during processing, so their use has become less common. To address the shortcomings of enhancement techniques, image restoration methods emerged, leveraging the ASM to explicitly model fog formation. A landmark approach is He et al.'s dark channel prior (DCP)11, which posits that in haze-free images most local patches contain pixels with near-zero intensity in at least one color channel. By using this prior to estimate the transmission $t(x)$, DCP achieves notable dehazing results. Other notable algorithms include Fattal's physics-based method12 and Zhu et al.'s color attenuation prior (CAP)13, which links haze density to color intensity differences. These algorithms consider the causes of image degradation and achieve better prediction results than enhancement-based methods. However, they depend on unknown parameters of the ASM, and cumulative errors in parameter estimation degrade prediction performance.

With the continuous development of deep learning, convolutional neural networks (CNNs) have achieved great success, and researchers have improved ASM-based foggy image prediction algorithms by using CNNs for parameter estimation. Compared to traditional algorithms, neural networks improve the accuracy of parameter estimation and achieve more significant prediction effects. A typical method is Cai et al.'s use of CNNs to learn $t(x)$ in the ASM for image prediction14. Representative algorithms include AOD-Net (all-in-one network for dehazing)15 and DCPDN (densely connected pyramid dehazing network)16. Although these methods improve the accuracy of parameter estimation, they still rely on the ASM and suffer from parameter estimation biases. Therefore, researchers proposed constructing end-to-end mapping networks that avoid errors caused by parameter estimation. End-to-end networks such as GridDehazeNet17 introduced a multi-scale attention mechanism, a technique that aggregates features from different spatial scales to capture both local details and global context, while FFA-Net18 used feature attention to prioritize discriminative information. Other foggy image prediction algorithms, such as MSBDN (multi-scale boosted dehazing network)19 and AECR-Net (autoencoder and contrastive regularization network)20, have also achieved excellent prediction results. However, CNN-based algorithms mainly focus on local features, and the receptive field of the underlying network is relatively small, limiting the ability to capture global features and long-range dependencies.

To address the global context limitations of CNNs, researchers turned to generative adversarial networks (GANs) and Transformer architectures. GANs, inspired by game theory, train a generator to produce realistic fog-free images and a discriminator to distinguish real from fake outputs, reducing the gap between synthetic and real data. Zheng et al. proposed a hybrid sample image prediction algorithm based on latent space transformation, utilizing a variational autoencoder and a GAN to encode hybrid samples into the latent space. This approach trains the network with hybrid samples, reducing the domain gap between the synthetic and real domains and producing more realistic and informative prediction results. Huang et al. constructed a multi-scale global feature extraction network and a local feature extraction network based on a GAN to extract features at different scales and obtain strong semantic information; they used a feature fusion network to fully integrate the features and improved the image prediction performance of the model through multi-loss supervision. Despite the great potential of GANs in the field of image prediction, training GAN-based models is challenging, with numerous hyperparameters and complex configurations, making it difficult to ensure training stability. In 2017, Vaswani et al. proposed the Transformer network based on the self-attention mechanism, which achieved groundbreaking progress in natural language processing due to its ability to handle long-range dependencies. The Transformer converts the input into a sequence and adds positional information to each element through positional encoding, enabling the model to handle the order relationships within the sequence. As the core component of the Transformer, the self-attention mechanism linearly transforms the input sequence to obtain Query, Key, and Value representations, and computes the relationships and weights between each element and all others to acquire contextual information. The multi-head self-attention mechanism captures features at different levels by focusing on different positions and relationships, further enhancing the model's expressive capability. After the self-attention mechanism, the features are fed into a feedforward network that introduces nonlinear transformations and further feature extraction. The decoder then generates the output sequence based on this contextual information.
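To make the self-attention mechanism described above concrete, the sketch below implements single-head scaled dot-product self-attention in PyTorch. It is the standard O(n²) formulation (the baseline that the Taylor-expanded variant in the Method section approximates), not the authors' implementation; the projection sizes are illustrative.

```python
# Minimal sketch of standard scaled dot-product self-attention (single head for brevity).
import math
import torch

def self_attention(x, Wq, Wk, Wv):
    """x: (batch, n, d) patch embeddings; Wq/Wk/Wv: (d, d) projection matrices."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])  # (batch, n, n): quadratic in n
    weights = torch.softmax(scores, dim=-1)                     # attention over all positions
    return weights @ V                                          # context-aggregated features

x = torch.randn(2, 196, 64)                                     # 2 images, 196 patches, 64-d tokens
Wq, Wk, Wv = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)                             # (2, 196, 64)
```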

Method

According to the atmospheric scattering model, the process of image prediction is an ill-posed problem with under-constrained conditions. To address this issue, traditional methods typically rely on handcrafted features for image prediction. He et al. observed and analyzed a large number of outdoor clear images and found that in most local image patches that do not include the sky, some pixels have intensities close to zero in at least one color channel. Based on this observation, He et al. proposed the classic dark channel prior and used it to estimate the unknowns in the atmospheric scattering model. This prior significantly improved the performance of image prediction algorithms. However, as pointed out in the DCP work itself, the prior does not hold in image regions that include the sky, leading to issues such as halos and color distortions in the predicted images. Zhu et al. proposed the color attenuation prior, which builds a linear model of scene depth for hazy scenes and uses supervised learning to estimate its parameters; the predicted image is then restored using the scene depth and the atmospheric scattering model. However, a simple linear model is insufficient to describe complex real-world hazy scenes, making this method less effective for scenes with heavy haze. Berman et al. proposed a non-local prior, which assumes that the colors of a clean image can be approximated by a few hundred colors forming clusters in RGB space. In hazy images, these color clusters become distinct haze lines, which can be used for image prediction. These methods have significantly advanced the development of image prediction technology. However, handcrafted priors cannot capture enough statistical information about images, leading to limited prediction performance in many cases.

Due to the powerful feature representation and learning capabilities of deep networks, deep learning-based image prediction methods have gained widespread attention in recent years. Cai et al. proposed DehazeNet, and Ren et al. proposed Multi-Scale Convolutional Neural Networks (MSCNN), which were among the first to use neural networks to estimate the transmission map and restore the predicted image using the atmospheric scattering model. Liu et al. proposed GridDehazeNet to learn the mapping between hazy and clear images. This network uses a series of upsampling and downsampling operations to learn multi-scale features of hazy images and introduces an attention mechanism for feature fusion. Qin et al. also introduced an attention mechanism in their Feature Fusion Attention Network (FFA-Net) and proposed a feature attention module that includes channel attention and spatial attention. AECR-Net applied contrastive loss to the image prediction task, using reference images as positive samples and hazy images as negative samples. This method aims to make the restored results close to the positive samples and far from the negative samples in the feature space. Yu et al. proposed a frequency-spatial dual-guided prediction network to explore and extract haze-related features in the frequency and spatial domains. Due to domain shift, models trained on synthetic datasets do not generalize well to real hazy images. To address this issue, Shao et al. proposed a domain adaptation framework for image prediction. Additionally, some studies have attempted to enhance the generalization of prediction models on real hazy images through unsupervised or semi-supervised methods. For example, Li et al. proposed an unsupervised prediction method (You Only Look Yourself, YOLY), which effectively alleviates the domain shift problem by not requiring paired hazy and clean images for training the deep model.

The Transformer was initially proposed by Vaswani et al. and applied in the field of natural language processing. Unlike convolution operations, which struggle to capture global information in images, the Transformer can model dependencies between features through global computation. Chen et al. introduced the Transformer into image enhancement tasks and proposed an image processing Transformer. This model uses a multi-head and multi-tail structure to handle various image enhancement tasks. With large-scale pre-training, it achieved performance far surpassing CNN-based enhancement algorithms on multiple image tasks. To reduce the computational cost of the Transformer, SwinIR combined convolution with the Transformer, achieving outstanding image enhancement performance while effectively reducing computational complexity. Additionally, Uformer and Restormer both constructed encoder-decoder structures based on the Transformer to achieve image enhancement. Song et al. improved the normalization layer, activation function, and spatial information aggregation scheme of the Swin Transformer and proposed DehazeFormer, achieving excellent haze removal performance.

In 2021, the image dehazing algorithm HyLoG-ViT (hybrid local–global vision transformer) was introduced. It features a complementary enhancement framework with sub-networks designed for specific tasks. The encoder extracts both shallow and deep features, while three parallel decoders handle reflection prediction, shadow prediction, and the main dehazing task. The reflection and shadow prediction sub-tasks focus on learning color and texture features, which are then filtered and selected through a complementary feature module. These selected features are aggregated to support the main dehazing task. HyLoG-ViT’s core consists of two paths: a local Transformer and a global Transformer. These paths combine local and global features into hybrid features for dehazing. The method uses a local–global fusion approach, leveraging the Transformer for global context and feature enhancement for detail and edge preservation, resulting in images with natural colors and fine details. However, this approach increases computational complexity and may reduce the model’s interpretability.

This paper proposes a dehazing and prediction algorithm for road scene images based on an improved Transformer, which enhances three components: the encoder, the decoder, and the multi-scale attention refinement module. The structure of the haze image prediction network of the proposed method is shown in Fig. 1. In the encoder part, we adopt an improved Transformer structure, which reduces computational complexity through a Taylor-expanded multi-head self-attention mechanism. Specifically, we perform a Taylor series expansion of the Softmax function, simplifying the original quadratic complexity to linear complexity. The input to the encoder is an image affected by fog, which is decomposed into patches of different scales using multi-scale patch embedding technology and embedded with features through overlapping deformable convolutions. This method not only improves the efficiency of feature extraction but also enhances the model’s ability to capture features at different scales.
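A minimal sketch of the multi-scale patch embedding is given below. Plain overlapping strided convolutions stand in for the paper's overlapping deformable convolutions, and the channel width and stride are illustrative assumptions.

```python
# Minimal sketch of multi-scale overlapping patch embedding (deformable convolutions
# replaced by plain convolutions for brevity).
import torch
import torch.nn as nn

class MultiScalePatchEmbed(nn.Module):
    """Embeds an image into three parallel token maps with overlapping receptive fields."""
    def __init__(self, in_ch=3, dim=32):
        super().__init__()
        # stride < kernel size, so neighbouring patches overlap and boundary features are kept
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, dim, kernel_size=k, stride=2, padding=k // 2)
            for k in (3, 5, 7)
        ])

    def forward(self, x):                      # x: (B, 3, H, W)
        return [b(x) for b in self.branches]   # three (B, dim, H/2, W/2) feature maps

tokens = MultiScalePatchEmbed()(torch.randn(1, 3, 256, 256))
print([t.shape for t in tokens])
```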

Fig. 1.

Fig. 1

The haze image prediction network structure of the proposed method.

Taylor-expanded multi-head self-attention (T-MSA): The softmax function in standard multi-head self-attention has quadratic complexity (O(n²), where n is the number of patches), which limits efficiency. By performing a Taylor series expansion of the softmax function (retaining the first two terms, softmax(x) ≈ 1 + x − x²/2), we reduce the complexity to linear (O(n)). Theoretically, this expansion preserves the relative attention weights between patches, which is critical for capturing long-range dependencies in foggy images (e.g., linking distant lane markers obscured by fog). Unlike standard self-attention, which may overemphasize local fog noise, T-MSA maintains the global context of road scenes while reducing computational overhead, enabling more efficient extraction of fog-agnostic features (e.g., vehicle outlines, lane edges).

Multi-scale patch embedding (MPE): Foggy road images have heterogeneous feature scales: near objects (vehicles) have clear local features, while distant objects (distant roads) have blurred global features. MPE uses overlapping deformable convolutions of three scales (3 × 3, 5 × 5, 7 × 7) to embed patches. Theoretically, deformable convolutions adaptively adjust sampling positions to focus on fog-obscured regions (e.g., expanding the sampling area for blurry distant lanes), while overlapping patches reduce feature loss at patch boundaries, a common problem in standard non-overlapping embedding. This multi-scale design ensures the model captures both fine-grained details (e.g., traffic sign textures) and coarse-grained context (e.g., road layout), addressing the scale mismatch between fog features and road elements.

Multi-branch architecture (MBA): Road fog scenes have varying fog densities (light fog on near roads, heavy fog on distant roads), which require targeted feature extraction. Theoretically, each branch of MBA is optimized for a specific fog density (light: 0.6 < tmean ≤ 0.8; medium: 0.3 < tmean ≤ 0.6; heavy: tmean ≤ 0.3, where tmean is the transmission map mean). For light fog, the branch uses smaller convolution kernels (3 × 3) to refine local details; for heavy fog, larger kernels (7 × 7) extract global context to recover obscured features. This division of labor avoids the 'one-size-fits-all' problem of single-branch models, in which heavy fog regions may be over-smoothed or light fog regions retain noise. By fusing features from all branches, the model synthesizes a comprehensive feature map that adapts to heterogeneous fog densities.

Multi-scale attention refinement module (MSAR): Fog reduces the contrast of edge channels (e.g., lane marker edges) and blurs spatial relationships (e.g., vehicle-road spatial alignment). Theoretically, channel attention in MSAR uses a squeeze-and-excitation mechanism to calculate the importance of each feature channel, enhancing channels with road-related features (e.g., edge channels) and suppressing fog-noise channels. Spatial attention uses a 1 × 1 convolution to model the spatial correlation between pixels, focusing on regions such as lane markers and vehicles. Together, these two attention mechanisms address the dual degradation caused by fog: channel attention filters fog noise at the feature level, while spatial attention preserves the spatial structure of road elements, which is critical for autonomous driving perception. This theoretical advantage is verified by the ablation result that MSAR improves SSIM by 0.0033 (from 0.6245 to 0.6278), as it refines the fusion of multi-scale features.
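As an illustration of how a first-order Taylor expansion removes the quadratic cost, the sketch below rearranges the attention computation so the n × n weight matrix is never materialized. It follows the spirit of T-MSA; the exact expansion terms, normalization, and head configuration of the authors' implementation may differ.

```python
# Sketch of linear-complexity attention via the first-order expansion exp(q.k) ~ 1 + q.k.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaylorLinearAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                      # x: (B, n, d)
        B, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, n, self.heads, -1).transpose(1, 2)       # (B, h, n, d_h)
        k = k.view(B, n, self.heads, -1).transpose(1, 2)
        v = v.view(B, n, self.heads, -1).transpose(1, 2)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)  # keeps 1 + q.k non-negative
        kv = torch.einsum('bhnd,bhne->bhde', k, v)             # sum_j k_j v_j^T, linear in n
        num = v.sum(dim=2, keepdim=True) + torch.einsum('bhnd,bhde->bhne', q, kv)
        den = n + torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)).unsqueeze(-1) + 1e-6
        out = (num / den).transpose(1, 2).reshape(B, n, d)
        return self.proj(out)

y = TaylorLinearAttention(64)(torch.randn(2, 196, 64))          # (2, 196, 64)
```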

The decoder part is responsible for fusing and reconstructing the multi-scale features extracted by the encoder. We designed a multi-branch architecture, with each branch corresponding to a scale of features, and used a multi-scale attention mechanism to weight and fuse features of different scales. The output of the decoder is the dehazed image and the predicted visibility information. Through this multi-branch architecture, the decoder can better handle complex foggy scenes, improving the accuracy of dehazing and prediction.

To further improve the dehazing and prediction effects, we introduced a multi-scale attention refinement module in the decoder. This module includes both channel attention and spatial attention. Channel attention is used to selectively enhance important feature channels, while spatial attention is used to capture the relationships between features at different positions in the image. By using the multi-scale attention refinement module, different scales of features can be better fused, improving the accuracy of dehazing and prediction. This multi-scale attention mechanism allows the model to more accurately capture detailed information in the image, thereby enhancing overall performance.
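The sketch below shows one way to realize the channel (squeeze-and-excitation) and spatial (1 × 1 convolution) attention described above; the reduction ratio and layer sizes are assumptions, not values reported in the paper.

```python
# Minimal sketch of a channel + spatial attention refinement block.
import torch
import torch.nn as nn

class AttentionRefinement(nn.Module):
    def __init__(self, ch, reduction=8):
        super().__init__()
        # Channel attention: squeeze-and-excitation style gating over feature channels
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )
        # Spatial attention: 1 x 1 convolution producing a per-pixel importance map
        self.spatial = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x):             # x: (B, C, H, W) fused multi-scale features
        x = x * self.channel(x)       # re-weight channels (suppress fog-noise channels)
        return x * self.spatial(x)    # re-weight positions (emphasize road structures)

refined = AttentionRefinement(64)(torch.randn(1, 64, 128, 128))
```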

To train our model, we designed a comprehensive loss function that includes both dehazing loss and prediction loss. The dehazing loss uses L1 loss to calculate the difference between the dehazed image and the real fog-free image, while the prediction loss employs mean squared error (MSE) to measure the difference between the predicted visibility information and the real visibility. The comprehensive loss function is expressed as follows:

$$L_{\text{total}} = \alpha L_{\text{dehaze}} + \beta L_{\text{pred}} \tag{3}$$

Here, $L_{\text{dehaze}}$ represents the dehazing loss, $L_{\text{pred}}$ represents the prediction loss, and $\alpha$ and $\beta$ are weight parameters used to balance the two terms. Through cross-validation on Dataset 2's validation set, we determined the optimal values α = 0.7 and β = 0.3. With this comprehensive loss function, we can optimize the dehazing and prediction tasks simultaneously, ensuring that the model achieves good results in both aspects.
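A minimal PyTorch sketch of the comprehensive loss in Eq. (3), using the reported weights α = 0.7 and β = 0.3; the tensor shapes for the visibility head are illustrative assumptions.

```python
# Sketch of Eq. (3): L = alpha * L1(dehazed, clear) + beta * MSE(predicted, GT visibility).
import torch
import torch.nn as nn

class CombinedLoss(nn.Module):
    def __init__(self, alpha=0.7, beta=0.3):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.l1, self.mse = nn.L1Loss(), nn.MSELoss()

    def forward(self, dehazed, clear, pred_vis, gt_vis):
        return self.alpha * self.l1(dehazed, clear) + self.beta * self.mse(pred_vis, gt_vis)

loss_fn = CombinedLoss()
loss = loss_fn(torch.rand(4, 3, 256, 256), torch.rand(4, 3, 256, 256),
               torch.rand(4, 1), torch.rand(4, 1))
```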

During the training process, we used a large dataset of road foggy scene images. First, the dataset was preprocessed, including image normalization and data augmentation. Then, the preprocessed images were input into the model for training. Through the backpropagation algorithm, the model parameters were continuously optimized, gradually reducing the comprehensive loss function. After training, we evaluated the model using a validation set to ensure its effectiveness in practical applications. This systematic training process ensures that the model performs well in various foggy scenes.

Experimental results on road foggy scene datasets show that the proposed dehazing and prediction algorithm based on the improved Transformer achieves significant effects in both dehazing and prediction. Compared with existing algorithms, our method has obvious advantages in computational performance and prediction accuracy. Our method demonstrates excellent performance in the accuracy of visibility prediction after dehazing.

Experiment

We used three sets of road haze images for image prediction comparison experiments and compared the proposed method with the physics-aware haze image prediction method C2PNet21; the comparison demonstrates that the method proposed in this paper has better image prediction performance for foggy road scenes. The C2PNet haze image prediction model is shown in Fig. 2. Briefly, the physics-aware method uses hazy images and restored images from other prediction methods as negative samples. These negative samples are closer to the positive samples, providing better lower-bound constraints, and their difficulty levels are dynamically adjusted based on performance during training. During training, the weights of the negative samples are adjusted according to their difficulty levels to reduce learning ambiguity and ensure stable optimization of the model. The Physics-aware Dual-branch Unit (PDU) approximates the features of the atmospheric light and the transmission map through a dual-branch design, considering the physical characteristics of each factor. This allows for more precise synthesis of the latent clear image features according to the physical model, enhancing the interpretability of the feature space. The C2PNet network deploys the PDU into a cascaded backbone, combined with contrastive regularization, forming a complete prediction network that outperforms state-of-the-art methods in both synthetic and real-world scenarios22–26. In short, the method significantly improves single image prediction performance and the interpretability of the feature space by introducing contrastive regularization and the physics-aware dual-branch unit.

Fig. 2.

Fig. 2

C2PNet haze image prediction model.

Three custom foggy road scene datasets were constructed to cover diverse fog conditions and road scenarios, including both synthetic and real-world foggy images.

Dataset 1 (Synthetic Foggy Data): 1,500 images, generated by adding fog to clear road images using the physical atmospheric scattering model. The clear images include urban roads, suburban highways, and rural roads, with resolutions ranging from 442 × 297 to 550 × 543. It is supplemented by a large-scale real haze dataset with 3,000 images covering urban roads, suburban highways, and rural roads, with fog densities from light (visibility 800–1,000 m) to heavy (visibility < 200 m).

Dataset 2 (Real-World Foggy Data): 800 images, collected via on-vehicle cameras in foggy weather (cities: Beijing, Shanghai, Chengdu). The images include real light, medium, and heavy fog scenarios, with no artificial fog added. It is supplemented by a road-specific dataset with 1,200 real foggy images, including scenarios such as vehicle congestion, lane markers, and traffic signs.

Dataset 3 (Mixed Data): 1,200 images, combining 700 synthetic images (from Dataset 1) and 500 real-world images (from Dataset 2) to balance data diversity and authenticity. It is supplemented by a set focusing on complex road scenarios (overpasses, tunnels, rainy-foggy hybrid weather) with 800 real foggy images.

Ground-truth (GT) visibility information for training is obtained through two approaches, depending on the dataset. Synthetic datasets: when generating foggy images via the atmospheric scattering model7, we predefine the visibility (V) based on the atmospheric scattering coefficient (β) and scene depth (d), using the relationship V = −ln(0.01)/β (derived from the ASM, where 0.01 is the minimum light intensity detectable by imaging devices). For example, β = 0.01 m⁻¹ corresponds to V ≈ 460 m (medium fog), which is recorded as the GT visibility. Real-world datasets: GT visibility is collected synchronously with image capture via on-vehicle sensors: a laser rangefinder (measuring scene depth to calculate the transmission map mean tmean) and a meteorological sensor (recording real-time atmospheric visibility). For existing public datasets, we use the official GT visibility labels (provided with the dataset and calibrated with professional meteorological equipment). For unlabeled real images (e.g., 200 images in Dataset 2), we use the tmean-to-GT-visibility mapping established by the China Meteorological Administration: tmean > 0.6 → V > 500 m (light fog), 0.3 < tmean ≤ 0.6 → 200 < V ≤ 500 m (medium fog), tmean ≤ 0.3 → V ≤ 200 m (heavy fog). This mapping was verified by comparison with sensor-recorded visibility for 100 labeled images (error < 8%).

Referring to the fog grading standard of the China Meteorological Administration and visibility-based fog classification, fog density is divided into three levels based on atmospheric visibility (V) and the transmission map mean (tmean), as shown in Table 1.

Table 1.

Fog density grading criteria.

Fog density Visibility range (m) Transmission map mean (tmean) Description
Light fog 500 < V ≤ 1,000 0.6 < tmean ≤ 0.8 Slight blur, road edges and near objects clearly visible
Medium fog 200 < V ≤ 500 0.3 < tmean ≤ 0.6 Obvious blur, distant objects (≥ 50 m) indistinct
Heavy fog V ≤ 200 tmean ≤ 0.3 Severe blur, road markers and nearby vehicles barely recognizable
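For illustration, the Table 1 mapping from transmission map mean to fog level can be written as a small helper; the handling of values above 0.8, which fall outside the graded range, is our own assumption.

```python
# Sketch of the Table 1 fog-density grading by transmission map mean.
def fog_level(t_mean: float) -> str:
    if t_mean <= 0.3:
        return "heavy fog"     # V <= 200 m
    if t_mean <= 0.6:
        return "medium fog"    # 200 m < V <= 500 m
    if t_mean <= 0.8:
        return "light fog"     # 500 m < V <= 1000 m
    return "no/trace fog"      # above the graded range (assumption)

print(fog_level(0.25), fog_level(0.45), fog_level(0.70))
```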

All three datasets were randomly divided into training, validation, and test sets at a ratio of 7:2:1, with no overlap between sets. Data augmentation was only applied to the training set to avoid overfitting, including horizontal flipping (probability = 0.5), random cropping (fixed output size: 256 × 256), brightness adjustment (± 10%), and contrast adjustment (± 15%).
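For illustration, the training-set augmentation listed above can be composed with torchvision transforms as in the sketch below; the exact composition is an assumption rather than the authors' released pipeline, and for paired hazy/clear supervision the same random parameters would need to be applied to both images of a pair.

```python
# Sketch of the training-set augmentation (flip, crop, brightness/contrast jitter).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                   # horizontal flip, probability 0.5
    transforms.RandomCrop(256, pad_if_needed=True),           # random crop to 256 x 256
    transforms.ColorJitter(brightness=0.10, contrast=0.15),   # +/-10% brightness, +/-15% contrast
    transforms.ToTensor(),
])
```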

Hardware environment: CPU: Intel Xeon Gold 6248 (2.5 GHz); GPU: NVIDIA RTX 3090 (24 GB VRAM); memory: 128 GB DDR4.

Software environment: framework: PyTorch 1.12.0; CUDA version: 11.6; programming language: Python 3.9.

Training hyperparameters: optimizer: AdamW (weight decay = 1e-5, β₁ = 0.9, β₂ = 0.999); initial learning rate: 1e-4, decayed with a cosine annealing schedule to a final learning rate of 1e-6 after 100 epochs; batch size: 16 (reduced to 8 for high-resolution images > 500 × 500 due to GPU memory constraints); training epochs: 100, with early stopping if the validation loss does not decrease for 15 consecutive epochs; loss function weights: in the comprehensive loss function (Eq. (3)), α = 0.7 (weight of the dehazing loss, L1) and β = 0.3 (weight of the prediction loss, MSE), determined via cross-validation on Dataset 2's validation set.

Evaluation tools: PSNR and SSIM were calculated using the skimage.metrics library (SSIM window size = 11, σ = 1.5); computational complexity metrics (parameters, MACs) were computed using torchstat and fvcore.nn.flop_count_table.
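For illustration, the optimizer, cosine annealing schedule, and early stopping described above can be configured as in the sketch below; `model` is a placeholder for the dehazing network and the per-epoch training/validation code is elided.

```python
# Sketch of the training configuration (AdamW, cosine annealing to 1e-6, early stopping).
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder for the actual network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5,
                              betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

best_val, patience, bad_epochs = float("inf"), 15, 0
for epoch in range(100):
    # ... train one epoch, then compute val_loss on the validation split ...
    val_loss = 0.0                            # placeholder value
    scheduler.step()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # stop after 15 epochs without improvement
            break
```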

We used three sets of custom foggy road scene image data for evaluation. Each set of data images contains different haze concentrations and road scenes to ensure that the model is tested in diverse environments. The experimental comparison evaluated the performance of the method proposed in this paper on the three sets of custom foggy road scene image datasets, with the experimental results showing a PSNR of 12.9836 and an SSIM of 0.6278. The results indicate that the method proposed in this paper performs excellently on all data samples, significantly outperforming the physics-aware haze image prediction method.

In Fig. 3, the first test image was taken from a high vantage point, observing road conditions in foggy weather, with a resolution of 550 × 543. The comparison shows that both methods produce a clear prediction effect under foggy conditions, and the C2PNet result appears more realistic. However, the method proposed in this paper handles the predicted image details better, especially restoring the particularly blurry details in the distance more completely, making it more suitable for complete detail prediction in foggy weather for autonomous driving scenarios.

Fig. 3.

Fig. 3

Test Image 1, with three images in sequence: the original hazy and blurry image, the dehazed image predicted by C2PNet, and the image predicted by our method.

As shown in Fig. 4, the second test image captures vehicle scenes in foggy weather, with a resolution of 550 × 413. Although the prediction by C2PNet appears more realistic, its overall brightness is darker and some details are not very clear. In contrast, the method proposed in this paper presents more details, which helps provide more complete and richer information for autonomous driving perception27–29.

Fig. 4.

Fig. 4

Test Image 2, with three images in sequence: the original hazy and blurry image, the dehazed image predicted by C2PNet, and the image predicted by our method.

In Fig. 5, the third image captures road scenes in light foggy weather, with a resolution of 550 × 413. Although the method proposed in this paper presents more complete details, especially in the complex distant scenes, there is some distortion in the sky. However, for images used in machine recognition rather than human observation, more details can provide more comprehensive decision-making reference information.

Fig. 5.

Fig. 5

Test Image 3, with three images in sequence: the original hazy and blurry image, the dehazed image predicted by C2PNet, and the image predicted by our method.

In Fig. 6, the fourth test image captures road scenes in light foggy weather, with a resolution of 442 × 297. This image is from a real road scene with haze. Both algorithms show a defogging effect, although its intensity is not very strong. Unlike simulated images, the haze is not evenly distributed over the entire image. Upon careful observation, there is indeed a certain defogging effect on the blurry objects. The real haze image data used in this article are from 10.3390/electronics13183661.

Fig. 6.

Fig. 6

Test Image 4, with three images in sequence: the original hazy and blurry image, the dehazed image predicted by C2PNet, and the image predicted by our method.

In Fig. 7, the fifth test image captures road scenes in light foggy weather, with a resolution of 484 × 273. This image is also from a real haze scene. Upon careful observation, a certain defogging effect can be seen, and the algorithm proposed in this paper yields better results. Since there are no clear ground-truth versions of these two hazy images, metrics such as structural similarity cannot be measured. It can be observed that SOTA methods (e.g., DehazeFormer, HyLoG-ViT) achieve higher visual quality but have larger parameter sizes, whereas our method, while having lower PSNR/SSIM than SOTA, better preserves road details (e.g., distant lane markers and vehicle outlines) and maintains real-time inference (28.7 FPS), which is critical for autonomous driving.

Fig. 7.

Fig. 7

Test Image 5, with three images in sequence: the original hazy and blurry image, the dehazed image predicted by C2PNet, and the image predicted by our method.

Among the comparative methods (C2PNet, GDN, and MSBDN) and the proposed method, the computational characteristics can be detailed as follows: GDN shows 0.96 M parameters with 21.5G MACs (multiply-accumulate operations). MSBDN demonstrates 31.35 M parameters and 41.54G MACs. The proposed method presents 2.68 M parameters and 38.5G MACs, which represents a significant improvement in parameter efficiency compared to previous methods, particularly MSBDN.

Notably, the proposed method achieves comparable or superior performance with substantially fewer parameters. The low parameter count of 2.68 M coupled with 38.5G MACs indicates an efficient computational approach that maintains high image prediction quality across different datasets, as evidenced by the consistently competitive PSNR and SSIM metrics.
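For reference, parameter counts and operation counts like those quoted above can be reproduced with fvcore, one of the tools named in the experimental setup; the model below is a placeholder, and fvcore reports fused multiply-adds according to its own convention.

```python
# Sketch of counting parameters and operations with fvcore.
import torch
from fvcore.nn import FlopCountAnalysis, flop_count_table

model = torch.nn.Sequential(torch.nn.Conv2d(3, 32, 3, padding=1),
                            torch.nn.Conv2d(32, 3, 3, padding=1))   # placeholder network
dummy = torch.randn(1, 3, 550, 543)                                  # a standard-resolution input

params_m = sum(p.numel() for p in model.parameters()) / 1e6
flops = FlopCountAnalysis(model, dummy)
print(f"Params: {params_m:.2f} M")
print(flop_count_table(flops))                                       # per-module operation breakdown
```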

Color distortion mainly occurs because the algorithm typically estimates image features based on a preset image formation model (such as the atmospheric scattering model). There are biases in the estimation of the model parameters, and the ability to extract and fuse features in complex scenes and high-density fog conditions is limited. The experimental results indicate that although the proposed method has advantages in detail restoration, some issues remain: for example, there is slight distortion in the sky area of Fig. 5, and colors may not be accurate enough when dealing with high-density fog and complex scenes.

In this study, we propose an algorithm for predicting haze road scene images based on an improved Transformer. By introducing multi-scale patch embedding, multi-branch structure, and multi-scale attention modules, our method achieves significant improvements in both prediction performance and interpretability of the feature space (Table 2). Although the improved Transformer architecture excels at capturing global contextual information and detailed features, it has high computational complexity. The introduction of multi-scale patch embedding and multi-branch structure enhances model performance but also increases computational costs and memory consumption. Therefore, a key issue for practical applications is how to reduce computational complexity while ensuring effective prediction, which requires further research. Despite achieving good experimental results, there are still some issues and challenges that need to be discussed and resolved. Additionally, while our algorithm performs excellently on a custom haze road scene image dataset, its generalization ability in other types of haze scenes still needs to be validated30.

Table 2.

Evaluation of haze image prediction quality.

Test image C2PNet PSNR C2PNet SSIM Proposed PSNR Proposed SSIM
First image 12.62 0.5179 14.16 0.5956
Second image 9.48 0.4383 13.25 0.6139
Third image 12.33 0.6311 12.59 0.6539

Future work should consider testing on more diverse real-world datasets to assess the algorithm's generalization performance and robustness. Furthermore, although our method can effectively restore image details during the prediction process, color distortion and artifacts may still occur in certain cases. This may be due to the model's limited ability to extract and fuse features when handling high concentrations of haze or complex scenes. Future research could explore incorporating more prior knowledge or improving the feature extraction modules to further enhance the quality of image prediction.

Although Transformers show great potential in image prediction tasks, their multi-layer network structure may lead to reduced model interpretability. Designing more interpretable model structures to make the prediction process more transparent and controllable is a direction worth exploring in depth. In summary, the proposed haze road scene image prediction algorithm based on an improved Transformer has made significant progress in prediction performance and interpretability of the feature space, but there are still many issues and challenges that require further research and resolution31.

Table 3 presents a quantitative comparison between our method and representative SOTA dehazing models (DehazeFormer, HyLoG-ViT, DehazeViT) on Dataset 3 (mixed data). The comparison metrics include PSNR, SSIM, parameter count (Params), MACs, and FPS.

Table 3.

Comparison of our method with SOTA dehazing models.

Model PSNR SSIM Params (M) MACs (G) FPS
DehazeViT (2022) 13.15 0.6320 12.50 56.8 9.8
HyLoG-ViT (2021) 13.02 0.6295 8.70 51.3 15.6
DehazeFormer (2023) 12.95 0.6280 5.20 45.7 22.3
Our Method 12.98 0.6278 2.68 38.5 28.7

Key observations from the table: Our method achieves comparable PSNR (12.98) and SSIM (0.6278) to SOTA models, with only a slight difference from DehazeViT (PSNR: 13.15, SSIM: 0.6320). In terms of efficiency, our method has the lowest parameter count (2.68 M, 78.6% lower than DehazeViT) and MACs (38.5G, 32.2% lower than DehazeViT), and the highest FPS (28.7, 192.9% higher than DehazeViT). For high-resolution images (1080 × 720), our method maintains 15.2 FPS, while SOTA models like DehazeViT drop below 10 FPS, confirming our method’s superiority in real-time autonomous driving scenarios. This comparison demonstrates that our method balances prediction accuracy and computational efficiency, making it more suitable for practical deployment in foggy road scene autonomous driving than existing SOTA models.

The ablation experiments were conducted on Dataset 3 (mixed data), with the baseline model being a standard Transformer without any improvements. The results are shown in Table 4:

Table 4.

Ablation experiments on Dataset 3.

Model variant PSNR SSIM Parameters (M) MACs (G) FPS
Baseline (Standard Transformer) 10.21 0.4803 5.10 48.2 19.3
Baseline + T-MSA 11.54 0.5517 3.20 42.1 23.5
Baseline + T-MSA + MPE 12.30 0.6002 2.80 39.7 26.1
Baseline + T-MSA + MPE + MBA 12.72 0.6245 2.75 38.9 27.4
Ours (All Modules) 12.98 0.6278 2.68 38.5 28.7

Analysis of Ablation Results. Adding T-MSA (Taylor-expanded softmax) increases PSNR by 1.33 and SSIM by 0.0714, while reducing parameters by 37.3% (5.10 M → 3.20 M) and MACs by 12.7% (48.2G → 42.1G). This confirms that T-MSA effectively reduces computational complexity while enhancing long-range dependency modeling. MPE: Adding multi-scale patch embedding further improves PSNR by 0.76 and SSIM by 0.0485, as overlapping deformable convolutions capture multi-scale features (e.g., small road markers and large vehicles). MBA: The multi-branch architecture increases PSNR by 0.42 and SSIM by 0.0243, as it enables targeted processing of different fog densities (each branch handles one fog density level). MSAR: The multi-scale attention refinement module slightly improves SSIM by 0.0033 (from 0.6245 to 0.6278) by enhancing channel-wise and spatial feature fusion, especially for edge details. Our method’s FPS is 40.7% higher than C2PNet and 20.0% higher than HazeFormerV2—meeting the real-time requirement of autonomous driving (≥ 20 FPS). The low inference time is attributed to T-MSA (linear complexity) and reduced parameters (2.68 M), which minimize GPU memory access and computation. For high-resolution images (1080 × 720), our method achieves 15.2 FPS—still above the 10 FPS threshold for real-time perception in autonomous driving—while DehazeViT drops to 9.8 FPS (below threshold). This confirms our method’s suitability for on-vehicle deployment.

We have supplemented experiments to evaluate the inference speed of our method on edge-like hardware (simulated by limiting GPU memory to 8 GB, similar to edge AI devices such as NVIDIA Jetson AGX Xavier) and obtained the following results: for standard-resolution foggy road images (550 × 543), our method achieves an inference speed of 28.7 FPS, which is 40.7% higher than C2PNet and 20.0% higher than HazeFormerV2—meeting the real-time requirement of autonomous driving (≥ 20 FPS). For high-resolution images (1080 × 720, common in on-vehicle cameras), our method still maintains 15.2 FPS, while SOTA methods like DehazeViT drop to 9.8 FPS (below the 10 FPS threshold for real-time perception). Our model has only 2.68 M parameters, which is significantly lower than MSBDN (31.35 M) and DehazeViT (12.5 M), reducing memory consumption and making it suitable for deployment on resource-constrained edge devices. These results confirm the practicality of our method for edge device deployment in autonomous driving.

Conclusion

With the continuous development of deep learning, research in the field of image prediction has become increasingly in-depth. Deep learning-based prediction algorithms have achieved excellent results, but there are still several challenges. One challenge is the dependence on the ASM model; some deep learning methods (such as those based on Transformer or CNN algorithms) adhere to the image formation process assumed by the ASM model. However, as research progresses, if the rationality and scientific validity of this process are overturned, algorithms based on this model will lose their significance. Another challenge is poor model generalization; most image prediction algorithms based on Transformer or CNN are trained on synthetic datasets, but the distribution of haze in synthetic samples differs from that in real scenes. This discrepancy makes it difficult for prediction models trained in the synthetic domain to effectively handle images in real scenes, resulting in poor generalization ability. Additionally, color distortion is a concern; various image prediction algorithms (such as those based on Transformer or CNN) typically estimate the features and distribution of images based on assumed models. Deviations in these estimates can lead to incorrect adjustments in image colors, resulting in color distortion, which severely degrades the visual quality of images and affects subsequent analysis and applications.

In this study, we adopted an improved Transformer-based image prediction algorithm for foggy road scenes. Our method achieves excellent performance on three custom foggy road datasets, with a PSNR of 12.98 and an SSIM of 0.6278, outperforming the physics-aware method (C2PNet) in detail restoration (e.g., recovering blurry distant lane markers and vehicle outlines). The proposed T-MSA, MPE, and MBA modules effectively balance prediction accuracy and computational efficiency, enabling real-time inference (28.7 FPS at standard resolution) with a low parameter count (2.68 M). The method provides a lightweight, high-performance solution for foggy road scene image prediction, addressing the trade-off between computational cost and adaptation to diverse fog densities in autonomous driving.

Author contributions

B.-T.Z. conceived the study, designed the algorithm, and wrote the main manuscript text. A.-Y.Z. carried out data preprocessing, conducted experiments, and contributed to the preparation of figures. P.X. assisted with model implementation, parameter tuning, and result analysis. All authors reviewed and approved the final manuscript.

Data availability

The datasets used in this study on foggy road scene image prediction based on the improved Transformer include custom foggy road scene datasets and public fog/road-related datasets. To verify the generalization ability of the algorithm, three public fog/road-related datasets are introduced, with specific information as follows. RESIDE: a large-scale real fog dataset containing 3000 images covering scenes such as urban roads, suburban highways, and rural roads. This dataset provides officially labeled real visibility tags, which are calibrated by professional meteorological equipment and can be directly used for visibility prediction loss calculation and performance evaluation in model training. O-HAZE: a fog dataset dedicated to road scenarios. Its annotations include fog density levels and the positions of road elements, which can be used to verify the algorithm's ability to restore details of key elements in foggy road scenes. Front and rear images of car: this dataset contains 894 training images, each captured from a camera mounted on a car's dashboard; in each image, all cars viewed from the front or rear are annotated with a bounding box. The data used in this paper are publicly available at the following web platforms: https://sites.google.com/view/reside-dehaze-datasets/reside-standard?authuser=0; https://data.vision.ee.ethz.ch/cvl/ntire18//o-haze/; https://www.kaggle.com/datasets/kushkunal/front-and-rear-images-of-car.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Lu, W., Sun, X., & Li, C. A new method of object saliency detection in foggy images. In Image and Graphics 8th International Conference ICIG 2015, vol. 9217 (2015) 206–217.
  • 2.Sindagi, V. A., Oza, P., Yasarla, R. et al. Prior-based domain adaptive object detection for hazy and rainy conditions. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16 (Springer International Publishing, 2020). 763–780.
  • 3.Zhang, Q. et al. Global and local information aggregation network for edge-aware salient object detection. J. Vis. Commun. Image Represent.81, 103350 (2021). [Google Scholar]
  • 4.Singh, D. & Kumar, V. A comprehensive review of computational dehazing techniques. Arch. Comput. Methods Eng.26(5), 1395–1413 (2019). [Google Scholar]
  • 5.Zhan, Y. & Zhao, W. L. Instance search via instance level segmentation and feature representation. J. Vis. Commun. Image Represent.79, 103253 (2021). [Google Scholar]
  • 6.Zhang, J. et al. Stable self-attention adversarial learning for semi-supervised semantic image segmentation. J. Vis. Commun. Image Represent.78, 103170 (2021). [Google Scholar]
  • 7.Nishita, T., Miyawaki, Y. & Nakamae, E. A shading model for atmospheric scattering considering luminous intensity distribution of light sources. Acm Siggraph Comput. Grap.21(4), 303–310 (1987). [Google Scholar]
  • 8.Stark, J. A. Adaptive image contrast enhancement using generalizations of histogram equalization. IEEE Trans. Image Process.9(5), 889–896 (2000). [DOI] [PubMed] [Google Scholar]
  • 9.Kim, J. Y., Kim, L. S. & Hwang, S. H. An advanced contrast enhancement using partially overlapped sub-block histogram equalization. IEEE Trans. Circuits Syst. Video Technol.11(4), 475–484 (2001). [Google Scholar]
  • 10.Galdran, A. et al. Fusion-based variational image dehazing. IEEE Signal Process. Lett.24(2), 151–155 (2016). [Google Scholar]
  • 11.He, K., Sun, J. & Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell.33(12), 2341–2353 (2010). [DOI] [PubMed] [Google Scholar]
  • 12.Li, H., Li, J., Zhao, D. et al. Dehazeflow: Multi-scale conditional flow network for single image dehazing. In Proceedings of the 29th ACM International Conference on Multimedia (2021). 2577–2585.
  • 13.Zhu, Q., Mai, J., Shao, L. Single image dehazing using color attenuation prior. In BMVC, Vol. 4 (2014). 1674–1682.
  • 14.Cai, B. et al. Dehazenet: An end-to-end system for single image haze removal. IEEE Trans. Image Process.25(11), 5187–5198 (2016). [DOI] [PubMed] [Google Scholar]
  • 15.Li, B., Peng, X., Wang, Z., et al. Aod-net: All-in-one dehazing network. In Proceedings of the IEEE international conference on computer vision (2017). 4770–4778.
  • 16.Zhang, H., Patel, V. M. Densely connected pyramid dehazing network. In Proceedings of the IEEE conference on computer vision and pattern recognition (2018). 3194–3203.
  • 17.Liu, X., Ma, Y., Shi, Z., et al. Griddehazenet: Attention-based multi-scale network for image dehazing. In Proceedings of the IEEE/CVF international conference on computer vision (2019). 7314–7323.
  • 18.Qin, X., Wang, Z., Bai, Y., et al. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34(07), (2020). 11908–11915.
  • 19.Dong, H., Pan, J., et al. Multi-scale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020). 2157–2167.
  • 20.Wu, H., Qu, Y., Lin, S. et al. Contrastive learning for compact single image dehazing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021). 10551–10560.
  • 21.Zheng, Y., et al. Curricular contrastive regularization for physics-aware single image dehazing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2023).
  • 22.Wang, X., Ke, C., Yu, K., et al. EDVR: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (2019). 1954–1963.
  • 23.Liu, Y., Pan, J., Ren, J., et al. Learning deep priors for image dehazing. In Proceedings of the IEEE/CVF international conference on computer vision (2019). 2492–2500.
  • 24.Dong, H., Pan, J., Xiang, L., et al. MultiScale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020). 2154–2164.
  • 25.Hong, M., Xie, Y., Li, C., et al. Distilling image dehazing with heterogeneous task imitation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020). 3459–3468.
  • 26.Chen, Z. Y., Wang, Y. C., Yang, Y., et al. PSD: Principled synthetic-to-real dehazing guided by physical priors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021). 7180–7.
  • 27.Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. An image is worth 16×16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR) (2021).
  • 28.Meng, G., Wang, Y., Duan, J., et al. Efficient image dehazing with boundary constraint and contextual regularization. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV) (2013). 617–624.
  • 29.Zhu, Q., Mai, J. & Shao, L. A fast single image haze removal algorithm using color attenuation prior. IEEE Trans. Image Process.24(11), 3522–3533 (2015). [DOI] [PubMed] [Google Scholar]
  • 30.Ancuti, C., Ancuti, C., Hermans, C., et al. A fast semi-inverse approach to detect and remove the haze from a single image. In Proc. 10th ACCV (2011). 501–514.
  • 31.Eigen, D., Krishnan, D., Fergus, R. Restoring an image taken through a window covered with dirt or rain. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV) (2013). 633–640.


