Scientific Reports. 2025 Sep 25;15:32769. doi: 10.1038/s41598-025-15480-0

Fidelity assessment of synthetic images with multi-criteria combination under adverse weather conditions

Alexandra Duminil 1, Sio-Song Ieng 1, Dominique Gruyer 1
PMCID: PMC12464247  PMID: 40998885

Abstract

With the development of AI-based perception algorithms using cameras, access to large and representative datasets is crucial. For autonomous driving systems, it is essential to use road-context datasets covering the entire Operational Design Domain, including various road configurations, driving scenarios, and weather conditions. In this context, it is imperative to propose mechanisms and metrics for quantifying the fidelity and representativeness of simulated datasets, in order to evaluate and validate their usability for training and evaluation stages. In this paper, we propose an objective, multi-modal approach that computes 4 scores representing several aspects of synthetic image fidelity. These scores address local, global and statistical texture analysis. In addition, a multi-criteria approach, based on evidence theory, is proposed to merge these scores into a final global score, together with a quantification of uncertainty and conflict. This method has been applied to a large number of real and virtual datasets under different weather conditions (clear, rainy, foggy). The initial results are promising and confirm the interest and relevance of this method.

Keywords: Synthetic image fidelity, Image analysis, Multi-criteria combination

Subject terms: Computer science, Scientific data

Introduction

The rapid development of connected and automated vehicles requires an increasing number of onboard sensors to ensure effective environmental perception and, most importantly, the safety of vehicle passengers and other road users, particularly the vulnerable ones. The algorithms processing the sensor data are equally crucial, representing key components for the deployment of the new generation of mobility means and services. Typically, these algorithms leverage machine learning methods, particularly deep learning models, which require large amounts of sensor data and ground truths for training. However, acquiring real-world data and annotating data covering the Operational Design Domain (ODD) are both time-consuming and expensive. Additionally, controlling environmental conditions such as traffic state, lighting, and weather is not feasible, making it challenging to cover all possible scenarios that an automated vehicle might encounter. Using solely real-world data risks omitting critical conditions. To address these limitations, synthetic data generated through simulation offers a practical and cost-effective solution. These simulations aim to control environments and to meet repeatability and reproducibility constraints, ensuring diverse data that covers a wide range of situations. Nevertheless, the challenge lies in ensuring the realism and fidelity of synthetic data compared to real-world data. Many efforts have been made to create larger and more realistic synthetic datasets using graphical engines, advanced simulation platforms1,2 or algorithms that adapt the synthetic to the real domain. However, a notable disparity remains between synthetic and real datasets, commonly called the "domain gap". This gap includes two key aspects: appearance and content differences. While appearance refers to object appearance such as color, texture or lighting, content refers to the scene layout. To reduce this gap, synthetic datasets need to be as faithful to reality as possible. Following this, it is relevant to quantify this level of fidelity in order to use these datasets with confidence for the training of deep-learning algorithms.

The PRISSMA project, part of the French Grand Challenge on AI, addresses key challenges related to system validation. Its objectives include developing a methodology for validating and evaluating systems of systems and AI-based systems, as well as verifying the tools and models used in evaluation processes. To enhance efficiency, PRISSMA also proposes automating the simulation chain, from data and scenario generation to execution and analysis. Our research contributes to this effort by proposing a methodology and metrics to evaluate the quality and fidelity of synthetic data used in AI system validation. The main challenge we address is developing a scoring system that ensures generated simulation data are sufficiently realistic, from the perspective of onboard vehicle sensors, to be used in the learning, evaluation, and validation of AI-based perception algorithms.

This work is a continuation of our previous work3, which focused on clear weather conditions in order to validate the relevance and performance of our method. It has been extended to various weather conditions, allowing us to determine whether the datasets are faithful enough with respect to the selected features. Based on the work of Haralick et al.4, we decided to analyze the texture and frequency features of both synthetic and real images. The aim is to assess the fidelity of synthetic images using real images as references. Specifically, texture information is extracted from images by employing a statistical texture analysis method known as the grey level co-occurrence matrix (GLCM)4. The GLCM allows assessing structural image properties through the spatial relations between pixels, via a set of 14 Haralick metrics. In addition to this global texture analysis, the local binary pattern (LBP) method has also been incorporated, which provides a more localized texture analysis. Additionally, we use the discrete cosine transform to handle high-frequency information. These techniques are used as inputs for a Convolutional Neural Network (CNN)-based classification approach. The objective is to ascertain the level of fidelity of synthetic images with respect to the three discussed texture and frequency features. These features provide insight into different aspects of image fidelity. Afterwards, a statistical method involving the Haralick metrics is proposed. In contrast and as a complement to the CNN approaches, we derive fidelity scores with a closed-form equation from the Haralick metrics. The experiments are conducted on data collected in an urban area, under clear, rainy and foggy weather conditions. Moreover, a downstream method is proposed to consolidate the multiple-score approach with a Belief Function theory-based multi-criteria method. This approach handles information sources, uncertainties and the possible conflicts between scores, providing a comprehensive assessment of synthetic data fidelity. In addition, unlike previous application-specific methods, our approach is intended to be generic and comprehensive, and applicable to a variety of datasets.

Related works

Several approaches have been proposed to address fidelity in images and domain gap adaptation. Below, we discuss the most representative works and compare them with our proposed method.

Ye et al.5 proposed a comprehensive framework of fidelity, dividing it into objective (physical and functional) and subjective (sensory, conceptual, emotional) components. They highlighted that while several methods, both qualitative and quantitative, have been proposed to measure fidelity, most are biased and specific to particular applications. Building on this clear framework and definition, we have structured our study, defining fidelity as the similarity between virtual and real features. High fidelity implies a faithful representation of reality, while low fidelity suggests a simpler one. Some works on fidelity are primarily application-based and can be likened to content domain gap works. Some of them focus on quantifying the simulation-to-reality gap (S2R) in sensor models for object detection6–9. They assess sensor performance by comparing processed data, such as point clouds and bounding boxes produced by algorithms in real and simulated situations, and measure the disparity between simulation and reality, validating sensor models. Reway et al.6 conducted a comparison between an environment simulation software and real-world test drives, evaluating the gap between simulation and reality using metrics such as Precision, Recall, MOTA, and MOTP applied to object lists from both domains. Prabhu et al.7 proposed domain adaptation via Conditional Alignment and Reweighting (CARE) to systematically leverage target labels in order to explicitly reduce the gap between the simulated and real domains, but it does not offer scores or metrics. Ngo et al.8 aimed to evaluate the fidelity of typical radar model types and their applicability for virtually testing radar-based multi-object tracking with a multi-level testing method. The main objective of Huch et al.9 is to quantify the simulation-to-real domain shift by analyzing point clouds at the target level, comparing real-world and simulated point clouds within the 3D bounding boxes of the targets. However, these works were carried out in the context of a specific application, namely object detection and tracking using digital models. Other studies are more related to the appearance domain gap10,11. A recent one by Valdebeniton et al.12 introduced a comprehensive exploration of metrics and datasets for evaluating the fidelity of GAN-generated images. Among them, the classical SSIM, PSNR, FID, IS, KID or LPIPS are mentioned. Another approach10 introduced a means to assess the quality levels of synthetic underwater images. This is achieved by extracting three feature-based measures (statistical, perceptual, and texture-based) from a transmission map. This work is related to ours since the mentioned method focuses on determining the faithfulness of the underwater images it produces, whereas our objective is to quantify the level of fidelity in computer-generated images. Gadipudi et al.11 proposed a method to estimate the S2R gap by computing the Euclidean distance between various real and synthetic datasets. It employs feature embedding methods to extract pertinent features. However, this method requires existing real datasets to obtain a gap value. Other methods rely on vision transformers for feature extraction and texture analysis, applied to tasks such as pattern recognition13 and super-resolution14. Their strong feature representation capabilities allow them to capture fine-grained visual details. While no standardized metrics currently exist, these transformer-based approaches show promising results in texture-related tasks.

Methodology and features extraction

The realism of an image is a complex notion to quantify, but it is reasonable to think that it is a combination of many aspects, some of which are measurable features. To decide whether a synthetic image is realistic enough, it is important to examine these different features. A feature-based analysis of Computer-Generated Images (CGI) coming from different synthetic datasets is conducted in order to study and quantify their level of fidelity. Considering that images contain both spatial and frequency information, we decided to extract both global (GLCM) and local (LBP) texture features to capture spatial information, as well as frequency features (DCT) to represent frequency characteristics of the synthetic images. These features are the inputs of the learning-based models, the first method presented. Each model provides a score measuring fidelity with respect to a given characteristic. Secondly, a statistical method is performed using Haralick metrics, which focuses on information derived from image textures. This second approach follows the same idea: measuring the contribution of each metric to decide whether the synthetic image is realistic enough. Fig. 1 presents the diagram of the proposed method with four scores based on the features presented above. The fusion of these scores, based on belief function theory, provides a global fidelity score and deep insight into the uncertainty and possible conflict between these features.

Fig. 1. Diagram of the proposed method. The input image comes from the vKitti dataset15. This diagram has been made by the authors of the manuscript. https://europe.naverlabs.com/proxy-virtual-worlds-vkitti-2/.

Datasets used in the study

In the context of this research work, the selection of datasets is a critical stage and must be done carefully for model training, validation and testing. To guarantee the quality of this dataset, we use the following requirements for its constitution: the dataset must involve both real data and synthetic data and, if possible, pairs of images representing the same scenes. The dataset should cover different weather conditions and different types of infrastructure. Thus, different freely available datasets are used in this work, including real and synthetic images in three different weather conditions (clear, foggy and rainy), as shown in Table 1. For the foggy and rainy datasets, some images are augmented with synthetic perturbations (see the third row, half virtual).

Table 1. Datasets used in the study.

| Image type | Clear weather | Foggy weather | Rainy weather |
| Real | Kitti16, Cityscapes17, Once18, Nuscenes19 | DAWN20, RTTS | Rain in Drive21 (RID), Adverse Conditions Dataset with Correspondences (ACDC)22 |
| Virtual | Virtual Kitti 215, Kitti CARLA23, Synthia24, GTA V25, Improved GTA V26 | Virtual Kitti 215 | Synthia24 |
| Half virtual | - | FoggyCityscapes27 | RainCityscapes28 |

All datasets used in this study are publicly available and do not contain personally identifiable information. The images used have been verified and individuals cannot be identified directly or indirectly. Furthermore, the real datasets serve as reference benchmarks, as the primary objective is to evaluate the fidelity of the synthetic datasets.

Local and global texture feature extraction

Texture is naturally present in images, and it is reasonable to assume that there is a significant difference between the textures of CGI and those of real images. To verify this, we decided to use the GLCM as a means to extract image textures globally, and the LBP method to extract these texture patterns locally. The GLCM technique captures the spatial relationships between pixel intensities and provides valuable insight into the co-occurrence of pixels within image regions. Previous studies have successfully used this method to distinguish between GAN-generated images and real images based on their co-occurrence patterns, leading to very interesting results29. Inspired by this method, an analysis of the pixel co-occurrence levels was performed on images from both real and synthetic datasets. This analysis shows that these two types of images have different characteristics. Among the differences, one can classify the GLCMs into two categories: the Continuous Diagonal GLCM (CDG), when every gray tone is present in the image, and the Discontinuous Diagonal GLCM (DDG), when some gray tones are absent from the image, resulting in a cartoon-like picture. Fine textural characteristics are reflected by high frequency values close to the diagonal of the GLCM; on the contrary, coarse textures and edges are reflected by high frequency values far from the GLCM's diagonal.
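As an illustration, the average GLCM of a set of images can be computed per channel with scikit-image. This is a minimal sketch; the offset, angle, and number of gray levels are illustrative assumptions, not the exact configuration used in this study.

```python
# Sketch: average normalized GLCM over a list of single-channel uint8 images.
import numpy as np
from skimage.feature import graycomatrix

def average_glcm(images, levels=256):
    acc = np.zeros((levels, levels), dtype=np.float64)
    for img in images:
        # Offset of 1 pixel, horizontal direction; symmetric and normalized.
        glcm = graycomatrix(img, distances=[1], angles=[0],
                            levels=levels, symmetric=True, normed=True)
        acc += glcm[:, :, 0, 0]
    return acc / len(images)
```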

Fig. 2 illustrates the observed categories of GLCMs (CDG and DDG) with the average GLCM of synthetic and real images under different weather conditions. The GLCM values appear in the form of spots, which may or may not be uniformly distributed along the diagonal. The size of these spots is delimited by $d$ in the figure. In particular, $d$ denotes the distance between the diagonal and a point in an area where the values are low, which represents the coarse texture features. From the observations made, we can establish the following hypotheses:

[Equations (1)–(3), formalizing these hypotheses, are not recoverable from the source extraction.]

The discontinuities along the diagonal of certain GLCMs, represented by the presence of several distinct spots, are noticeable in synthetic datasets such as Synthia and vKitti under clear weather conditions, Rain Synthia for rainy conditions, and Foggy vKitti for foggy conditions. On the other hand, a continuous distribution of values along the diagonal of the GLCM is observed for the Cityscapes, ACDC and Dawn datasets. Specifically, some values appear to be concentrated in the upper part of the diagonal, as in the GLCMs obtained from the real clear-weather datasets. The GLCMs of the real datasets under degraded conditions share similarities with those in clear conditions, but with more distributed values along the diagonal. This continuous distribution of values implies a strong presence of co-occurrences in real images. These observations show that the classification of real and synthetic images by the GLCM is achievable and that the use of a neural network is relevant.

Fig. 2. The average GLCM was computed on hundreds of synthetic images coming from various selected datasets on the R, G, B channels. We can classify the GLCMs into two main categories: Continuous Diagonal GLCM (CDG) and Discontinuous Diagonal GLCM (DDG). Some GLCMs computed with the datasets are shown. In the CDG, only GLCMs from real image datasets are shown, but GLCMs computed with the GTA V dataset are similar to these examples. In the DDG, GLCMs are computed from virtual image datasets only. The discontinuity comes from the reduced number of gray levels, and the images have cartoon-like textures.

Local Binary Pattern30 is a texture descriptor used to compute the local texture representation of a grayscale image. This is achieved by comparing each pixel with its surrounding neighbors to determine whether they have a greater or lower value than the center pixel. This comparison produces a binary output, where a pixel is assigned a value of 1 if it is greater than or equal to the center pixel, and 0 if it is lower. The LBP equation is defined as follows:

$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p$    (4)

where $P$ is the number of neighborhood pixels and $R$ the radius of the neighborhood, $g_c$ is the intensity value of the central pixel, $g_p$ is the intensity of the neighboring pixel with index $p$, and $s$ is a threshold function with $s(x) = 1$ if $x \geq 0$ and $s(x) = 0$ if $x < 0$.

Fig. 3 presents the visualization of the LBP computation on images coming from both the Kitti and vKitti datasets. A circular pattern with a radius of 3 is employed for the computation. A noticeable distinction can be observed between the LBPs of the images from the digital twin datasets Kitti (real) and vKitti (synthetic). Specifically, there is a clear disparity in the sky region and in the shading of the trees. This method appears to be a promising approach for pre-processing the input of a CNN with the aim of distinguishing the characteristics associated with real and synthetic images. By incorporating LBP as a pre-processing step, the CNN can potentially capture and learn features that distinguish real from synthetic images.
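A minimal sketch of this LBP pre-processing, assuming the circular pattern with radius 3 mentioned above and the common convention of P = 8R neighbors (an assumption, not the authors' stated configuration):

```python
# Sketch: LBP map with a circular neighborhood of radius 3, as in Fig. 3.
from skimage.feature import local_binary_pattern
from skimage.color import rgb2gray

def lbp_map(rgb_image, radius=3):
    P = 8 * radius                      # neighbors on the circle (assumed)
    gray = rgb2gray(rgb_image)          # LBP operates on grayscale
    return local_binary_pattern(gray, P, radius, method="default")
```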

Fig. 3. Representation of the LBP and the DCT coefficients applied to real and synthetic images from the Kitti23 and vKitti15 datasets. This figure has been made by the authors of the manuscript. https://www.cvlibs.net/datasets/kitti/, https://europe.naverlabs.com/proxy-virtual-worlds-vkitti-2/.

High-frequency feature extraction

Analyzing the high frequencies in both real and synthetic images can provide insights into the textures present in these types of images. The Discrete Cosine Transform (DCT) technique is employed for this purpose, enabling the transformation of the data from the spatial domain to the frequency domain. It shares similarities with the Fourier transform but is simpler and more computationally efficient. The DCT equation, which computes the $(i,j)$-th entry of the DCT of an image31, is defined as follows:

$D(i,j) = \frac{1}{\sqrt{2N}}\, C(i)\, C(j) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} p(x,y) \cos\!\left[ \frac{(2x+1)\, i \pi}{2N} \right] \cos\!\left[ \frac{(2y+1)\, j \pi}{2N} \right]$    (5)

$C(u) = \begin{cases} 1/\sqrt{2} & \text{if } u = 0 \\ 1 & \text{otherwise} \end{cases}$    (6)

where $p(x,y)$ is the $(x,y)$-th element of the image $p$ and $N$ is the size of the block on which the DCT is computed.
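The block-wise DCT can be sketched as follows; the 8x8 block size and the orthonormal normalization are illustrative assumptions:

```python
# Sketch: N x N block-wise 2D DCT (type II), matching Eqs. (5)-(6).
import numpy as np
from scipy.fftpack import dct

def blockwise_dct(gray, N=8):
    h, w = gray.shape
    out = np.zeros_like(gray, dtype=np.float64)
    for y in range(0, h - h % N, N):
        for x in range(0, w - w % N, N):
            block = gray[y:y+N, x:x+N].astype(np.float64)
            # Separable 2D DCT: 1D DCT along one axis, then the other.
            out[y:y+N, x:x+N] = dct(dct(block.T, norm='ortho').T, norm='ortho')
    return out
```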

Fig. 3 presents the representation of both the LBP and DCT maps applied to real and synthetic images from the Kitti and vKitti datasets. Although it is not possible to distinguish directly between the representations of the DCT coefficients, the map obtained from the LBP computation reveals differences between the synthetic and the real versions. In particular, variations in the shadows on the road and in the sky are especially noticeable.

Haralick metrics

The use of Haralick metrics provides an alternative to the learning-based models employed across various tasks today. The objective is to assess the degree of fidelity by computing these metrics from simulated images. Images consist of both tonal variations and textural elements, with one often prevailing over the other depending on the specific features of the image. The predominance of either can significantly influence the results. Therefore, comparing metrics on two completely different images may not be useful, as the variations in tonal and textural characteristics may be significantly different. When using the Haralick metrics for fidelity quantification, it may be beneficial to compute the metrics on small patches to obtain textures of isolated regions, such as roads or vegetation, to ensure a more meaningful and localised comparison. The Haralick metrics used in this study are: Angular Second Moment (ASM), Contrast, Correlation, Sum of Squares: Variance (Var), Inverse Difference Moment (IDM), Sum Average (SA), Sum Entropy (SE), Entropy (E), Difference Variance (DVar), Difference Entropy (DE), Information Measure of Correlation 1 (IMC1), Information Measure of Correlation 2 (IMC2). This statistical approach is relevant and interesting because it generates a set of metrics representing different aspects and features of a texture. The physical meaning of each metric with respect to textures is:

  • Energy or Angular Second Moment (ASM): This metric provides the sum of the squared probabilities in the co-occurrence matrix and measures the regularity or repetition in the texture. A high value indicates a uniform texture or a texture with repeated patterns (e.g., periodic or homogeneous structures). A texture with minimal variations between pixels results in a concentrated co-occurrence matrix, leading to higher energy.

  • Contrast: This metric reflects the intensity of transitions between neighboring pixels. A high value corresponds to a texture with significant variations in grayscale levels (e.g., sharp-edged patterns or rough textures).

  • Correlation: Provides a measure of the linear dependence between pixels in a region. A high correlation indicates that neighboring pixels tend to have similar grayscale levels, reflecting a regular and structured texture.

  • Variance (Var): Measures the grayscale dispersion around the mean. A high variance indicates more pronounced variations in the texture around the average grayscale levels (e.g., irregular patterns or frequent color shifts).

  • Homogeneity or Inverse Difference Moment (IDM): Provides a measure of local similarity in the texture. A homogeneous texture has similar grayscale levels between neighboring pixels, resulting in values concentrated near the diagonal of the co-occurrence matrix. High homogeneity corresponds to smooth textures. The normalized IDM provides a weighting of the contributions according to the squared distance between grayscale levels. This normalized metric also evaluates regularity but penalizes larger differences in grayscale levels more heavily. It captures the smoothness of the transitions between levels, emphasizing the continuity of the texture.

  • Sum Average (SA): Provides an estimate of the overall average intensity in the image, focusing on textures with frequent combinations of grayscale levels.

  • Entropy (E): This metric provides a measure of disorder or randomness in the co-occurrence matrix (complexity of the texture). A high entropy indicates that grayscale levels are distributed randomly and unpredictably (e.g., in a rough or complex texture). Low entropy indicates a simple and regular texture.

  • Sum Entropy (SE): Measures the entropy associated with the sums of levels. It evaluates the complexity or diversity of grayscale level combinations. A high value reflects a complex texture.

  • Difference Entropy (DE): Measures the complexity of variations in grayscale level differences. A high value reflects an unpredictable texture with complex variations in grayscale differences.

  • Sum Variance (SVar): Measures the dispersion of the distribution of the sums of grayscale levels in a texture. A high SVar value indicates a texture where the distribution of grayscale levels is wide, with varied sums $i+j$. A low value indicates a texture where the grayscale levels cluster around specific values, with little dispersion in their sums.

  • Difference Variance (DVar): Measures the dispersion of grayscale differences between neighboring pixels in an image. It evaluates how much the differences $|i-j|$ vary around their mean, reflecting the roughness or complexity of the texture. A high DVar means that the differences between neighboring grayscale levels are highly varied and spread over a wide range of values. This corresponds to a texture with significant variations, such as sharp edges or abrupt contrast changes. Rough or complex textures typically result in high DVar values. A low DVar value means that the differences between neighboring grayscale levels are relatively constant and close to the mean. This corresponds to a more uniform or smooth texture, where transitions between grayscale levels are less pronounced and more regular.

  • Information Measures of Correlation (IMC1 and IMC2): IMC1 evaluates the shared information between grayscale levels in both horizontal and vertical directions in a co-occurrence matrix. The higher the correlation between neighboring pixels, the closer this measure will be to zero. A high IMC1 value suggests low dependence between pixels, which is typical of complex and disordered textures. The IMC2 version is more sensitive to complex and nonlinear dependencies in the texture. A higher IMC2 value also indicates low dependence between grayscale levels.

In summary, these Haralick metrics provide information about texture attributes such as complexity, noise, repetitiveness and regularity, grayscale transitions, entropy (disorder), sensitivity to artefacts (shadows, occlusions, reflections, ...), the number of gray levels, scale and resolution, and directionality (correlation).
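As an illustration, a few of these metrics can be computed directly from a normalized GLCM; the remaining Haralick metrics follow the same pattern. This is a minimal sketch, not the exact implementation used in the study.

```python
# Sketch: ASM, Contrast and Entropy computed from a normalized GLCM.
import numpy as np

def haralick_subset(glcm):
    """glcm: normalized co-occurrence matrix (entries sum to 1)."""
    i, j = np.indices(glcm.shape)
    asm = np.sum(glcm ** 2)                      # Angular Second Moment
    contrast = np.sum(((i - j) ** 2) * glcm)     # Contrast
    nz = glcm[glcm > 0]
    entropy = -np.sum(nz * np.log2(nz))          # Entropy
    return {"ASM": asm, "Contrast": contrast, "Entropy": entropy}
```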

Architecture implementation and experiments

The first part of the experiment consists in proposing the implementation of an architecture built with three sub-networks (AI-based) that use image textures and frequencies as inputs (GLCM, LBP and DCT). These networks are used for binary classification tasks, determining the fidelity of images by producing a probabilistic result between 0 and 1. Values closer to 0 indicate a higher likelihood of unfaithful data, whereas values closer to 1 suggest images that are faithful to reality. These learning-based methods give us three fidelity scores. A fourth fidelity score is obtained with a statistic-based method through a linear combination of the contributions of the selected Haralick metrics. These scores are detailed in the two following sub-sections.

Learning-based method

Using the learning-based method, three sub-networks are proposed, namely XCross-GlNet, XDCT-Net and XLoPB-Net. These three networks are separately trained in a supervised manner on custom datasets, starting from the pre-trained Xception model32 and using a transfer learning technique. This technique was chosen to leverage the advantages offered by pre-trained models, specifically computational efficiency, powerful feature extraction, and accuracy improvement.

Fig. 4 gives an overview of the proposed training approach with the three networks and the three training stages dedicated to each weather condition. The aim is to estimate the fidelity scores of synthetic datasets containing images, predominantly of urban scenes, captured under clear weather conditions, as well as in fog and rain conditions. To build the models, the curriculum learning approach33 is used. This approach consists of progressively learning more complex tasks, aimed at refining the models' performance. In the proposed situation, the model is first trained on clear-weather images (first stage), then on foggy urban images with an initialization from the previous weights (second stage), and finally on rainy images with an initialization from the previous foggy weights (third stage). Indeed, learning with mixed data of varying complexity can hinder the training process, slow down learning, or even increase the risk of overfitting. Overfitting is critical and impacts the quality and performance of the models. In our context, this issue can occur because the degraded-weather datasets are smaller than the clear-weather datasets. With the proposed architecture, the overfitting issue is reduced. Furthermore, the pre-trained Xception model is used as a feature extractor for the three models, XCross-GlNet, XLoPB-Net and XDCT-Net. This model employs depthwise separable convolutions, enhancing its computational efficiency compared to other architectures such as VGG and ResNet. Moreover, preliminary tests conducted with VGG-19 and ResNet-50 showed lower accuracy. For the learning-based methods, each dataset is split into 3 subsets for training, validation and testing. For clear weather conditions, there are 20,572 images in the training set, 6,755 in the validation set and 1,000 in the test set. For foggy weather conditions, there are 7,169 images in the training set, 1,283 in the validation set and 200 in the test set. For rainy weather conditions, there are 4,235 images in the training set, 1,265 in the validation set and 200 in the test set.

Fig. 4. Diagram of the training process.

The models have been trained using the Keras/TensorFlow frameworks, employing the Adam optimizer with a learning rate of 0.001 and binary cross-entropy as the loss function. The batch size is set to 32, training starts at 40 epochs, and an early stopping function is used to reduce the risk of overfitting. The weights of the foggy and rainy models are initialized with those of the previous models learned on clear-weather images (Fig. 4). By fine-tuning models learned from clear-weather urban images, the models already have knowledge about the urban scenes studied here.
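A minimal sketch of this transfer-learning setup with the stated hyperparameters; the input resolution, classification head, and early-stopping patience are illustrative assumptions, not the authors' exact configuration:

```python
# Sketch: Xception as a frozen feature extractor with a binary fidelity head.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

base = Xception(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(299, 299, 3))
base.trainable = False                      # transfer learning: freeze the backbone

model = models.Sequential([
    base,
    layers.Dense(1, activation="sigmoid"),  # fidelity probability in [0, 1]
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=40, batch_size=32,
#           callbacks=[early_stop])
```

For the second and third curriculum stages, the same model would be re-trained after loading the weights of the previous stage (e.g. via `model.load_weights(...)`).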

Statistic-based method

In this section, a comparison in terms of Haralick metrics is made between synthetic and real datasets acquired in diverse weather conditions. This analysis is applied to 100,000 fixed-size image patches. The hue channel of the HSV color space was used for this analysis, as this color space allows better discrimination of image types using Haralick metrics. A min/max normalization is applied to all metrics to ensure that the results are within the range of 0–1. After computing these metrics, a Principal Component Analysis (PCA) is applied. This step is essential, as the Haralick metrics alone may not offer adequate information to determine which metric accurately represents the characteristics of real or synthetic datasets. PCA is generally used to reduce data dimensionality, but it is also an effective tool for analysis and interpretation. This approach is employed to better understand the individual contribution of each metric to the overall information, and the links between the various metrics and their contributions to each principal component (PC). As the first two components contain over 50% of the data's information, we focus on the first two PCs for the dataset analysis. Then, we compute the contribution of each metric to the principal components $PC_1$ and $PC_2$ with:

$ctr_k(i) = \frac{\left(\sqrt{\lambda_k}\, u_{ki}\right)^2}{\lambda_k} = u_{ki}^2$    (7)

where $k$ is the PC index, $i$ is the metric index, $\lambda_k$ is the eigenvalue associated with the $k$-th PC, $u_{ki}$ is the component of the vector $u_k$ for the $i$-th metric, and $u_k$ is the $k$-th eigenvector.
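A minimal sketch of this contribution analysis with scikit-learn, assuming the squared-loading convention of the reconstructed Eq. (7):

```python
# Sketch: contribution of each Haralick metric to the first two PCs.
import numpy as np
from sklearn.decomposition import PCA

def metric_contributions(X, n_components=2):
    """X: (n_patches, n_metrics) matrix of min/max-normalized Haralick metrics."""
    pca = PCA(n_components=n_components).fit(X)
    U = pca.components_          # rows are the unit eigenvectors u_k
    ctr = U ** 2                 # ctr[k, i]: contribution of metric i to PC k
    return ctr, pca.explained_variance_  # eigenvalues lambda_k
```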

The results of this study concerning the contributions to $PC_1$ for the datasets acquired in clear and degraded (fog and rain) conditions are summarized in bar charts (see Fig. 5). The blue dotted line indicates the value at which a contribution becomes significant; this value is set to 0.10. The contributions to $PC_2$ are not shown here, but this study has been done in3 and showed that they do not allow discriminating the different datasets. Specifically, each dataset has a contribution value of around 0.15 for the entropy (E), sum entropy (SE) and difference entropy (DE) metrics.

Fig. 5. Bar charts representing the contributions of the Haralick metrics to $PC_1$ for the three weather conditions (left: computer-generated datasets, right: real datasets).

The right bar chart illustrates variations in the contributions to $PC_1$ of the Haralick metrics across different weather conditions for real datasets: the contributions of the SVar and Var metrics increase in rainy conditions, while the contribution of the IMC2 metric increases in foggy conditions. Additionally, the Contrast metric shows increased contributions in both foggy and rainy conditions.

The left bar chart presents some variations in the contributions to $PC_1$ across weather conditions for synthetic datasets: in foggy conditions, the contribution of the Correlation metric decreases while those of the SVar and Var metrics increase.

Furthermore, the bar charts illustrate the distinctive metrics of both synthetic and real datasets. For the three weather conditions, metrics such as Correlation and IMC1 are predominant in the left graph, dedicated to synthetic images, while the Var and SVar metrics appear to be more indicative of real images. These metrics are selected to create a score. The contributions to $PC_1$ of the Correlation and IMC1 metrics, which are representative of synthetic datasets, are included in the fidelity score as penalties. This allows placing greater emphasis on the metrics that are representative of real datasets. The sub-score HaMeC (Haralick Metrics Combination) is defined by the following equation:

$HaMeC = \frac{1}{4}\left[ C_{Var} + C_{SVar} + (1 - C_{Corr}) + (1 - C_{IMC1}) \right]$    (8)

where $C_{Var}$, $C_{SVar}$, $C_{Corr}$ and $C_{IMC1}$ are respectively the contributions to $PC_1$ of the Var, SVar, Correlation and IMC1 metrics, and $\lambda_1$ is the eigenvalue associated with the first PC. Equation (8) uses an arithmetic average of the contributions of the selected metrics. The weightings of the individual metrics are distributed equally, as none is predominant. It considers the strongest contributions to $PC_1$ for both synthetic and real datasets.
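A minimal sketch of this combination, under the reconstructed form of Eq. (8) (equal weights, with Correlation and IMC1 acting as penalties); the exact closed form should be treated as an assumption:

```python
# Sketch: HaMeC sub-score from the four selected PC1 contributions.
def hamec(c_var, c_svar, c_corr, c_imc1):
    """Contributions to PC1 of Var, SVar, Correlation and IMC1, all in [0, 1].
    Returns a percentage score; Corr and IMC1 penalize the result."""
    return 100.0 * (c_var + c_svar + (1.0 - c_corr) + (1.0 - c_imc1)) / 4.0
```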

A synthetic image may be less faithful and realistic than a real image due to several factors that affect its texture and its overall modeling and generation. Haralick metrics like Entropy, Variance, Correlation, and Difference Variance can provide some insight into the reasons for this fidelity gap:

  • Lack of Variability and Complexity (SVar, DVar, and Entropy): Real images have complex textures with significant variability in the differences and sums of grayscale levels, which is reflected in high SVar and DVar values. Real textures exhibit unpredictable grayscale transitions and large local variations, giving a more natural and dynamic appearance. In contrast, simulated images may lack this complexity or have limited variation in the differences and sums of grayscale levels, producing simpler or more artificial textures (use of procedural, seamless textures with limited resolution to avoid overloading GPU resources). This can lead to low SVar and low DVar, making the image less realistic.

  • Low Entropy: Real images tend to have grayscale levels distributed randomly and unpredictably, with high entropy values. This characteristic reflects the complexity and unpredictability of natural textures. Synthetic images, however, may be generated with more homogeneous or less complex textures, resulting in low entropy values. This lacks the diversity necessary to simulate realistic and complex textures, making the image less faithful to reality. Moreover, light generation, shadow computation, and simplified, low-polygon 3D modeling of the environment (straight walls, corners, ...) strongly impact the entropy value.

  • Correlation and Homogeneity: In real images, neighboring grayscale levels often have a high correlation and follow regular or structured patterns. Correlation and homogeneity between neighboring pixels are typical of natural textures. Simulated images, on the other hand, may have weaker correlation or less regular artificial patterns. This can make synthetic images less realistic as they lack the continuity or fluidity of natural textures.

  • Grayscale Transitions (Difference Variance): Real images can exhibit marked variations in the differences between neighboring grayscale levels (resulting in high Difference Variance), creating texture effects such as sharp edges or abrupt contrasts, typical of natural environments. Synthetic images may have more homogeneous and less variable grayscale differences (with low DVar), resulting in a flatter or less detailed texture, which diminishes their realism.

  • Hue Constraints: Real images often feature a wide range of hues that are naturally balanced across various light conditions and environmental factors, with smooth transitions between colors. The variation in hues across different surfaces adds richness to the image, which is difficult to replicate accurately in simulations. Simulated images, however, may not capture the same depth of hue transitions, resulting in colors that appear more uniform or less natural. This limits the fidelity of the synthetic image, as the subtle shifts in hue that occur in real-world settings are often absent.

  • Contrast Constraints: Real images exhibit a broad contrast range, from deep shadows to bright highlights, depending on light sources, surface types, and scene composition. The contrast between light and dark areas helps to define textures and adds a sense of depth to the scene. Synthetic images may suffer from limited contrast, with areas of the image appearing flat or lacking sufficient differentiation between light and dark regions. This reduced contrast can make the texture appear artificial, and the scene can lack the visual depth and realism typical of real-world environments.

  • Noise and Blur Constraints: Real images typically contain a certain amount of noise due to imperfections in the camera sensor, lighting conditions, or environmental factors. This noise can be grainy, like film grain or sensor noise, which adds a layer of realism to the image. Simulated images may be generated with little or no noise, producing an overly smooth or perfect appearance. The absence of noise can make the image feel sterile or unnatural, as it lacks the imperfections and unpredictability that occur in real-world images.

In future work, the study of Haralick metrics should be useful to identify the weaknesses of a set of synthetic images and the parameters to modify in order to increase the level of fidelity. It is also clear that using these metrics in GAN approaches could provide an efficient way to increase the fidelity level of synthetic images until they reach the same scores as real images.

Experimental results

Several clues need to be considered when quantifying the fidelity of images due to the complexity of real scenes. Therefore, a set of scores including the models and the selected Haralick metrics is proposed, providing a more comprehensive assessment of fidelity. Specifically, this section presents the fidelity scores obtained from the learning-based and statistic-based methods for different datasets in three weather conditions: clear, fog and rain. The fidelity scores generated from the models (GLCM, DCT, LBP) correspond to the prediction functions obtained through the Keras/TensorFlow frameworks.

Table 2 presents the fidelity scores of synthetic and real datasets employing the different proposed methods. While the scores obtained from the real datasets (right part of the table) provide valuable information regarding result consistency, the primary objective is to assess the level of fidelity of the synthetic datasets. The scores obtained for all the datasets reveal a substantial disparity between the synthetic and the real data. As expected, the evaluated synthetic datasets exhibit relatively low fidelity, mainly due to the mechanisms used in the rendering process to generate the synthetic images.

Table 2. Fidelity score predictions/accuracy (%) of XCross-GlNet, XDCT-Net and XLoPB-Net on synthetic and real test sets (1000 images).

| Pred/Acc | Kitti | City | ONCE | NuSC | GTA | vKitti | KittiC | Synthia |
| GLCM | 99.6/100 | 96.5/97.7 | 69.9/70.6 | 96.2/97.9 | 9.9/90.4 | 10.2/93.81 | 0.31/100 | 14.45/85.7 |
| (std) | (0.01) | (0.13) | (0.33) | (0.02) | (0.26) | (0.20) | (0.02) | (0.31) |
| LBP | 73.5/79.3 | 84.7/91.7 | 80.8/87.6 | 82.3/90.5 | 23.7/77.4 | 09.5/94.8 | 07.5/98.2 | 36.0/65.3 |
| (std) | (0.27) | (0.20) | (0.23) | (0.18) | (0.35) | (0.16) | (0.12) | (0.38) |
| DCT | 83.7/89.9 | 92.7/97.9 | 45.51/41.9 | 88.8/94.7 | 11.9/90.9 | 03.9/99.8 | 06.6/98.00 | 13.2/87.4 |
| (std) | (0.34) | (0.12) | (0.31) | (0.12) | (0.35) | (0.08) | (0.13) | (0.24) |
| HaMeC | 53.8 | 66.1 | 70.1 | 73.1 | 35.7 | 33.9 | 38.0 | 40.1 |

Table 3 presents the fidelity scores of the foggy datasets. The Dawn dataset is the real one, Foggy Cityscapes is built by adding synthetic fog to real Cityscapes images, and Foggy vKitti is completely computer-generated. The results clearly highlight the gap between real and synthetic datasets. It is also relevant to note that merging real and virtual data in the same dataset (Foggy City) will probably lead to low scores.

Table 3. Fidelity score predictions/accuracy (%) of XCross-GlNet, XDCT-Net and XLoPB-Net on synthetic (Foggy vKitti), half-real (Foggy City) and real (Dawn) foggy test sets (200 images).

| Pred/Acc | Dawn | Foggy City | Foggy vKitti |
| GLCM | 95.3/97.5 | 11.04/93.7 | 1.4/99.7 |
| (std) | (0.18) | (0.18) | (0.01) |
| LBP | 92.7/93.5 | 4.0/97.60 | 10.4/94.2 |
| (std) | (0.20) | (0.08) | (0.20) |
| DCT | 78.3/80.9 | 4.1/98.10 | 6.2/94.2 |
| (std) | (0.31) | (0.19) | (0.01) |
| HaMeC | 86.2 | 84.1 | 38.9 |

Table 4 presents the fidelity scores for the rainy datasets. RID and ACDC represent real datasets, while the Rain City dataset is built by adding synthetic rain to real Cityscapes images, and the Synthia dataset is computer-generated. Fig. 6 illustrates sample images from the selected datasets used in this study, in various weather conditions for both synthetic and real-world cases. Images augmented with synthetic weather conditions are shown (Rain City and Foggy City datasets), highlighting the visual differences with fully synthetic rain and fog images (Foggy vKitti and Rain Synthia datasets). The corresponding fidelity scores for each dataset are also provided.

Table 4. Fidelity score predictions/accuracy (%) of XCross-GlNet, XDCT-Net and XLoPB-Net on synthetic (Rain Synthia), half-real (Rain City) and real (RID, ACDC) rainy test sets (200 images).

| Pred/Acc | RID | ACDC | Rain City | Rain Synthia |
| GLCM | 68.8/72.9 | 98.8/100 | 3.1/98.0 | 1.2/100 |
| (std) | (0.38) | (0.03) | (0.11) | (0.02) |
| LBP | 74.3/77.3 | 93.9/94.6 | 1.0/100 | 5.4/97.5 |
| (std) | (0.34) | (0.19) | (0.04) | (0.07) |
| DCT | 76.3/77.3 | 99.8/100 | 5.0/100 | 2.2/99.3 |
| (std) | (0.37) | (0.02) | (0.11) | (0.11) |
| HaMeC | 81.9 | 79.2 | 39.6 | 09.4 |

Fig. 6. Example of images from different datasets in different weather conditions with the associated fidelity scores.

Table 5 presents results that analyze detection metrics and fidelity scores for two scenes (447 images for S1 and 339 images for S18) in clear and foggy conditions from the vKitti v1.3.1 and vKitti v2 datasets. The YOLOv5s model is used to perform the detections. The comparison is particularly relevant, as vKitti v2 provides a more faithful rendering than its predecessor. This enhancement results from advanced post-processing techniques integrated into the Unity game engine. vKitti v2 generally shows better results, in terms of both detection and fidelity. However, for the GLCM metric applied to scene S1 in clear weather, the result is below expectations. Moreover, the results on the foggy datasets are mixed. In terms of detection metrics, vKitti v1 slightly outperforms v2 on recall and mAP50, although the differences remain modest. Regarding fidelity scores, vKitti v1 yields better results for GLCM and DCT, while the other two metrics show an advantage for vKitti v2. These variations may be attributed to the model's capacity to detect objects under degraded conditions, the specific fog modeling applied in each version, and the robustness of the metrics themselves when applied to foggy images.

Table 5. Detection metrics and fidelity scores for vKitti v1 and v2 on two scenes (S1 and S18). The best values are in bold in the original table.

| Metrics | vKitti v1 S1 | vKitti v2 S1 | vKitti v1 S18 | vKitti v2 S18 | vKitti v1 S18 (fog) | vKitti v2 S18 (fog) |
| Precision | 0.699 | 0.766 | 0.969 | 0.969 | 0.863 | 0.906 |
| Recall | 0.488 | 0.471 | 0.416 | 0.436 | 0.469 | 0.452 |
| mAP50 | 0.516 | 0.552 | 0.486 | 0.494 | 0.571 | 0.558 |
| mAP50-95 | 0.238 | 0.264 | 0.244 | 0.286 | 0.279 | 0.295 |
| GLCM | 33.82 | 1.20 | 1.19 | 48.44 | 3.78 | 2.11 |
| LBP | 12.86 | 12.95 | 33.60 | 31.91 | 0.10 | 0.41 |
| DCT | 5.13 | 5.51 | 0.22 | 1.61 | 4.57 | 1.93 |
| HaMeC | 48.34 | 57.72 | 42.82 | 80.71 | 64.69 | 79.38 |

Overall, the results show that real data obtain higher scores than synthetic or half-synthetic data such as the Rain City and Foggy City datasets. However, these results are difficult to exploit directly and can be conflicting. It is necessary to use a method that efficiently merges these different scores, each focused on specific features and aspects of the images, and that studies the uncertainties and possible conflicts between these scores. In order to build a fusion operator, only theories modeling and managing reliability, uncertainty, and conflict can be applied. Among these theories, we can quote probability theory, belief theory, fuzzy logic, possibility theory, and interval theory. After a study of these different theories, the most efficient and appropriate one is belief theory. The next section presents the methodology applied using this theory.

Multi criteria combination with belief theory

Belief Theory for data combination

The belief function formalism was proposed by Dempster34, with his generalization of Bayesian inference, and by Shafer in 197635. It was then extended to the Transferable Belief Model (TBM) proposed by Smets36,37. This theory is mainly used to address problems of data association and data fusion, combining known information (hypotheses modeling knowledge about the workspace) with observations (environmental measurements). The fusion of information from multiple sources improves reliability and reduces the influence of faulty data. For example, in a multi-target tracking problem, hypotheses represent the tracks, and observations correspond to targets1. This theory is particularly useful because it accounts for data reliability, models both known and unknown information, and manages incomplete, heterogeneous, and potentially asynchronous data. Furthermore, belief theory provides a more general formalism than probability theory for modeling and managing uncertainty. It also allows dynamic handling of new hypotheses (valuable in multi-target tracking) and of conflicts when needed.

From a general point of view, belief theory calculates the veracity of a proposition linking one or more hypotheses to an observation. This veracity is modeled using belief masses. The belief mass is defined as a mass of elementary probability on an event $A$:

$m : 2^{\Theta} \rightarrow [0, 1]$    (9)

This mass is calculated for each $A$ of the referential, defined as the power-set $2^{\Theta}$ of $\Theta$, which includes all the admissible hypotheses. $\Theta$, called the frame of discernment, involves all hypotheses and must also be exclusive, i.e. $X_i \cap X_j = \emptyset$ for $i \neq j$ ($\Theta$ is similar to the sample space in probability theory).

The power-set $2^{\Theta}$ includes all possible propositions or questions that can be formulated. These propositions can be simple (one observation and one hypothesis) or complex (one observation and multiple hypotheses). In probability theory, the reference framework corresponds to the event space or the power set of the sample space. For instance, for $\Theta$ modeling the current state of a tracking algorithm with 4 known tracks $X_1, \ldots, X_4$, the power-set $2^{\Theta}$ will be

$2^{\Theta} = \{\emptyset,\, X_1,\, X_2,\, X_3,\, X_4,\, X_1 \cup X_2,\, \ldots,\, \Theta\}$    (10)

where $\Theta$ represents the total ignorance, 4 being the number of tracks at the current instant $k$, and $\emptyset$ carries the mass on conflict.

The set of masses over the propositions obtained after combination forms the focal elements and the mass set. Like probabilities, the sum of these masses equals 1, and the mass assigned to the empty set $\emptyset$ must be zero; otherwise, a conflict can only be generated after the combination. Typically, the conjunctive Dempster combination rule (orthogonal sum) is used to combine masses. This rule aggregates $n$ basic belief assignments (BBAs), calculated with basic belief functions (BBFs), into a single mass set by forming conjunctive sets of focal elements across sources, modeling the relationship between an observation and a hypothesis. The BBAs verify the following assumption:

$\sum_{A \subseteq \Theta} m(A) = 1$    (11)

A subset $A \subseteq \Theta$ such that $m(A) > 0$ is called a focal element. When $m(\emptyset) = 0$, the BBA is said to be normal. In the majority of problems, the initial BBAs (also called specialized sources) are built with a set of 3 specific masses:

  • $m_i(X_j)$: The observation $i$ matches the hypothesis $X_j$.

  • $m_i(\bar{X_j})$: The observation $i$ does not match the hypothesis $X_j$.

  • $m_i(\Theta)$: The source is uncertain whether the observation $i$ matches $X_j$.

These mass sets are systematically and pragmatically generated, with a specialized triplet of BBAs constructed for each observation-hypothesis pair. Fig. 7 summarizes the main stages of the Belief Theory.

Fig. 7. Streamlined diagram of Belief Theory.

When several sources of information give their opinion on a given situation, numerous combination rules can be used; among them, we can quote the Conjunctive Rule of Combination (CRC) and the Dempster-Shafer (DS) rule.
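Both rules can be sketched for the triplet mass sets used later in this paper (focal elements H, its complement, and Theta); this is a minimal illustration under those assumptions, not the authors' implementation:

```python
# Sketch: conjunctive combination of two mass triplets on {H, notH, Theta}.
# The CRC keeps the conflict mass on the empty set; the Dempster-Shafer
# rule renormalizes it away.
def crc(m1, m2):
    """m = {'H': .., 'notH': .., 'Theta': ..}; returns masses plus conflict."""
    H     = m1['H']*m2['H'] + m1['H']*m2['Theta'] + m1['Theta']*m2['H']
    notH  = m1['notH']*m2['notH'] + m1['notH']*m2['Theta'] + m1['Theta']*m2['notH']
    theta = m1['Theta']*m2['Theta']
    conflict = m1['H']*m2['notH'] + m1['notH']*m2['H']   # H and notH are disjoint
    return {'H': H, 'notH': notH, 'Theta': theta, 'conflict': conflict}

def dempster_shafer(m1, m2):
    m = crc(m1, m2)
    k = m.pop('conflict')
    return {a: v / (1.0 - k) for a, v in m.items()}      # renormalization
```

Several sources can then be merged pairwise, e.g. `functools.reduce(dempster_shafer, [m_glcm, m_lbp, m_dct, m_hamec])`; the CRC variant keeps the conflict mass explicit instead of redistributing it.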

Depending on the problem to address and the available knowledge about the world, various frameworks can be used, each impacting how data is modeled, exploited, and interpreted. There are three main frameworks:

  • Open World: The frame of discernment is non-exhaustive, allowing for the existence of new hypotheses. In this context, any mass on conflict generates a new hypothesis. However, distinguishing between a new hypothesis and an actual conflict is not possible in this framework.

  • Closed World: The world is assumed to be completely known. Here, conflict masses indicate genuine conflicts, likely signaling an issue in the problem’s modeling. Such conflicts are often redistributed over the focal elements using a renormalization operator.

  • Extended Open World: The knowledge of the world is non-exhaustive, but an additional hypothesis $\ast$ is introduced into the frame of discernment to represent the unknown parts of the world (the rest of the world) and to obtain an exhaustive frame of discernment. The combination generates a mass on $\ast$ for new hypotheses, while the conflict mass represents a genuine conflict. This framework is well suited for multi-target tracking, where the dynamic appearance and disappearance of hypotheses (tracks) must be managed.

The management of the conflict is an important issue of this theory. A major problem of this normalization has been pointed out by Zadeh38 and largely discussed in the literature39. In fact, conflict is a kind of information in itself, and the origin of this conflict becomes an issue (exhaustivity of the frame of discernment, source reliability, etc.).

After applying the fusion operator combining the BBAs, the final mass set consists of the masses corresponding to the focal elements of the frame of discernment. In order to simplify the interpretation of the result, this mass set is limited to 4 types of masses:

  • $m_i(X_j(.))$: The singleton affirmative masses about the relationship between observation $i$ and hypothesis $X_j$, after combination of all BBAs for all hypotheses and observation $i$.

  • $m_i(\bar{X}(.))$: The mass on negation (observation $i$ has no relation with the hypotheses $X_j$).

  • $m_i(\Theta(.))$: The mass on ignorance, which sums the masses on the different complex propositions modeling the uncertainty (from level 2 to $n$).

  • $m_i(\emptyset)$: This mass models the level of conflict between the BBAs.

The notation $(.)$ in the final set of masses indicates that all masses associated with the hypotheses within the frame of discernment have been aggregated. In this context, $m_i(X_j(.))$ denotes the mass assigned to the relationship between observation or data source $i$ and hypothesis $j$, resulting from the combination of the BBAs between observation $i$ and all hypotheses from 1 to $n$.

In our problem, 4 observations will be measured (the 4 scores) and will provide 4 BBAs. The frame of discernment is very simple because only one hypothesis has to be used, the hypothesis of validity: $\Theta = \{H\}$, which is exhaustive. This means that only the closed world is efficient and adapted to our problem. In this context, we do not apply multi-target or multi-object combination rules (there is only one hypothesis), but a multi-criteria combination rule as presented in40. The next section details how to generate the BBAs and which combination rules need to be applied.

Mass-set generation

In order to generate the BBAs, 2 stages are used. The first one consists in applying a similarity or dissimilarity metric in order to assess the level of correlation between two sets of data (observations and hypotheses). The second stage injects the computed similarity into BBFs in order to generate the triplets of BBAs. If the observation is made of several variables, then the BBAs can be obtained by using similarity operators adapted to each variable, BBFs, and multi-criteria combination rules, as presented in40. The result of such processing can provide an advanced belief similarity operator. However, in our context, the objective is not to generate a similarity index with belief theory, as mentioned in40, but to adapt this methodology in order to combine a set of scores $s_k$ (the observations) obtained from the 4 previous pre-processing functions. Fig. 8 presents an overview of the multi-criteria combination methodology that we propose for synthetic multi-score fidelity assessment. This methodology is organized in 3 layers.

Fig. 8. Overview of the multi-criteria combination method for the assessment of a global fidelity score involving uncertainty and potential conflict detection. The input image comes from the vKitti dataset15. This figure has been made by the authors of the manuscript. https://europe.naverlabs.com/proxy-virtual-worlds-vkitti-2/.

In the first layer, each score is calculated in the image space and takes a value in [0, 100]. Before entering the second layer, each score is translated into the interval [0, 1]. The objective is to model the set of scores $s_k$ as similarity indices, making the interface between the initial score space and the symbolic belief space.

The second layer is the interface between the image space and the symbolic belief space; it converts the scores into BBAs by using a set of Basic Belief Functions (BBFs). The BBA, denoted $m_k$, has the following form:

$m_k = \{m_k(H),\; m_k(\bar{H}),\; m_k(\Theta)\}$    (12)

The different mass-sets have to be calculated for each criterion, thanks to a mass-generative function (BBF), in order to combine them with a multi-criteria combination operator. A mass-generative function must respect the following form: $BBF : s_k \mapsto m_k$, where $s_k \in [0, 1]$ corresponds to the similarity index coming from the score $k$, and $m_k$ is a set of masses which has to satisfy the following requirement:

$$m_k(H) + m_k(\bar{H}) + m_k(\Omega) = 1, \qquad m_k(\cdot) \in [0, 1] \qquad (13)$$
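
For illustration, the mass triplet and the requirement of Eq. (13) can be captured by a small data structure; this is only a sketch, and the class and field names are ours, not the authors':

```python
from dataclasses import dataclass

@dataclass
class BBA:
    """Basic belief assignment on the frame {H, ~H}: masses on H, ~H and Omega."""
    h: float       # m(H): the image is faithful to reality
    not_h: float   # m(~H): the image is not faithful to reality
    omega: float   # m(Omega): ignorance / uncertainty

    def is_valid(self, tol: float = 1e-9) -> bool:
        """Check Eq. (13): all masses lie in [0, 1] and sum to 1."""
        masses = (self.h, self.not_h, self.omega)
        return all(0.0 <= m <= 1.0 for m in masses) and abs(sum(masses) - 1.0) <= tol
```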

For each score (information source about fidelity), the value is bounded to the interval [0, 1], the maximum score being equal to 1 (the image is considered similar to actual data). A score $s_k$, coming from one of the four proposed methods, cannot at the same time assert that an evaluated image is and is not faithful to reality. However, we can obtain a mass on $H$ and $\Omega$, or on $\bar{H}$ and $\Omega$. This means that the current score quantifies a positive (or negative) opinion about the fidelity of the data, but with a doubt (level of uncertainty), and that a BBA cannot be generated with both focal elements $H$ and $\bar{H}$. In fact, considering that the sources of information are reliable, the physical meaning leads to defining the BBA by limiting the mass of conflict as follows:

$$m_k(H) = \begin{cases} f(s_k) & \text{if } s_k \geq \tau \\ 0 & \text{otherwise} \end{cases}, \qquad m_k(\bar{H}) = \begin{cases} g(s_k) & \text{if } s_k < \tau \\ 0 & \text{otherwise} \end{cases}, \qquad m_k(\Omega) = 1 - m_k(H) - m_k(\bar{H}) \qquad (14)$$

where $f$ and $g$ are BBFs and $\tau$ is a threshold sharing the function space. They were constructed to ensure that the $H$ and $\bar{H}$ hypotheses are strictly non-overlapping, since one is the complement of the other. $H$ represents fidelity to reality, while $\bar{H}$ does not. The coefficient $\alpha_k$ can be seen as a specific coefficient discounting each source, as proposed by Appriou41, with the definition of specialized sources. In our context, the coefficient $\alpha_k$ can be interpreted as a reliability coefficient, used to measure the efficiency of the fusion with respect to the method providing a given score $s_k$ (see Fig. 9a). Therefore, $\alpha_k$ is considered constant over time and for any image processed by a method (GLCM, DCT, LBP, and HaMeC). However, recent work at Univ. Eiffel has developed an adaptive methodology allowing these BBF parameters and functions to be optimized using parametric functions42. This work will be applied in a future version of the belief combination of fidelity scores. In a generic framework, both functions $f$ and $g$ must be chosen or built as bijective functions and must satisfy the following constraints:

$$f(\tau) = 0, \quad f(1) = \alpha_k, \quad f \text{ strictly increasing on } [\tau, 1] \qquad (15)$$
$$g(\tau) = 0, \quad g(0) = \alpha_k, \quad g \text{ strictly decreasing on } [0, \tau] \qquad (16)$$

Fig. 9a and b illustrate a special case of these BBFs where $f$ and $g$ are linear. If $\tau = 0.5$, the system is considered neutral. If $\tau$ is less than 0.5, the system is considered optimistic, because the model tends to assign a mass to $H$ even if the similarity $s_k$ is weak. If $\tau$ is greater than 0.5, the system is considered pessimistic, because the similarity must be high in order to assign a mass to $H$.
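
Continuing the sketch above, a BBF of the form of Eq. (14) with linear $f$ and $g$ (linearity is our assumption; any bijective pair satisfying Eqs. (15) and (16) would do) could be written as:

```python
def bbf(s: float, tau: float = 0.5, alpha: float = 1.0) -> BBA:
    """Generate a BBA from a similarity index s in [0, 1].

    tau   -- threshold sharing the function space (assumed 0 < tau < 1)
    alpha -- reliability coefficient bounding the mass assigned to H or ~H

    Linear f and g are an assumption of this sketch; the paper only requires
    bijective functions with strictly non-overlapping support for H and ~H."""
    if s >= tau:
        m_h = alpha * (s - tau) / (1.0 - tau)   # f: 0 at tau, alpha at 1
        m_not_h = 0.0
    else:
        m_h = 0.0
        m_not_h = alpha * (tau - s) / tau       # g: alpha at 0, 0 at tau
    return BBA(h=m_h, not_h=m_not_h, omega=1.0 - m_h - m_not_h)
```

With $\tau > 0.5$, the support of $g$ widens, so a larger part of the similarity range feeds mass to $\bar{H}$; this is exactly the pessimistic behaviour described above.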

Fig. 9.

Generation of BBA with BBF. (a) Variation of $\alpha$ with $\tau$ fixed. (b) $\alpha$ fixed with variation of $\tau$.

Multi-criteria combination rules

Finally, the third layer combines the BBAs in order to obtain a final set of masses allowing assessment of the level of fidelity, together with the level of uncertainty and the level of conflict. Moreover, for layers 2 and 3, the frame of discernment is built with only one hypothesis $H$, representing the assertion "is faithful to the reality produced by a camera".

The specialized sources are expressed on the common triplet of masses represented by

$$\{\, m_k(H),\; m_k(\bar{H}),\; m_k(\Omega) \,\}, \quad k \in \{1, \ldots, 4\} \qquad (17)$$

which characterizes an opinion on the level of fidelity of a synthetic or real image, depending on the specific processing method. In the closed world, the combination of the first 2 BBAs gives the following equation:

$$\begin{aligned} m_{12}(H) &= m_1(H)\, m_2(H) + m_1(H)\, m_2(\Omega) + m_1(\Omega)\, m_2(H) \\ m_{12}(\bar{H}) &= m_1(\bar{H})\, m_2(\bar{H}) + m_1(\bar{H})\, m_2(\Omega) + m_1(\Omega)\, m_2(\bar{H}) \\ m_{12}(\Omega) &= m_1(\Omega)\, m_2(\Omega) \\ m_{12}(\emptyset) &= m_1(H)\, m_2(\bar{H}) + m_1(\bar{H})\, m_2(H) \end{aligned} \qquad (18)$$

The combination of the third BBA provides the following set of equations:

$$\begin{aligned} m_{123}(H) &= m_{12}(H)\, m_3(H) + m_{12}(H)\, m_3(\Omega) + m_{12}(\Omega)\, m_3(H) \\ m_{123}(\bar{H}) &= m_{12}(\bar{H})\, m_3(\bar{H}) + m_{12}(\bar{H})\, m_3(\Omega) + m_{12}(\Omega)\, m_3(\bar{H}) \\ m_{123}(\Omega) &= m_{12}(\Omega)\, m_3(\Omega) \\ m_{123}(\emptyset) &= m_{12}(\emptyset) + m_{12}(H)\, m_3(\bar{H}) + m_{12}(\bar{H})\, m_3(H) \end{aligned} \qquad (19)$$

If the process is repeated n times, the same type of equation is obtained. This means it is possible to propose a set of generic equations for n sources. The generalized conjunctive combination rules40 are the following:

$$m_{1 \ldots n}(\Omega) = \prod_{i=1}^{n} m_i(\Omega) \qquad (20)$$
$$m_{1 \ldots n}(H) = \prod_{i=1}^{n} \big( m_i(H) + m_i(\Omega) \big) - \prod_{i=1}^{n} m_i(\Omega) \qquad (21)$$
$$m_{1 \ldots n}(\bar{H}) = \prod_{i=1}^{n} \big( m_i(\bar{H}) + m_i(\Omega) \big) - \prod_{i=1}^{n} m_i(\Omega) \qquad (22)$$
$$m_{1 \ldots n}(\emptyset) = 1 - m_{1 \ldots n}(H) - m_{1 \ldots n}(\bar{H}) - m_{1 \ldots n}(\Omega) \qquad (23)$$

with $i \in \{1, \ldots, n\}$ and $m_i(H) + m_i(\bar{H}) + m_i(\Omega) = 1$.
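
A minimal sketch of Eqs. (20) to (23), reusing the BBA structure from the earlier sketches (the function name is ours):

```python
from math import prod

def combine(bbas: list[BBA]) -> dict[str, float]:
    """Generalized conjunctive combination of n BBAs on {H, ~H} (Eqs. 20-23)."""
    m_omega = prod(b.omega for b in bbas)                       # Eq. (20)
    m_h = prod(b.h + b.omega for b in bbas) - m_omega           # Eq. (21)
    m_not_h = prod(b.not_h + b.omega for b in bbas) - m_omega   # Eq. (22)
    m_empty = 1.0 - m_h - m_not_h - m_omega                     # Eq. (23), conflict
    return {"H": m_h, "not_H": m_not_h, "Omega": m_omega, "conflict": m_empty}
```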

This set of combination rules is very interesting because it offers several metrics to interpret the combination, rather than just a single level of fidelity for a given piece of data. Indeed, these rules make it possible to detect the existence of a conflict ($m(\emptyset)$) between sources and criteria, to produce a level of dissimilarity ($m(\bar{H})$), and also to reveal the potential ambiguity between similarity and dissimilarity (fidelity and inconsistency). In addition, a specific rule quantifies the level of uncertainty ($m(\Omega)$), and therefore the level of ignorance carried by the scores and, ultimately, by the result of the combination. Moreover, by using the BBAs generated with the BBFs, it is possible to take into account the reliability of the sources. An unreliable source will result in a BBA with a higher mass on the uncertainty ($m(\Omega)$), which will be reflected in the result of the combination. This generalized rule set is also independent of the number of sources to be combined, which means that it is easy to add additional scores if needed. Moreover, in the case of the HaMeC score, 4 Haralick metrics (the most significant from the point of view of principal component analysis) are merged to obtain a synthetic statistical score. However, with the proposed methodology, it is possible to use the 4 metrics separately and generate initial mass triplets (BBAs) for each of them. Extending this reasoning, it is also possible to take into account all 14 Haralick metrics and weight the generation of BBAs by assigning each metric a reliability that reflects its level of importance in the PCA.

Implementation on datasets, results and analyzing

In this subsection, the multi-criteria combination method is applied to merge the fidelity scores generated by the 4 methods (GLCM, LBP, DCT and HaMeC). This analysis is conducted across the clear (Fig. 10), rainy and foggy datasets (Fig. 11). These figures display the graphs obtained from the multi-criteria combination and the generation of BBA with BBF. They help to establish a level of fidelity ($H$) or non-fidelity ($\bar{H}$), the level of uncertainty ($\Omega$), and the detection of conflict ($\emptyset$) between scores. Each criterion has been assigned reliability ($\alpha$) and threshold ($\tau$) values. The $\tau$ values are set here to be pessimistic, with $\tau = 0.6$: the model tends to allocate mass on $\bar{H}$, corresponding to the assertion "is not faithful to the reality produced by a camera". It must not be too optimistic, to avoid overestimating the fidelity of the synthetic images. On this basis, the compromise was to use a value slightly higher than the neutral value of 0.5. Currently, $\tau$ is fixed, but it will be optimized as part of future work. The reliabilities associated with the criteria obtained from the learning-based models correspond to the models' accuracy. The reliability assigned to the HaMeC criterion is set to 0.5, as it is impossible to obtain an accuracy comparable to the learning-based methods.
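
Purely as an illustration of this configuration (the four score values and the three accuracy-based reliabilities below are made-up placeholders, not values from the study; only $\tau = 0.6$ and the HaMeC reliability of 0.5 come from the text), the sketches above can be chained as follows:

```python
# Hypothetical similarity indices s_k and reliabilities alpha_k.
scores = {"GLCM": 0.32, "LBP": 0.45, "DCT": 0.28, "HaMeC": 0.71}
alphas = {"GLCM": 0.90, "LBP": 0.88, "DCT": 0.92, "HaMeC": 0.50}

bbas = [bbf(scores[k], tau=0.6, alpha=alphas[k]) for k in scores]
print(combine(bbas))  # masses on H, not_H, Omega and the conflict level
```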

Fig. 10.

Graphs resulting from the multi-criteria combination (bar plots) and the generation of BBA with BBF (radar charts). The results are obtained from the real (top row) and synthetic (second and third rows) clear weather datasets.

Fig. 11.

Graphs resulting from the multi-criteria combination (bar plots) and the generation of BBA with BBF (radar charts). The results are obtained from the foggy (top row) and rainy (bottom row) datasets.

Fig. 10 presents the graphs obtained from the multi-criteria combination and the generation of BBA with BBF for the clear weather datasets. A global analysis of the graphs reveals that real datasets exhibit a higher tendency towards $H$ (fidelity), whereas synthetic datasets have a strong tendency towards $\bar{H}$ (non-fidelity), with noticeable uncertainty for the Synthia datasets (12%). Only two real datasets are shown (Kitti and Cityscapes), as they are only used as references for the synthetic data. The red bar graphs, resulting from the generation of BBA, present the outcomes for the triplet of hypotheses $\{H, \bar{H}, \Omega\}$ for each criterion. This gives detailed results of the fidelity scores, with the levels of fidelity, non-fidelity and uncertainty.

In this figure, results from an enhanced GTA dataset, created by26 using a GAN-based image translation technique, have been included. This method improves the photorealism of the GTA dataset. Including it is relevant because this enhanced version is expected to be more faithful to reality. Following this hypothesis, the enhanced GTA version should present a reduced tendency towards $\bar{H}$ and a greater tendency towards $H$ compared to the GTA V dataset. For the latter: $m(H)$ = 0%, 0%, 0%, 0%; $m(\bar{H})$ = 77%, 33%, 78%, 1%; $m(\Omega)$ = 23%, 67%, 21%, 99%. For the enhanced GTA: $m(H)$ = 0%, 12%, 0%, 35%; $m(\bar{H})$ = 0%, 0%, 33%, 0%; $m(\Omega)$ = 99%, 67%, 21%, 65%. This indicates a weaker tendency towards $\bar{H}$ than the GTA V dataset, and improvements for $H$ and $\Omega$. These results further support the initial hypothesis that the enhanced GTA dataset improves fidelity.
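
As a purely illustrative sanity check, the per-criterion GTA V triplets reported above (rounded percentages, so one triplet does not sum exactly to 100%) can be fed to the combination sketch given earlier; the result reproduces the expected strong tendency towards $\bar{H}$:

```python
# Rounded per-criterion triplets (H, ~H, Omega) for GTA V, as reported above.
# Rounding leaves a small spurious residual in the conflict mass.
gta_v = [BBA(0.00, 0.77, 0.23),
         BBA(0.00, 0.33, 0.67),
         BBA(0.00, 0.78, 0.21),   # sums to 0.99 because of rounding
         BBA(0.00, 0.01, 0.99)]
print(combine(gta_v))
# -> approximately {'H': 0.0, 'not_H': 0.96, 'Omega': 0.03, 'conflict': 0.01}
```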

Fig. 11 shows the graphs resulting from the multi-criteria combination and the generation of BBA with BBF for the foggy (top row) and rainy (bottom row) weather datasets. Similarly to the clear weather scenario, synthetic and half-real datasets have a strong tendency towards $\bar{H}$. The foggy Cityscapes dataset, which is half real (fog has been added artificially), has a tendency towards $\bar{H}$ (58%) and presents some conflict (42%). Specifically, $m(H)$ = 0%, 0%, 0%, 42%; $m(\bar{H})$ = 77%, 95%, 95%, 0%; $m(\Omega)$ = 22%, 4%, 4%, 58%.

Conclusion and future works

In the development of next-generation automated mobility systems, the evaluation and validation processes have become essential in ensuring the reliability and performance of the components and functions that enable driving automation. These components increasingly rely on AI-based systems, which demand an extensive and time-consuming learning phase supported by large, diverse, and representative datasets. Acquiring such datasets in real-world scenarios presents significant challenges due to safety, cost, and the need for a broad range of environmental and situational variations. As a result, simulation methods are becoming a prevalent, relevant, and cost-effective solution for generating synthetic datasets.

However, when generating synthetic road images that encompass a wide variety of road configurations, traffic scenarios, and adverse weather conditions, a critical challenge remains: the absence of reliable and standardized methods for assessing the fidelity of these images. High-fidelity synthetic datasets are crucial, not only for AI model training but also for the subsequent validation and performance evaluation stages. Without rigorous fidelity evaluation, the risk arises that models trained on low-quality synthetic data may underperform or behave unpredictably when deployed in real-world environments.

In this work, we propose a comprehensive methodology to assess the fidelity of computer-generated images under various weather conditions using a diverse set of quantitative metrics. Among these metrics, we highlight the use of GLCM, LBP, DCT, and Haralick texture features, which are particularly effective in capturing the structural complexity and variability of textures in images. Specifically, we introduce the HaMeC score, a fusion of four significant Haralick metrics (Variance, Sum of Variance, Correlation, and IMC1), identified from PCA, which provides a robust measure of texture fidelity. This score allows for the identification of critical weaknesses in synthetic images by highlighting areas where key texture properties diverge from those of real images. Additionally, four objective fidelity scores are combined through a belief theory framework, which produces a final aggregated score offering four key insights: (1) the degree of fidelity of the dataset, (2) the degree of non-fidelity, (3) the level of uncertainty, and (4) the detection of conflict between local scores from individual metrics. An important advantage of our approach is that the generalized multi-criteria combination rules used in the fusion process are adaptable to varying numbers of sources and can incorporate source reliability, enhancing robustness.

Beyond fidelity assessment, the proposed method provides valuable insights into the specific parameters of synthetic images that require improvement. By analyzing weaknesses identified through the HaMeC score and other fidelity metrics, we can pinpoint key image characteristics such as contrast, hue, color balance, blur and noise levels, and the complexity of texture patterns. On this basis, the identification of these features could be used to improve the fidelity of computer-generated images by targeting the areas that lack realism. Moreover, these parameters can serve as direct inputs to generative models, such as Generative Adversarial Networks (GANs), with the aim of regenerating synthetic images that more closely resemble real-world images. The feedback loop enabled by this approach allows for iterative refinement of virtual datasets, improving the realism of textures, materials, and lighting conditions. These remarks and proposals will be the starting point for investigating the impact of different rendering pipelines and image generation methods, with the goal of optimizing virtual image generation and improving the overall quality of synthetic datasets used in automated mobility systems.

Furthermore, this process can facilitate the extension of existing datasets by generating new scenarios that were previously under-represented, thereby increasing dataset coverage. Incorporating a broader range of realistic scenarios, including rare and adverse conditions, will enhance the robustness and generalization capabilities of AI models trained on these datasets. In future work, we also aim to explore additional texture features, such as fractal-based metrics, which may further enrich the characterization of image complexity. Investigating feature-based vision transformers to capture the fine-grained details of synthetic images could also help to improve the performance of our proposed models. Additionally, expanding our analysis to encompass different scene types, such as rural roads, highways, and urban environments, will provide a more comprehensive understanding of fidelity across various contexts.

Author contributions

A.D., S.I. and D.G. conceived the experiments, A.D. conducted the experiments, A.D., S.I. and D.G. analysed the results. All authors reviewed the manuscript.

Data availability

All data are included in the manuscript.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Gruyer, D., Pechberti, S. & Glaser, S. Development of full speed range ACC with sivic, a virtual platform for adas prototyping, test and evaluation. In 2013 IEEE Intelligent Vehicles Symposium (IV) 100–105 (IEEE, 2013).
  • 2.Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A. & Koltun, V. Carla: An open urban driving simulator. In Conference on Robot Learning 1–16 (PMLR, 2017).
  • 3.Duminil, A., Ieng, S.-S. & Gruyer, D. A comprehensive exploration of fidelity quantification in computer-generated images. Sensors24, 10.3390/s24082463 (2024). [DOI] [PMC free article] [PubMed]
  • 4.Haralick, R. M., Shanmugam, K. & Dinstein, I. H. Textural features for image classification. IEEE Trans. Syst. Man Cybern. 610–621 (1973).
  • 5.Ye, X., Backlund, P., Ding, J. & Ning, H. Fidelity in simulation-based serious games. IEEE Trans. Learn. Technol.13, 340–353 (2019). [Google Scholar]
  • 6.Reway, F. et al. Test method for measuring the simulation-to-reality gap of camera-based object detection algorithms for autonomous driving. In 2020 IEEE Intelligent Vehicles Symposium (IV) 1249–1256 (IEEE, 2020).
  • 7.Prabhu, V. et al. Bridging the sim2real gap with care: Supervised detection adaptation with conditional alignment and reweighting. arXiv preprint arXiv:2302.04832 (2023).
  • 8.Ngo, A., Bauer, M. P. & Resch, M. A multi-layered approach for measuring the simulation-to-reality gap of radar perception for autonomous driving. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC) 4008–4014 (IEEE, 2021).
  • 9.Huch, S., Scalerandi, L., Rivera, E. & Lienkamp, M. Quantifying the lidar sim-to-real domain shift: A detailed investigation using object detectors and analyzing point clouds at target-level. IEEE Trans. Intell. Veh. (2023).
  • 10.Li, X. et al. Underwater image quality assessment from synthetic to real-world: Dataset and objective method. ACM Trans. Multimed. Comput. Commun. Appl.20, 1–23 (2023). [Google Scholar]
  • 11.Gadipudi, N. et al. Synthetic to real gap estimation of autonomous driving datasets using feature embedding. In 2022 IEEE 5th International Symposium in Robotics and Manufacturing Automation (ROMA) 1–5 (IEEE, 2022).
  • 12.Valdebenito Maturana, C. N., Sandoval Orozco, A. L. & García Villalba, L. J. Exploration of metrics and datasets to assess the fidelity of images generated by generative adversarial networks. Appl. Sci.13, 10637 (2023). [Google Scholar]
  • 13.Scabini, L. et al. A comparative survey of vision transformers for feature extraction in texture analysis. arXiv preprint arXiv:2406.06136 (2024).
  • 14.Yang, F., Yang, H., Fu, J., Lu, H. & Guo, B. Learning texture transformer network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 5791–5800 (2020).
  • 15.Cabon, Y., Murray, N. & Humenberger, M. Virtual kitti 2 (2020). arXiv:2001.10773.
  • 16.Geiger, A., Lenz, P., Stiller, C. & Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. (IJRR) (2013).
  • 17.Cordts, M. et al. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
  • 18.Mao, J. et al. One million scenes for autonomous driving: Once dataset (2021).
  • 19.Caesar, H. et al. nuscenes: A multimodal dataset for autonomous driving. In CVPR (2020).
  • 20.Kenk, M. A. & Hassaballah, M. Dawn: Vehicle detection in adverse weather nature dataset. arXiv preprint arXiv:2008.05402 (2020).
  • 21.Li, S. et al. Single image deraining: A comprehensive benchmark analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 3838–3847 (2019).
  • 22.Sakaridis, C., Dai, D. & Van Gool, L. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021).
  • 23.Deschaud, J.-E. Kitti-carla: A kitti-like dataset generated by carla simulator. arXiv preprint arXiv:2109.00892 (2021).
  • 24.Ros, G., Sellart, L., Materzynska, J., Vazquez, D. & Lopez, A. M. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3234–3243 (2016).
  • 25.Richter, S. R., Vineet, V., Roth, S. & Koltun, V. Playing for data: Ground truth from computer games. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14 102–118 (Springer, 2016).
  • 26.Richter, S. R., AlHaija, H. A. & Koltun, V. Enhancing photorealism enhancement. arXiv:2105.04619 (2021). [DOI] [PubMed]
  • 27.Sakaridis, C., Dai, D. & Van Gool, L. Semantic foggy scene understanding with synthetic data. Int. J. Comput. Vision126, 973–992 (2018). [Google Scholar]
  • 28.Hu, X., Fu, C.-W., Zhu, L. & Heng, P.-A. Depth-attentional features for single-image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 8022–8031 (2019).
  • 29.Barni, M., Kallas, K., Nowroozi, E. & Tondi, B. Cnn detection of gan-generated face images based on cross-band co-occurrences analysis. In 2020 IEEE International Workshop on Information Forensics and Security (WIFS) 1–6 (IEEE, 2020).
  • 30.Ojala, T., Pietikainen, M. & Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell.24, 971–987 (2002). [Google Scholar]
  • 31.Cabeen, K. & Gent, P. Image compression and the discrete cosine transform. College of the Redwoods (1998).
  • 32.Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (2017).
  • 33.Bengio, Y., Louradour, J., Collobert, R. & Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning 41–48 (2009).
  • 34.Dempster, A. P. A generalization of bayesian inference. J. Roy. Stat. Soc.: Ser. B (Methodol.)30, 205–232 (1968). [Google Scholar]
  • 35.Shafer, G. A Mathematical Theory of Evidence (Princeton University Press Princeton, 1976). [Google Scholar]
  • 36.Smets, P. The combination of evidence in the transferable belief model. IEEE Trans. Pattern Anal. Mach. Intell.12, 447–458 (1990). [Google Scholar]
  • 37.Smets, P. Data fusion in the transferable belief model. In Proceedings of the Third International Conference on Information Fusion 10–13 (Citeseer, 2000).
  • 38.Zadeh, L. A simple view of the dempster-shafer theory of evidence and its implication for the rule of combination. Artif. Intell.2, 85–90 (1986). [Google Scholar]
  • 39.Haenni, R. Shedding new light on zadeh’s criticism of dempster’s rule of combination. In 8th International Conference on Information Fusion vol. 2, 6 pp 10.1109/ICIF.2005.1591951 (2005).
  • 40.Magnier, V., Gruyer, D. & Godelle, J. Multi-criteria similarity operator based on the belief theory: Management of similarity, dissimilarity, conflict and ambiguities. In 2017 IEEE Intelligent Vehicles Symposium (IV) 1215–1221 (IEEE, 2017).
  • 41.Appriou, A. Situation assessment based on spatially ambiguous multisensor measurements. Int. J. Intell. Syst.16, 1135–1166 (2001). [Google Scholar]
  • 42.Jacquemart, A., Ieng, S.-S., Hadj-Bachir, M. & Gruyer, D. A new method for parametric bbf generation. In 2024 27th International Conference on Information Fusion (FUSION) 1–8, 10.23919/FUSION59988.2024.10706296 (2024).
