Abstract
Image stitching is a fundamental task in computer vision, enabling critical applications such as panoramic imaging, augmented reality (AR), and autonomous perception systems. However, existing stitching algorithms exhibit significant performance variations under real-world challenges, including illumination changes, noise interference, and geometric distortions, which makes reliable quality assessment difficult. To address this challenge, we introduce StitchEval, a comprehensive benchmark framework that combines objective metrics, namely structural similarity (SSIM), mean squared error (MSE), and peak signal-to-noise ratio (PSNR), with human-rated subjective scores (SS) fused through a human-perception-based scoring system. By applying illumination transformations, noise interference, and geometric variations to the original dataset, we systematically analyze the robustness of different stitching algorithms. Furthermore, the proposed evaluation framework and dataset construction methodology are designed to be highly flexible, allowing seamless integration with other datasets to facilitate cross-dataset comparisons and broader benchmarking of stitching algorithms. The insights derived from this study provide a valuable reference for optimizing future stitching methods and improving algorithm adaptability in real-world scenarios.
Keywords: Image stitching, Benchmark framework, Quality assessment, Human-perception-based scoring system, Cross-dataset comparison
Subject terms: Computer science, Scientific data
Introduction
Image stitching is widely used in various fields, such as panoramic image generation1,2, autonomous driving3, Geographic Information Systems (GIS)4, and virtual reality5. Several publicly available datasets exist for image stitching, including the Object-Centered Stitching Dataset6, SPW Dataset7, VPG Dataset8, Color Consistency Dataset9, UDIS-D10, and WSSN Dataset11. These datasets are diverse and contain raw images from multiple categories, such as houses, buses, and people. However, no existing dataset comprehensively covers multiple scenarios and stitching methods, and in particular none systematically integrates multiple interference conditions (such as noise, illumination changes, and geometric transformations) into quality assessment. In addition, existing evaluation methods rely mainly on simple objective indicators (such as MSE and PSNR) and ignore the relationship between human visual perception and algorithm robustness, leading to deviations between evaluation results and user experience in real scenarios.
To address this gap, this paper proposes StitchEval with the following contributions:
A comprehensive stitching quality assessment dataset is constructed, which simulates real interference conditions through data augmentation techniques (noise injection, illumination adjustment, geometric transformation), covers 280 images, and provides multi-dimensional quality labels (SSIM, MSE, PSNR, manual rating).
A hybrid evaluation framework is proposed which, for the first time, combines traditional objective indicators (SSIM, MSE, PSNR) with a human-rated subjective score (SS) through a dynamic weighting strategy, significantly improving the comprehensiveness and interpretability of the evaluation.
The robustness of mainstream algorithms (SIFT, BRISK, AKAZE) is systematically analyzed to reveal their performance degradation patterns under noise, illumination, and geometric interference, providing an empirical basis for algorithm selection in practical applications.
The framework is designed for scalability, supporting seamless integration with other datasets (such as UDIS-D10) and promoting cross-dataset benchmarking.
This study provides a standardized benchmark for image stitching quality assessment and points out directions for future algorithm optimization (e.g., noise-resistant design, dynamic scene adaptation).
Background and motivations
Image stitching
Image stitching usually includes the following steps: feature extraction and matching, geometric transformation calculation, image transformation and registration, and image fusion. Different methods differ in these steps, mainly including the following categories:
Feature point-based stitching: This type of method relies on matching local feature points to compute the transformation relationship between images. Common algorithms include SIFT (Scale-Invariant Feature Transform)12, BRISK (Binary Robust Invariant Scalable Keypoints)13, and AKAZE (Accelerated-KAZE)14. SIFT locates keypoints in a Gaussian scale space and computes orientation histograms as descriptors, providing rotation and scale invariance and partial robustness to affine changes. After the keypoints are matched, RANSAC (Random Sample Consensus) is used to estimate the homography matrix for image registration. Compared to SIFT, BRISK is a binary descriptor-based algorithm that uses a scale-space pyramid for keypoint detection and applies a sampling pattern to compute a robust descriptor; it is computationally efficient and well suited for real-time applications. AKAZE, on the other hand, is based on nonlinear diffusion filtering, making it more robust to noise and lighting changes, and offers a good balance of speed and accuracy while preserving scale and rotation invariance. These algorithms are suitable for image stitching tasks with large perspective changes and moderate affine transformations. However, they perform poorly in featureless areas (such as solid-color backgrounds), and their computational complexity remains relatively high.
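To make this pipeline concrete, the sketch below implements a minimal SIFT-plus-RANSAC stitcher with OpenCV. It is an illustrative sketch rather than the exact configuration used in our experiments: the ratio-test threshold (0.75), the RANSAC reprojection threshold (5.0 pixels), the doubled canvas width, and the simple overlay compositing (no seam blending) are assumed defaults.

```python
import cv2
import numpy as np

def stitch_pair(img_left, img_right):
    """Minimal feature-based stitching: SIFT keypoints + RANSAC homography."""
    sift = cv2.SIFT_create()
    kp_l, des_l = sift.detectAndCompute(img_left, None)
    kp_r, des_r = sift.detectAndCompute(img_right, None)

    # Match descriptors and keep pairs that pass Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_r, des_l, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    # Estimate the homography mapping the right view onto the left view.
    src = np.float32([kp_r[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_l[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Warp the right view into the left view's frame, then overlay the left view.
    h, w = img_left.shape[:2]
    canvas = cv2.warpPerspective(img_right, H, (w * 2, h))
    canvas[:h, :w] = img_left
    return canvas
```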
Direct stitching: Direct methods do not rely on feature points; instead, they align images by minimizing pixel-wise differences. Examples include the optical flow method15 and the phase correlation method16, which are suitable for scenes with large overlapping areas but are sensitive to lighting changes and incur high computational cost.
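As a minimal illustration of the direct approach, the sketch below uses OpenCV's phase correlation to recover a translational offset between two grayscale views; it assumes the restricted pure-translation setting in which this method applies.

```python
import cv2
import numpy as np

def estimate_translation(img_a, img_b):
    """Direct alignment sketch: estimate a sub-pixel shift via phase correlation."""
    a = np.float32(cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY))
    b = np.float32(cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY))
    # Returns the (dx, dy) shift maximizing spectral correlation and a
    # response value that can serve as a confidence measure.
    (dx, dy), response = cv2.phaseCorrelate(a, b)
    return dx, dy, response
```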
Deep learning-based stitching: In recent years, deep learning methods have been used for image stitching, mainly for feature extraction, registration optimization and image fusion. For example, Deep Image Stitching10 and HomographyNet17 use CNN or Transformer to predict the homography matrix, and recent studies (such as Swin Transformer18) improve the stitching quality of complex scenes by modeling long-range dependencies. However, these methods rely on a large amount of training data and have limited generalization capabilities to extreme geometric distortions.
Motivations
Current research on image stitching algorithms mainly focuses on improving stitching accuracy and computational efficiency, while systematic evaluation of stitching quality remains underexplored. Existing evaluation methods mostly rely on simple image similarity indicators, such as mean squared error (MSE) and peak signal-to-noise ratio (PSNR). Although these indicators are computationally convenient, they struggle to fully reflect human visual perception of stitching quality. For example, even if the MSE of a stitched image is low, artifacts such as unnatural boundaries at the seams, color inconsistency, or geometric distortion can still significantly reduce the perceived quality and degrade the user experience in practical application scenarios. Specifically, the shortcomings of existing research can be summarized in the following three aspects:
Dataset limitations: Existing datasets (such as APAP19) lack coverage of real interference such as noise and illumination changes, which limits the performance verification of the algorithm in complex environments.
Single evaluation indicators: Most studies rely on pixel-level indicators such as MSE and PSNR, ignoring structural consistency (SSIM) and human visual perception. Recent attempts have introduced perceptual indicators such as LPIPS20 and VIF21, but their application in the field of stitching has not yet been systematized.
Lack of robustness evaluation under dynamic interference: Most existing evaluation methods focus on clean, static image pairs, and rarely assess algorithm robustness under real-world image-level disturbances such as noise, lighting variation, or geometric transformations. While dynamic scene stitching22 involving temporal continuity is important, the evaluation frameworks for such scenarios remain in early exploration. In this work, we take a step forward by simulating typical dynamic interference patterns within static image pairs and analyzing the performance degradation of popular algorithms.
To address the above challenges, this paper aims to fill the gaps through the following research:
Construct a multi-interference stitching dataset containing noise, illumination changes and geometric transformations to simulate the complexity of real scenes;
Design a hybrid evaluation framework that integrates objective indicators (SSIM, PSNR, MSE) and human eye scores to enhance perceptual alignment;
Systematically analyze how mainstream algorithms (SIFT, BRISK, AKAZE) respond to typical image-level disturbances–such as noise, lighting variation, and rotation–in order to support algorithm adaptation for practical stitching tasks.
By integrating multi-dimensional evaluation and robustness analysis, this study provides a solution that is closer to actual needs for image stitching quality assessment, and lays the foundation for algorithm optimization in complex scenarios (such as noise-resistant design and enhanced spatiotemporal consistency).
Dataset construction
This study is based on the APAP dataset19 with 140 images in total, selecting high-quality left and right view images as raw data and applying various stitching algorithms to generate the final stitched results. The dataset encompasses a diverse range of image sets, including rail tracks, temples, parking lots, apartments, construction sites, and gardens. Figure 1 illustrates a representative sample from the APAP dataset, featuring left and right views of a building, house, railway section, and temple, demonstrating the variety of structures and scenes captured in the dataset.
Figure 1.
A set of images in APAP dataset.
Data expansion method
To comprehensively evaluate the robustness of different stitching algorithms, we introduced the following interference factors based on the original dataset:
Illumination Variation: Adjusted brightness, contrast, and hue to simulate stitching performance under different lighting conditions, ensuring the algorithm’s adaptability to varying exposure and shadows.
Noise Interference: Added multiple types of noise, including Gaussian noise, salt-and-pepper noise, and Poisson noise, to assess the algorithm’s resilience in handling degraded image quality and preserving structural integrity.
Geometric Transformations: Applied transformations such as scaling, rotation, and perspective distortion to evaluate the algorithm’s ability to align and stitch images with different viewpoint changes and geometric deformations.
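The sketch below illustrates how such perturbation branches can be implemented with OpenCV and NumPy; the specific parameter values (brightness gain, noise levels, rotation angle) are illustrative assumptions rather than the exact settings used to build the dataset.

```python
import cv2
import numpy as np

def adjust_illumination(img, alpha=1.3, beta=20):
    """Brightness/contrast change: out = alpha * img + beta, clipped to [0, 255]."""
    return cv2.convertScaleAbs(img, alpha=alpha, beta=beta)

def add_gaussian_noise(img, sigma=15.0):
    """Additive Gaussian noise with standard deviation sigma."""
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)

def add_salt_pepper(img, ratio=0.01):
    """Salt-and-pepper noise: flip a small fraction of pixels to 0 or 255."""
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < ratio / 2] = 0
    out[mask > 1 - ratio / 2] = 255
    return out

def rotate(img, angle=10.0):
    """Rotate about the image center, keeping the original canvas size."""
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (w, h))
```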
Figure 2 illustrates the dataset expansion process based on the APAP dataset. The original left and right images, along with the reference image, undergo illumination adjustments, random rotations, and noise addition to simulate various real-world conditions. After these transformations, one of the three feature extraction and matching algorithms–SIFT, BRISK, or AKAZE–is randomly applied to generate the final stitched results.
Figure 2.
The data expansion workflow. Each original image pair undergoes different perturbation branches (rotation, illumination, noise), and one of the stitching algorithms (SIFT, BRISK, or AKAZE) is applied to generate the final stitched result.
This process results in four distinct stitching datasets: the normal dataset, the noise-augmented dataset, the illumination-adjusted dataset, and the rotation-transformed dataset. The original dataset consists of 140 images (70 image pairs). Each of the 70 image pairs is processed under the four conditions (normal, noise, illumination change, and rotation) to generate one stitched image per condition, resulting in 280 stitched images in total.
Evaluation metrics for image stitching quality
In order to comprehensively evaluate the quality of image stitching, we use two complementary approaches: objective metrics and subjective metrics. Objective metrics measure the clarity, fidelity, and seam smoothness of the stitched image, while subjective metrics incorporate human visual perception into the quality score. Each metric is described in detail below.
Objective metrics
The Structural Similarity Index Measure (SSIM) is a crucial metric for assessing the structural, brightness, and contrast similarities between two images. It ranges from −1 to 1, where 1 indicates identical images, while values near 0 or negative suggest significant distortion.
For stitched images, we compute SSIM to evaluate the perceptual quality by comparing the stitched result with the original complete image. The SSIM formula is as follows:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{1}$$

where $x$ and $y$ represent the original and stitched images, respectively, $\mu_x$ and $\mu_y$ denote their mean intensities, $\sigma_x^2$ and $\sigma_y^2$ are their variances, and $\sigma_{xy}$ is their covariance. The constants $C_1$ and $C_2$ stabilize the computation. In image stitching, a higher SSIM value signifies better visual fidelity, while a lower SSIM suggests noticeable distortions, contrast variations, or misalignment at the seams.
In addition to SSIM, the Mean Squared Error (MSE) is used to quantify pixel-level discrepancies between the stitched and reference images. A lower MSE value indicates a smaller error and higher stitching quality. However, MSE purely measures pixel-wise differences and does not capture perceptual quality as judged by the human eye. The MSE formula is given by:

$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ I(i, j) - R(i, j) \right]^2 \tag{2}$$

where $I(i, j)$ and $R(i, j)$ are the pixel intensities of the stitched and reference images, respectively, and $M$ and $N$ are the image dimensions. Since MSE is sensitive to illumination variations and noise, it is often used in conjunction with SSIM for a more comprehensive analysis.
Another key metric is the Peak Signal-to-Noise Ratio (PSNR), which assesses noise levels within the stitched regions, particularly in areas where unnatural transitions may occur. PSNR is computed as follows:
$$\mathrm{PSNR} = 10 \log_{10}\!\left( \frac{\max(I)^2}{\mathrm{MSE}} \right) \tag{3}$$

where $\max(I)$ represents the maximum pixel intensity of the stitched image and the denominator is the mean squared error from Eq. (2). A higher PSNR value indicates lower noise and better stitching quality. PSNR is particularly useful for evaluating artifacts caused by lighting inconsistencies, blur, or compression.
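Putting the three objective metrics together, the sketch below computes SSIM, MSE, and PSNR using scikit-image's reference implementations; it assumes the stitched result has already been registered and cropped to the reference image's size and that both images are 8-bit.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def objective_metrics(stitched_bgr, reference_bgr):
    """Compute SSIM (Eq. 1), MSE (Eq. 2), and PSNR (Eq. 3) against a reference."""
    stitched = cv2.cvtColor(stitched_bgr, cv2.COLOR_BGR2GRAY)
    reference = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2GRAY)

    mse = np.mean((stitched.astype(np.float64) - reference.astype(np.float64)) ** 2)
    psnr = peak_signal_noise_ratio(reference, stitched, data_range=255)
    ssim = structural_similarity(reference, stitched, data_range=255)
    return {"SSIM": ssim, "MSE": mse, "PSNR": psnr}
```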
Subjective metrics
While objective metrics (SSIM, MSE, PSNR) quantify numerical differences, they do not always align with human visual perception. To bridge this gap, we introduce Subjective Scores (SS), where experts or users manually rate the stitching quality using a 5-point scale, as shown in Table 1. In our subjective evaluation, we invited 10 participants (5 with computer vision background and 5 laypersons) to rate the stitching quality. Each participant independently scored the stitched images across four interference categories (normal, noise, illumination change, and rotation), and the final SS is computed by averaging all participant ratings. The scoring was conducted in a controlled environment with standardized display settings to minimize bias.
Table 1.
A 5-point system of subjective scores.
| Subjective scores | Stitching quality |
|---|---|
| 1 | Very poor (serious misalignment, obvious stitching gap) |
| 2 | Poor (obvious misalignment, but the content can still be recognized) |
| 3 | Average (slight misalignment, visible stitching marks) |
| 4 | Good (basically no misalignment, smooth stitching) |
| 5 | Excellent (completely no misalignment, almost no difference from the original image) |
Comprehensive scoring system
To provide a holistic assessment of image stitching quality, we propose a weighted fusion scoring method that integrates objective metrics (SSIM, PSNR, MSE) with subjective scores (SS). The overall quality score Q is computed as follows:
$$Q = w_1 \cdot \mathrm{SSIM}' + w_2 \cdot \mathrm{PSNR}' + w_3 \cdot \mathrm{MSE}' + w_4 \cdot \mathrm{SS}' \tag{4}$$

where SSIM′, PSNR′, MSE′, and SS′ are scores normalized to [0, 1] using min-max scaling. Since a higher MSE indicates worse quality, its normalized value is inverted to ensure that larger scores consistently reflect better stitching quality. The weighting coefficients $w_1$, $w_2$, $w_3$, and $w_4$ control each metric's influence on the final score. In this study, we empirically set the weighting parameters in Formula (4) as $w_1 = 0.3$, $w_2 = 0.25$, $w_3 = 0.15$, and $w_4 = 0.3$, ensuring that the sum equals 1. These values reflect the relative importance of structural similarity (SSIM), noise level (PSNR), pixel-wise error (MSE), and perceptual judgment (SS). Among them, SSIM and the subjective score (SS) were assigned higher weights due to their stronger alignment with visual perception, while MSE, being sensitive but less interpretable, received a lower weight. This balance was validated through sensitivity tests and manual inspection.
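A minimal sketch of this weighted fusion is given below; it assumes the four inputs have already been min-max normalized to [0, 1] (with MSE inverted, as described above) and uses the empirical weights from Formula (4).

```python
def comprehensive_score(ssim_n, psnr_n, mse_n, ss_n,
                        weights=(0.30, 0.25, 0.15, 0.30)):
    """Weighted fusion of normalized metrics into the overall quality score Q.

    All inputs are assumed to lie in [0, 1], with mse_n already inverted
    (1 - normalized MSE) so that higher consistently means better.
    """
    w1, w2, w3, w4 = weights
    return w1 * ssim_n + w2 * psnr_n + w3 * mse_n + w4 * ss_n
```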
Alternatively, machine learning techniques such as regression analysis, random forests, or neural networks can be applied to dynamically optimize these weights based on experimental data, ensuring better alignment with human perception. Moreover, different applications may emphasize distinct metrics: in medical image stitching, MSE may be more critical due to the need for precise anatomical alignment, while in natural scene stitching, SSIM might be prioritized for structural coherence. To address this variability, an adaptive weighting strategy can be implemented, adjusting weight distributions based on specific task requirements. Finally, the comprehensive score Q can be used to classify stitching quality into four bands, ordered from highest to lowest Q:
excellent quality (seamless, high-quality stitching);
good quality (minor artifacts, but overall acceptable);
fair quality (visible artifacts, but retains structural integrity);
poor quality (significant errors, evident misalignment).
This scoring framework not only facilitates quantitative evaluation but also provides actionable insights for refining and optimizing image stitching algorithms.
Experimental results
During the experiment, we evaluated the quality of three popular image stitching methods (SIFT, BRISK, and AKAZE) based on structural similarity (SSIM), peak signal-to-noise ratio (PSNR), mean squared error (MSE), and subjective score (SS), and used the weighted comprehensive score Q for the final quality judgment. These three methods were selected due to their wide adoption and well-understood characteristics as classical feature-based algorithms. Their distinct detection and description schemes allow for interpretable robustness analysis under diverse perturbations. Moreover, they are commonly used as baselines in both academic literature and practical applications, making them ideal candidates for benchmarking in the StitchEval framework. While our focus is on establishing a flexible evaluation methodology, the framework is extensible and can incorporate deep learning-based approaches, which we consider part of future work. The experimental dataset contains a variety of typical stitching scenes, such as natural scenery, urban buildings, and indoor scenes, and introduces different degrees of noise, illumination changes, and other interference to test the robustness of the algorithms. To facilitate comparison between different indicators, we normalized MSE, PSNR, SSIM, and the comprehensive score Q using min-max normalization as follows:
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{5}$$

where $X'$ is the normalized value, and $X_{\min}$ and $X_{\max}$ are the minimum and maximum values of the indicator across all experimental data, respectively.
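For illustration, applying Formula (5) over a batch of metric values, together with the MSE inversion used in the Q computation, might look like the following sketch; the array-based interface is an assumption for exposition.

```python
import numpy as np

def min_max_normalize(values, invert=False):
    """Min-max normalize a 1-D array of metric values to [0, 1] (Eq. 5).

    Set invert=True for error metrics such as MSE, where lower raw values
    should map to higher normalized scores.
    """
    v = np.asarray(values, dtype=np.float64)
    span = v.max() - v.min()
    normed = (v - v.min()) / span if span > 0 else np.zeros_like(v)
    return 1.0 - normed if invert else normed
```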
Expanded dataset
To comprehensively evaluate the performance of different stitching algorithms under various real-world conditions, we expanded the original APAP dataset by introducing a stitched-image dataset incorporating multiple transformations and distortions. This dataset is generated by randomly combining three stitching methods (SIFT, BRISK, and AKAZE) and applying various perturbations, ensuring diverse testing scenarios. Specifically, we introduce the following modifications to the dataset:
Illumination changes: Adjusting brightness, contrast, and hue to simulate different lighting conditions.
Noise interference: Adding Gaussian noise, salt-and-pepper noise, and Poisson noise to evaluate robustness under low-quality images.
Geometric transformations: Applying rotations, scaling, and perspective distortions to assess adaptability to viewpoint variations.
The expanded dataset contains four distinct stitching categories: the normal dataset, the noise-augmented dataset, the illumination-adjusted dataset, and the rotation-transformed dataset. The original dataset consists of 140 images (70 image pairs), and after expansion, the total number of stitched images across the four categories is 280, with each category containing 70 stitched images. Figure 3 provides a visual representation of the enhanced dataset, showcasing stitched results under different conditions, including illumination variations, noise interference, and geometric transformations. These modifications create a challenging and diverse benchmark, facilitating a more comprehensive evaluation of stitching quality across different environmental conditions.
Figure 3.
A set of images of the expanded dataset.
MSE error analysis
MSE measures the error between the stitched image and the original image; the smaller the value, the better the stitching quality. Table 2 below shows the normalized MSE results of different methods under normal stitching, noise interference, illumination change, and rotation transformation:
Table 2.
Normalized results of MSE with different stitching methods.
| Method | Normal stitching | Noise interference | Light change | Rotation transformation |
|---|---|---|---|---|
| SIFT | 0.24 | 1.00 | 0.82 | 0.76 |
| BRISK | 0.32 | 0.95 | 0.78 | 0.70 |
| AKAZE | 0.48 | 0.88 | 0.72 | 0.65 |
Based on the above results, noise interference has the greatest impact on MSE, with the SIFT method showing the largest increase. The AKAZE method has a relatively high MSE under normal stitching, exceeding that of SIFT and BRISK, although its error grows more slowly under interference. Rotation transformation has comparatively little impact, with relatively stable MSE changes.
PSNR analysis
PSNR measures signal quality; a larger value indicates higher stitching quality. Table 3 presents the normalized PSNR results for the different methods. Under normal stitching, the SIFT method attains the highest PSNR (normalized to 1.00), indicating the best stitching quality. Noise interference has the greatest impact on PSNR, causing a significant drop for all methods. The AKAZE method shows the smallest PSNR decrease under illumination changes, indicating good illumination robustness.
Table 3.
Normalized results of PSNR with different stitching methods.
| Method | Normal stitching | Noise interference | Light change | Rotation transformation |
|---|---|---|---|---|
| SIFT | 1.00 | 0.34 | 0.51 | 0.60 |
| BRISK | 0.92 | 0.42 | 0.60 | 0.68 |
| AKAZE | 0.85 | 0.50 | 0.72 | 0.75 |
SSIM analysis
SSIM measures structural similarity; the larger the value, the closer the stitched result is to the original image. Table 4 presents the normalized SSIM results. Based on the results, the SIFT method has the highest SSIM under normal stitching (normalized to 1.00) but is most affected by noise. The AKAZE method shows smaller SSIM drops under illumination and rotation transformations, indicating better adaptability to these conditions. The SSIM of the BRISK method is relatively stable in all cases and performs well.
Table 4.
Normalized results of SSIM with different stitching methods.
| Method | Normal stitching | Noise interference | Light change | Rotation transformation |
|---|---|---|---|---|
| SIFT | 1.00 | 0.30 | 0.52 | 0.60 |
| BRISK | 0.94 | 0.38 | 0.60 | 0.70 |
| AKAZE | 0.89 | 0.48 | 0.72 | 0.78 |
Comprehensive score Q normalization analysis
Due to the incorporation of both subjective scores based on human perception and multiple objective image quality evaluation metrics (SSIM, PSNR, and MSE), the weighted score Q provides a comprehensive measure of the final stitching quality. A higher Q value indicates better perceptual and structural stitching performance. Table 5 presents the normalized Q scores. As shown, the SIFT method achieves the highest score (normalized to 1.00) under normal stitching conditions, reflecting its superior performance in ideal scenarios. However, it is also the most sensitive to noise interference, with its Q score dropping to 0.06. In contrast, the AKAZE method demonstrates stronger robustness under illumination changes and rotational transformations, achieving the highest scores in these categories, indicating stable performance across challenging perturbations.
Table 5.
Normalized results of comprehensive score Q with different stitching methods.
| Method | Normal stitching | Noise interference | Light change | Rotation transformation |
|---|---|---|---|---|
| SIFT | 1.00 | 0.06 | 0.88 | 0.66 |
| BRISK | 0.96 | 0.17 | 0.48 | 0.90 |
| AKAZE | 0.80 | 0.08 | 0.43 | 0.98 |
Comparative analysis of different stitching methods
SIFT: Fig. 4 shows original images and the corresponding SIFT-based stitching results under different interference factors; the related metrics are shown in Table 6. The advantages of the SIFT method are the best ordinary stitching quality (the lowest MSE and the highest SSIM) and strong adaptability to illumination changes, with only a small decrease in PSNR. However, its disadvantage is also obvious: it is sensitive to noise, and its stitching quality degrades most severely, with a Q value of only 0.055 in the high-noise case.
Figure 4.
Stitching image with SIFT.
Table 6.
Normalized MSE, PSNR, SSIM, SS and comprehensive score Q for Fig. 4 (SIFT).

| Conditions | MSE′ | PSNR′ | SSIM′ | SS′ | Q |
|---|---|---|---|---|---|
| Normal stitching | 0.000 | 1.000 | 1.000 | 1.000 | 0.850 |
| Stitching with noise interference | 1.000 | 0.032 | 0.455 | 0.200 | 0.055 |
| Stitching with light change | 0.000 | 1.000 | 1.000 | 0.652 | 0.745 |
| Stitching with geometric transformation | 0.236 | 0.605 | 0.824 | 0.666 | 0.563 |
BRISK: Fig. 5 shows original images and related BRISK-based stitching results under different interference factors, and the corresponding metrics are summarized in Table 7. Overall, BRISK demonstrates strong robustness to illumination changes and geometric transformations, with relatively small stitching errors and stable performance across different scenarios. Compared to SIFT, BRISK has a slightly lower SSIM under normal conditions, indicating marginally reduced stitching accuracy. However, it outperforms SIFT under noise interference, with a higher Q score (0.148 vs. 0.055), reflecting improved stability. Due to moderately low subjective scores in some cases, BRISK’s comprehensive Q scores remain at a good level but do not reach the highest benchmark.
Figure 5.
Stitching image with BRISK.
Table 7.
Normalized MSE, PSNR, SSIM, SS and comprehensive score Q for Fig. 5 (BRISK).

| Conditions | MSE′ | PSNR′ | SSIM′ | SS′ | Q |
|---|---|---|---|---|---|
| Normal stitching | 0.021 | 0.952 | 0.923 | 1.000 | 0.812 |
| Stitching with noise interference | 0.850 | 0.200 | 0.250 | 0.500 | 0.148 |
| Stitching with light change | 0.500 | 0.450 | 0.813 | 0.410 | 0.404 |
| Stitching with geometric transformation | 0.150 | 0.827 | 0.939 | 1.000 | 0.766 |
AKAZE: Fig. 6 shows original images and corresponding AKAZE-based stitching results with various perturbations, and detailed metrics are provided in Table 8. AKAZE exhibits relatively strong performance under geometric transformations, achieving the highest Q score (0.829) among all three methods in that category. While its SSIM under illumination changes is slightly lower than BRISK’s (0.714 vs. 0.813), it maintains competitive visual quality and consistent subjective scores. However, AKAZE generally produces higher MSE values and exhibits moderate robustness to noise interference, with a Q score (0.067) comparable to SIFT but lower than BRISK. These results suggest that AKAZE is particularly suitable for scenarios involving perspective or rotational variations, while its performance under severe noise requires further enhancement.
Figure 6.
Stitching image with AKAZE.
Table 8.
Normalized MSE, PSNR, SSIM, SS and comprehensive score Q for Fig. 6 (AKAZE).

| Conditions | MSE′ | PSNR′ | SSIM′ | SS′ | Q |
|---|---|---|---|---|---|
| Normal stitching | 0.162 | 0.769 | 0.790 | 0.911 | 0.678 |
| Stitching with noise interference | 0.850 | 0.150 | 0.200 | 0.323 | 0.067 |
| Stitching with light change | 0.595 | 0.311 | 0.714 | 0.530 | 0.362 |
| Stitching with geometric transformation | 0.000 | 1.000 | 1.000 | 0.930 | 0.829 |
Figure 7 illustrates the distribution of stitching quality (Q) scores for the three feature point matching methods (SIFT, BRISK, and AKAZE) under different interference conditions (Normal, Noise, Light, and Trans), averaged across all image pairs in the dataset. The results indicate that the SIFT method achieves the highest Q score of 0.85 under normal conditions, but its performance deteriorates significantly under noise interference, dropping to 0.35, making it the most sensitive to noise. The BRISK method performs best under illumination changes (Light, 0.70) and remains stable under rotational transformations (Trans, 0.60), exhibiting strong overall stability. AKAZE is comparably robust, with a Q score of 0.68 under Light and the highest score of 0.65 under Trans, showing the least sensitivity to interference overall. Among the tested interference types, noise causes the most pronounced degradation in quantitative metrics such as PSNR and SSIM across all methods. While PSNR is inherently sensitive to the pixel-level fluctuations introduced by noise, the proposed evaluation framework also incorporates SSIM and subjective ratings (SS), which more accurately reflect perceptual stitching quality, including alignment precision and ghosting artifacts. Accordingly, the observed decline in Q scores under noise interference reflects both structural distortion and perceptual degradation. Notably, SIFT is the most sensitive to noise, while BRISK and AKAZE maintain relatively stable alignment under such conditions.
Figure 7.
Final Q-score distribution of different methods under four perturbations.
Related works
Image stitching is a widely researched topic in computer vision, with applications in panoramic image generation1, autonomous driving3 and virtual reality5. Over the years, various methods have been developed to improve stitching accuracy, efficiency, and robustness under different conditions.
Feature-based stitching methods: Feature-based approaches, such as SIFT12, SURF23, ORB24, and AKAZE14, detect and match keypoints between overlapping images. These methods are robust to scale, rotation, and illumination changes, making them suitable for general-purpose stitching. However, these methods still suffer from alignment errors when dealing with low-texture regions.
Direct pixel-based methods: Pixel-based approaches, such as gradient-domain blending25 and Poisson image editing26, minimize intensity differences between stitched regions by performing global optimization. While these methods improve visual consistency, they are computationally expensive and sensitive to illumination and exposure variations.
Deep learning-based methods: Recent advancements in deep learning have introduced data-driven image stitching27, where convolutional neural networks (CNNs) and transformers learn to predict seamless image transitions. Methods such as Deep Image Stitching (DIS) and UDIS10 use large-scale datasets to refine feature extraction and blending. In recent years, Transformer-based models (such as Swin Transformer18) have improved stitching quality in complex scenes by modeling long-range dependencies, and video stitching methods for dynamic scenes (such as those built on the VSPW dataset22) have begun to consider the impact of temporal consistency on quality assessment. However, these approaches typically require large amounts of training data and struggle with extreme geometric distortions.
Evaluation metrics in image stitching: Most previous studies use simple metrics such as MSE and PSNR to evaluate stitching quality, which mainly measure pixel differences. However, these metrics do not fully reflect the perceived quality of humans, especially when structural artifacts are introduced at the stitching boundaries. Some studies have combined SSIM to quantify perceptual similarity, while others have introduced subjective evaluations to achieve more human-friendly evaluations5. Recent work has attempted to introduce more complex perceptual metrics such as LPIPS (perceptual similarity based on deep features)20 and VIF (visual information fidelity)21, but the application of these methods in the field of image stitching has not been systematic. In addition, existing studies have mostly focused on static images, and the evaluation framework for dynamic scenes (such as video stream stitching) is still in the exploratory stage28.
Motivation and contribution: While existing datasets and evaluation methods provide valuable insights, they often lack diverse scene coverage, quality annotations, and robustness testing under different interference conditions. This study addresses these gaps by:
Dataset diversity: Compared with existing datasets (such as SPW7, VPG8), the dataset constructed in this paper systematically integrates illumination changes, noise interference and geometric transformation for the first time, covers a wider range of real scenes, and provides multi-dimensional quality labels (SSIM, MSE, PSNR, Subjective Scores).
Hybrid evaluation framework: Existing studies mostly rely on a single type of indicator (such as only objective or only subjective), while the framework proposed in this paper combines traditional indicators, perceptual indicators (SSIM) and human eye ratings, and optimizes the comprehensive score through a dynamic weighting strategy, significantly improving the comprehensiveness of the evaluation.
Robustness analysis: Existing work (such as APAP19) mainly focuses on algorithm performance under ideal conditions, while this paper systematically reveals, for the first time, the performance degradation patterns of SIFT, BRISK, and AKAZE under noise, illumination, and geometric interference, providing an empirical basis for algorithm selection in practical applications.
Scalability: The dataset construction method and evaluation framework proposed in this paper support seamless integration with external datasets (such as UDIS-D10), facilitating cross-dataset comparison, while existing research is mostly limited to the evaluation of a single data source.
This work aims to establish a benchmark for evaluating stitching quality, providing a standardized dataset and scoring methodology for future research.
Conclusion
We present StitchEval, an innovative and scalable benchmarking framework for comprehensive evaluation of image stitching algorithms under realistic operating conditions. The framework incorporates three critical perturbation categories: (1) illumination variations (brightness, contrast, and exposure changes), (2) multi-source noise interference (Gaussian, salt-and-pepper, and Poisson noise), and (3) geometric transformations (scaling, rotation, and perspective distortions). Through systematic experimentation with representative algorithms (SIFT, BRISK, AKAZE), StitchEval reveals several key findings:
SIFT achieves the highest quality in ideal conditions but is highly sensitive to noise.
BRISK exhibits strong robustness against illumination and rotational variations.
AKAZE maintains stable performance across transformations, making it suitable for challenging scenes.
Noise interference leads to noticeable perceptual degradation across all methods, particularly in structural integrity and edge consistency, underscoring the importance of incorporating denoising or robustness-enhancing mechanisms in stitching pipelines.
StitchEval establishes several important contributions to the field:
The first standardized benchmark specifically designed for robustness evaluation in image stitching.
A modular architecture supporting seamless integration with existing datasets (e.g., through API-based extensions).
Quantitative evidence of algorithm-specific failure modes under controlled perturbations.
StitchEval provides a standardized benchmark for future image stitching research, offering insights into algorithmic strengths and weaknesses across diverse conditions. Moreover, StitchEval combines both objective and subjective indicators to ensure that core stitching quality aspects–such as alignment accuracy and ghosting minimization–are not overshadowed by pixel-level variations. Designed to be highly adaptable, the evaluation framework and dataset construction methodology can be seamlessly integrated with existing datasets, enabling comprehensive cross-dataset comparisons and facilitating broader benchmarking of stitching algorithms. Future work may explore deep learning-based enhancement techniques, adaptive feature selection strategies, and the incorporation of temporal consistency in video stitching applications. Additionally, incorporating recent deep learning-based stitching models such as HomographyNet, Deep Image Stitching (DIS), SuperGlue and Transformer-based methods into the StitchEval pipeline will further demonstrate the framework’s generality and benchmarking potential across both classical and learning-based paradigms.
Author contributions
Conceptualization, Yiding Liu and Yanyong Wang; methodology, Di Wang; software, Yiding Liu; validation, Wentao Wu; formal analysis, Xingxin Li; investigation, Weichen Sun; resources, Xuebing Ren; data curation, Xuebing Ren, Haiping Song; writing—original draft preparation, Yiding Liu; writing—review and editing, Yiding Liu, Yanyong Wang; visualization, Yiding Liu; supervision, Haiping Song; project administration, Yanyong Wang, Haiping Song; funding acquisition, Haiping Song.
Funding
No Funding.
Data availability
The data that support the findings of this study are available from the corresponding author, Dr. Yiding Liu, upon reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Alomran, M. & Chai, D. Feature-based panoramic image stitching. In 2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV) 1–6 (IEEE, 2016).
- 2. Abbadi, N. K. E., Al Hassani, S. A. & Abdulkhaleq, A. H. A review over panoramic image stitching techniques. J. Phys. Conf. Ser. 1999, 012115 (2021).
- 3. Wang, L., Yu, W. & Li, B. Multi-scenes image stitching based on autonomous driving. In 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) Vol. 1, 694–698 (IEEE, 2020).
- 4. Wang, J., Du, P., Yang, S., Zhang, Z. & Ning, J. A spatial arrangement preservation based stitching method via geographic coordinates of UAV for farmland remote sensing image. IEEE Trans. Geosci. Remote Sens. 62, 1–13 (2024).
- 5. Madhusudana, P. C. & Soundararajan, R. Subjective and objective quality assessment of stitched images for virtual reality. IEEE Trans. Image Process. 28, 5620–5635 (2019).
- 6. Herrmann, C., Wang, C., Bowen, R. S., Keyder, E. & Zabih, R. Object-centered image stitching. In Proceedings of the European Conference on Computer Vision (ECCV) 821–835 (2018).
- 7. Liao, T. & Li, N. Single-perspective warps in natural image stitching. IEEE Trans. Image Process. 29, 724–735 (2019).
- 8. Chen, K. et al. Vanishing point guided natural image stitching. arXiv preprint arXiv:2004.02478 (2020).
- 9. Xia, M., Yao, J. & Gao, Z. A closed-form solution for multi-view color correction with gradient preservation. ISPRS J. Photogramm. Remote Sens. 157, 188–200 (2019).
- 10. Nie, L., Lin, C., Liao, K., Liu, S. & Zhao, Y. Unsupervised deep image stitching: Reconstructing stitched features to images. IEEE Trans. Image Process. 30, 6184–6197 (2021).
- 11. Song, D.-Y., Lee, G., Lee, H., Um, G.-M. & Cho, D. Weakly-supervised stitching network for real-world panoramic image generation. In European Conference on Computer Vision 54–71 (Springer, 2022).
- 12. Lowe, D. G. Object recognition from local scale-invariant features. In Proceedings of the 7th IEEE International Conference on Computer Vision Vol. 2, 1150–1157 (IEEE, 1999).
- 13. Leutenegger, S., Chli, M. & Siegwart, R. Y. BRISK: Binary robust invariant scalable keypoints. In 2011 International Conference on Computer Vision 2548–2555 (IEEE, 2011).
- 14. Alcantarilla, P. F., Nuevo, J. & Bartoli, A. Fast explicit diffusion for accelerated features in nonlinear scale spaces. In British Machine Vision Conference (BMVC) (2013).
- 15. Sharmin, N. & Brad, R. Optimal filter estimation for Lucas–Kanade optical flow. Sensors 12, 12694–12709 (2012).
- 16. Xu, M., Liu, X. & Wan, C. Real-time stitching algorithm of vehicle side view image based on multi-region fast phase correlation. IEEE Access 10.1109/ACCESS.2024.3525181 (2025).
- 17. DeTone, D., Malisiewicz, T. & Rabinovich, A. Deep image homography estimation. arXiv preprint arXiv:1606.03798 (2016).
- 18. Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision 10012–10022 (2021).
- 19. Zaragoza, J., Chin, T.-J., Brown, M. S. & Suter, D. As-projective-as-possible image stitching with moving DLT. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2339–2346 (2013).
- 20. Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 586–595 (2018).
- 21. Sheikh, H. R. & Bovik, A. C. Image information and visual quality. IEEE Trans. Image Process. 15, 430–444 (2006).
- 22. Miao, J. et al. VSPW: A large-scale dataset for video scene parsing in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 4133–4143 (2021).
- 23. Bay, H., Tuytelaars, T. & Van Gool, L. SURF: Speeded up robust features. In Computer Vision – ECCV 2006 404–417 (Springer, 2006).
- 24. Rublee, E., Rabaud, V., Konolige, K. & Bradski, G. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision 2564–2571 (IEEE, 2011).
- 25. Levin, A., Zomet, A., Peleg, S. & Weiss, Y. Seamless image stitching in the gradient domain. In Computer Vision – ECCV 2004 377–389 (Springer, 2004).
- 26. Pérez, P., Gangnet, M. & Blake, A. Poisson image editing. In Seminal Graphics Papers: Pushing the Boundaries Vol. 2, 577–582 (2023).
- 27. Nie, L., Lin, C., Liao, K., Liu, S. & Zhao, Y. Parallax-tolerant unsupervised deep image stitching. In Proceedings of the IEEE/CVF International Conference on Computer Vision 7399–7408 (2023).
- 28. Bai, X., Wang, Y., Zhang, C. & Hu, H. Improved algorithm based optimal stitching line in video stitching quality. In 2024 4th International Conference on Electronic Information Engineering and Computer Science (EIECS) 306–309 (IEEE, 2024).