Scientific Reports. 2025 Oct 28;15:37684. doi: 10.1038/s41598-025-21602-5

Efficient image matching for UAV visual navigation via DALGlue

Hansheng Zhang 1, Hao He 1,2, Yatong Zhou 1,2, Fan Lei 3
PMCID: PMC12569080  PMID: 41152462

Abstract

With the rapid advancement of UAVs, image matching algorithms for visual navigation have become increasingly widely used. However, extensive research demonstrates that existing algorithms exhibit insufficient matching accuracy and long response times in challenging aerial scenes, and therefore cannot meet the requirements of UAV visual navigation. In this paper, we propose an efficient image matching method for UAV visual navigation, named DALGlue, which combines a convolutional neural network feature extraction algorithm with a feature matching network based on a linear attention mechanism. DALGlue uses the dual-tree complex wavelet transform to preprocess the collected aerial images, which enhances structural information and fine details. Compared with processing raw images directly, the dual-tree complex wavelet transform module addresses the problem of edge blurring during dynamic UAV flight. Then, an adaptive spatial feature fusion module is developed to extract features from the images and compute feature points and descriptors. In addition, we employ a linear attention mechanism to aggregate image features, which effectively reduces computational cost while improving network performance. Finally, the Sinkhorn algorithm is used to compute the assignment matrix and output the optimal assignment. DALGlue offers a balance between accuracy and real-time performance and can operate under strict computational and memory constraints. Compared with the state-of-the-art method LightGlue, the experimental results show that DALGlue obtains an 11.8% relative improvement in MMA. On the MegaDepth-1500 benchmark, DALGlue achieves AUC@5°/10°/20° values of 57.01, 73.00, and 84.11, respectively, which effectively improves matching precision.

Keywords: Unmanned aerial vehicles (UAVs), Visual navigation, Image matching, Deep neural network

Subject terms: Engineering, Mathematics and computing

Introduction

Unmanned Aerial Vehicles (UAVs) have gained significant traction in industrial inspection, disaster response and infrastructure maintenance1,2, owing to their operational flexibility and cost efficiency. A critical enabler for UAV autonomy lies in robust navigation systems. Traditional satellite-dependent solutions, however, face signal degradation in urban canyons or hostile environments3, necessitating alternative approaches like vision-based navigation4. In recent years, visual navigation technology based on image matching has become a new research direction5. Under satellite denial conditions, visual navigation exhibits significant advantages over traditional satellite navigation6. Unlike Global Positioning System(GPS), visual navigation based on image matching demonstrates superiority in obstacle-rich unknown terrains7. It can offer more reliable and efficient technical support for UAV flight, thus possessing an extremely broad prospect for research and application8.

In the field of image matching, scholars have extensively studied and used traditional feature point detection algorithms, such as SIFT, ORB, FAST, and SURF. These algorithms have limited anti-interference ability and high error rates in the presence of light interference or large angle rotation. In addition, for large and delicate image matching tasks, traditional feature extraction algorithms cannot aggregate contextual information well9.

With the rapid development of deep learning technology and computer vision, neural network-based image matching algorithms have gradually become mainstream and are significantly superior to traditional algorithms in many challenging scenarios. LoFTR10 is a detector-free matching method that introduces a coarse-to-fine correspondence prediction paradigm. Its advantage is that it directly extracts and matches features in image pairs without first detecting feature points, which makes it perform well in low-texture areas. However, LoFTR suffers from high computational and memory demands, making it less suitable for real-time deployment on resource-constrained platforms such as UAV. Furthermore, LoFTR also exhibits reduced generalization performance on datasets outside the training set. In 2018, Daniel DeTone et al. designed MagicPoint11, which uses the deep neural network VGG12 to construct a lightweight13,14 local feature point detection algorithm, providing a new idea for subsequent researchers to use deep learning models for image feature points extraction and matching. In the same year, DeTone et al. presented SuperPoint15 based on MagicPoint. The core of SuperPoint is to train feature points and feature descriptors in a self-supervised manner without manually annotating data. The algorithm uses homography transformation matrices12 to distort images to simulate scenes of different angles and scales, and has good performance. However, early image matching algorithms based on deep learning require large computing resources and time, and exhibit poor performance in multi-scale scenarios. In 2020, Sarlin et al. used the SuperGlue16 algorithm for image matching. The algorithm uses self-attention and cross-attention mechanisms to significantly improve the accuracy and robustness of matching. However, its computational complexity makes it unsuitable for deployment on UAVs. Lindenberger et al. designed the LightGlue17 algorithm in 2023. 
The innovation of LightGlue is to introduce a classifier for pruning after each feature matching layer, which can reduce the number of layers according to the difficulty of the input image pair and prune unmatched points. It is more efficient in terms of memory and computation, reducing the computational resources required for training. The combination of SuperPoint and LightGlue is also the main image feature point extraction and matching pipeline currently in use18. Visual navigation based on image matching excels in obstacle-rich unknown terrains but requires a highly reliable image matching algorithm, a requirement made difficult by texture scarcity and computational latency. For UAV visual navigation, image matching plays a pivotal role in enabling tasks such as visual odometry, simultaneous localization and mapping (SLAM), and geo-localization. Unlike general image matching tasks, navigation demands not only high accuracy but also exceptional robustness and real-time performance. First, accuracy is paramount, as small cumulative matching errors can lead to significant drift in the estimated pose or position19, affecting flight stability and mission success. Second, real-time performance is essential because onboard systems usually work under strict computational and memory constraints, and a long response time can delay obstacle avoidance decisions. Therefore, it is crucial to design image matching methods specifically for UAV visual navigation.

In this paper, we propose a novel image matching method for UAV visual navigation, named as DALGlue, based on convolutional neural network feature extraction algorithm and feature matching network with linear attention mechanism which can further improve the matching accuracy and response speed of UAV aerial images in challenging aerial scenes. The main innovations of our method are as follows:

  1. We introduce the dual-tree complex wavelet transform (DTCWT) in the image preprocessing section to process the UAV captured images. This is to tackle the issue of unclear edges of ground objects in aerial images during rapid movement or rotation. Enhancing the high-frequency components while keeping the low-frequency components unchanged can achieve better image matching performance.

  2. We employ the adaptive spatial feature fusion module (ASFF), a plug-and-play one, to enhance the model’s feature perception ability. It can perform weighted fusion of information from different feature maps, thus improving the model’s generalization ability and performance.

  3. We replace the traditional dot-product attention mechanism with the linear attention mechanism in the feature matching model. The linear attention mechanism outperforms its traditional counterpart in aggregating contextual information and has a substantially lower computational complexity, thus significantly enhancing the method’s response speed.

Methods and materials

To improve the matching accuracy and response speed of UAV visual navigation in challenging aerial scenes, this paper proposes the DALGlue method. The proposed architecture is illustrated in Fig. 1. DALGlue consists of three parts: a preprocessing module using the dual-tree complex wavelet transform, an image feature extraction module, and a local feature matching module based on the transformer20. Each part is described in detail below.

Fig. 1. The paradigm of DALGlue (image sourced from the University-1652 dataset21).

Preprocessing module

We employ the dual-tree complex wavelet transform (DTCWT) to process aerial images. This choice is based on the distribution of feature points in these images. Most feature points in an image are concentrated at boundaries where pixel values change significantly. In aerial images, keypoints tend to cluster at boundaries, specifically at the interfaces between buildings and the sky or ground, as well as within building structures where abrupt changes in texture or geometry occur. These boundaries correspond to high-frequency components, while the homogeneous regions of a building correspond to low-frequency components. The DTCWT is widely used in remote sensing. For instance, Sebai et al.22 utilized DTCWT-derived features for improved satellite image retrieval, and Ma et al. used the DTCWT to highlight ship wakes in SAR data. Leveraging the DTCWT, we amplify the high-frequency components, which represent edge information, while keeping the low-frequency components, which represent the overall structure of the building, unchanged.

Wavelet theory integrates mathematics and signal processing23 and provides a multi-scale, multi-resolution method for localized analysis. The Discrete Wavelet Transform (DWT) is often used to process signals and has shown excellent performance in image denoising, image restoration, and resolution enhancement. However, the DWT provides high-frequency subbands only in the horizontal, vertical, and diagonal directions, so its directional selectivity is limited. In addition, the DWT is not translation invariant: a slight shift in the input signal can cause large fluctuations in the wavelet coefficients, which is very unfavorable for extracting image features.

The DTCWT can be regarded as an improved version of the wavelet transform, which was proposed by Kingsbury et al. of Cambridge University in 1998. Based on the basic characteristics of wavelet transform such as multi-resolution analysis, the DTCWT has been expanded to address some limitations of wavelet transform. The definition of one-dimensional complex wavelet is as follows:

$$\psi(t) = \psi_h(t) + j\,\psi_g(t) \tag{1}$$

where $\psi_h(t)$ and $\psi_g(t)$ represent the real and imaginary parts of the complex wavelet, respectively, and are both real functions. The structure of the DTCWT is shown in Fig. 2:

Fig. 2. The structure of DTCWT.

The six high-frequency subbands describe the detailed information in the directions of ±15°, ±45°, and ±75°, respectively, which better reflects changes in different directions of the image at multiple scales. This property is called directional selectivity. In addition, the DTCWT has shift invariance, which allows the features of signals or images to be extracted more stably. Taking advantage of these characteristics, we use the DTCWT to process UAV aerial images, enhancing the edges of ground objects and improving matching accuracy.
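The idea of amplifying high-frequency coefficients while leaving the low-frequency band untouched can be illustrated with a much simpler transform. The sketch below is not the DTCWT: it substitutes a one-level, one-dimensional Haar decomposition purely for illustration. The function name `haar_enhance` is hypothetical, and the default gain of 1.2 mirrors the coefficient reported in the ablation section.

```python
def haar_enhance(signal, gain=1.2):
    """Scale the detail (high-frequency) part of each sample pair by `gain`.

    A one-level Haar transform is used as a stand-in for the DTCWT:
    low  = pairwise average (kept unchanged)
    high = pairwise half-difference (amplified)
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    enhanced = []
    for a, b in zip(signal[0::2], signal[1::2]):
        low = (a + b) / 2.0
        high = (a - b) / 2.0
        high *= gain                      # boost edges only
        enhanced.extend([low + high, low - high])
    return enhanced


flat = haar_enhance([5.0, 5.0, 5.0, 5.0])   # homogeneous region
edge = haar_enhance([0.0, 10.0])            # sharp transition
```

A flat region has zero detail coefficients and passes through unchanged, while the step between 0 and 10 is widened, which is the behaviour the preprocessing module relies on to sharpen blurred edges.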

Local feature extractor module

In the local feature extraction module, we apply a CNN-based24 network architecture to extract feature points and descriptors. For the encoder component, we introduce a new feature fusion strategy, adaptive spatial feature fusion (ASFF), to aggregate context information. The network structure is shown in Fig. 3:

Fig. 3. Illustration of the ASFF.

In computer vision, images of different scales usually contain different information. Shallow features often have higher resolution and contain more location and detail information, but they carry weaker semantics and more noise because they pass through fewer convolutional layers. High-level features generally carry stronger semantic information but have low resolution and poor perception of details. How to efficiently fuse feature maps of different scales is one of the challenges facing image processing. The Feature Pyramid Network (FPN) is one of the commonly used methods for handling scale variation in object detection. FPNs can produce fixed-size outputs for inputs of different sizes, avoid repeated extraction of image-related features by the convolutional neural network, and save computational cost25. In semantic segmentation, FPNs can likewise aggregate features from different regions, fully exploit global context information, and enhance the expressive power of global features26. However, a major disadvantage of the feature pyramid is its inconsistency across scales, especially for single-shot detectors. Specifically, when an object is assigned as positive at one level of the feature map, the corresponding area in the feature maps of other levels is treated as background. If an image contains both small and large objects, the conflict between features at different levels tends to dominate the feature pyramid. This inconsistency interferes with gradient computation during training and reduces the effectiveness of the feature pyramid.

To solve the above problems, we introduce the ASFF module for the encoder part. ASFF module spatially filters the conflicting information to suppress inconsistencies, so as to improve the scale invariance of features. It fuses the features of different layers by learning the weight parameters so that each spatial location can adaptively select the most useful features for prediction27. Specifically, for a certain level of feature, first integrate and adjust the features of other levels to the same resolution, and then find the optimal fusion through training. At each spatial location, features of different levels are adaptively fused, that is, some features may be filtered out because they carry contradictory information at that location, and some features dominate at that location because they carry more differentiated information. In summary, using the ASFF module in the encoder part of the local feature extraction network architecture can better capture image information.
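The per-location weighting described above can be sketched as follows. This is a minimal illustration, assuming the feature maps have already been resized to a common resolution and that the fusion weights at each location are obtained by a softmax over per-level logits. In a trained network the logits would come from learned layers; here they are passed in directly. The name `asff_fuse` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def asff_fuse(levels, weight_logits):
    """Fuse same-sized single-channel feature maps with per-location weights.

    levels:        list of L maps, each a list of rows of floats (H x W)
    weight_logits: list of L maps of the same shape holding fusion logits
    """
    height, width = len(levels[0]), len(levels[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weights = softmax([logit_map[y][x] for logit_map in weight_logits])
            fused[y][x] = sum(w * level[y][x] for w, level in zip(weights, levels))
    return fused


levels = [[[1.0]], [[2.0]], [[3.0]]]
equal = asff_fuse(levels, [[[0.0]], [[0.0]], [[0.0]]])       # plain average
dominant = asff_fuse(levels, [[[50.0]], [[0.0]], [[0.0]]])   # level 0 wins
```

With equal logits the fusion reduces to an average of the levels, and a large logit lets one level dominate at that location while the others are effectively filtered out.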

In the decoder part, as shown in Fig. 4, we compute the feature point locations and the descriptors separately and use them as the input to the local feature matching network.

Fig. 4. Diagram of the Decoder.

Local feature matching module

In the image feature matching module, a novel linear attention mechanism (LAM) is presented to replace the traditional scaled dot-product attention mechanism for contextual information aggregation. While the traditional scaled dot-product attention mechanism has demonstrated effectiveness in enhancing feature aggregation capabilities of neural networks28, its quadratic computational complexity introduces significant latency, particularly in resource-constrained scenarios. Using dot-product operations to calculate attention weights can lead to weight decay or explosion problems, especially when dealing with long-distance dependencies. And the exponential Softmax operation and large-scale matrix multiplication make it difficult to deploy the algorithm in UAV devices. Under satellite-denied conditions, UAV visual navigation systems demand exceptionally high response speeds to ensure real-time operation. The presented LAM addresses this limitation by reducing computational overhead while preserving the ability to capture long-range dependencies, thereby enabling efficient and robust feature matching under stringent timing constraints. The schematic diagram of the two attention mechanisms is shown in Fig. 5.

Fig. 5. Comparison of two attention mechanisms.

The calculation formula of the dot-product attention mechanism is shown in (2):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \tag{2}$$

The new attention mechanism uses lightweight SiLU instead of exponential Softmax, and cleverly uses the properties of matrix multiplication to greatly reduce the amount of calculation28. This allows it to be deployed in UAV devices with limited computing resources. The improved formula is defined as follows:

graphic file with name d33e468.gif 3

Figure 6 compares the computational complexity of the two mechanisms. The computational complexity of scaled dot-product attention is $O(n^2 d)$, while that of the linear attention mechanism is $O(n d^2)$, where $n$ is the sequence length and $d$ is the feature dimension. Traditional attention first computes the attention matrix $QK^{\top}$, which has size $n \times n$, leading to $O(n^2 d)$ complexity when it is multiplied by $V$. In contrast, the linear attention mechanism computes $K^{\top}V$ first, resulting in a $d \times d$ matrix, which leads to $O(n d^2)$ complexity when it is subsequently multiplied by $Q$. Typically, $n$ is larger than $d$. This is common in tasks involving long sequences, such as image processing, where the sequence length can reach thousands of tokens while $d$ remains relatively small (e.g., 128 or 512). In such cases, $O(n d^2)$ is far smaller than $O(n^2 d)$, which is a significant advantage.
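The reordering argument can be checked numerically. The sketch below assumes that SiLU is applied element-wise to Q and K as a feature map and that no further normalization is used; the exact form of the authors' Eq. (3) may differ. All function names are hypothetical. Because no softmax couples the rows, matrix multiplication is associative, so both evaluation orders give the same output and only the cost changes.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def silu(m):
    """Element-wise SiLU: x * sigmoid(x)."""
    return [[v / (1.0 + math.exp(-v)) for v in row] for row in m]


def linear_order(q, k, v):
    """Compute phi(Q) (phi(K)^T V): the d x d product is formed first."""
    kv = matmul(transpose(silu(k)), v)          # d x d
    return matmul(silu(q), kv)                  # n x d


def quadratic_order(q, k, v):
    """Compute (phi(Q) phi(K)^T) V: the n x n matrix is formed first."""
    scores = matmul(silu(q), transpose(silu(k)))  # n x n
    return matmul(scores, v)                      # n x d


def multiply_counts(n, d):
    """Scalar multiplications needed by each evaluation order."""
    return {"quadratic": 2 * n * n * d, "linear": 2 * n * d * d}


q = [[0.5, -1.0], [1.5, 2.0], [0.0, 0.3]]
k = [[1.0, 0.2], [-0.4, 0.9], [0.7, -0.1]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out_linear = linear_order(q, k, v)
out_quadratic = quadratic_order(q, k, v)
counts = multiply_counts(n=2048, d=256)
```

For n = 2048 keypoints and d = 256 channels, the quadratic order needs n/d = 8 times as many multiplications as the linear order, which is the gap the complexity comparison describes.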

Fig. 6. Comparison of computational complexity.

After obtaining the matching points, we use the Sinkhorn algorithm to calculate the correspondence between the key points of the image pairs. This converts the matching problem into an optimal transport problem, which is efficient and scalable when processing large numbers of keypoints.
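A simplified version of the Sinkhorn iteration is sketched below. It assumes a square score matrix and alternates row and column normalization of the exponentiated scores. Practical matchers typically work in the log domain and add a dustbin row and column for unmatched points, which this sketch omits. The function name `sinkhorn` and the example scores are illustrative only.

```python
import math


def sinkhorn(scores, iterations=100):
    """Turn a square score matrix into an approximately doubly stochastic one."""
    p = [[math.exp(s) for s in row] for row in scores]
    for _ in range(iterations):
        # Row normalization: every keypoint in image A distributes mass 1.
        p = [[value / sum(row) for value in row] for row in p]
        # Column normalization: every keypoint in image B receives mass 1.
        col_sums = [sum(col) for col in zip(*p)]
        p = [[value / c for value, c in zip(row, col_sums)] for row in p]
    return p


scores = [
    [3.0, 0.5, 0.1],
    [0.2, 2.5, 0.3],
    [0.1, 0.4, 2.0],
]
assignment = sinkhorn(scores)
matches = [max(range(3), key=lambda j: row[j]) for row in assignment]
```

Each row of `assignment` can be read as a soft assignment of one keypoint to the keypoints of the other image; taking the largest entry per row gives a hard match.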

Datasets introduction

We use the University-1652 dataset and the COCO2017 dataset29 to jointly train the model. The University-1652 dataset collects satellite and drone-view images of 1652 buildings from 72 universities around the world, including both urban and suburban campuses, and thus provides diverse building layouts and background complexity. It is the first drone-based geolocation dataset and has, to a certain extent, promoted the development of drone target positioning and visual navigation. The capture height of the images gradually increases from 121.5 m to 256 m, so the dataset shows buildings at different drone heights and perspectives, which is very helpful for training our model. The University-1652 dataset contains a total of 37,856 drone-view images, all of size 512 × 512. In our experiment, we use 30,000 images as the training set, applying random homography transformations to simulate different perspectives for image matching training, and the remaining 7856 images as the validation set. Fig. 7 shows some examples from the University-1652 dataset:

Fig. 7. University-1652 drone dataset example (image sourced from the University-1652 dataset21).

In order to improve the performance of the model in different scenarios, we also use the COCO2017 dataset. The images in this dataset come from daily life scenes, cover various complex object combinations and include multiple weather conditions or lighting environments. In our experiment, we use 118,000 images to simulate different perspectives through random homography matrices for unsupervised training, and 18,000 images to generate the corresponding homography transformation matrix as verification. In order to ensure consistency during training and verification, we crop all images in the COCO2017 dataset to a size of 480*640. This cropping process standardizes the entire dataset. The following Fig. 8 shows some examples from the COCO2017 dataset:
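One common way to generate such random homographies is to perturb the four image corners and solve for the 3 × 3 matrix that maps the original corners to the perturbed ones. The sketch below follows that approach. The maximum corner shift of 15% and all function names are assumptions for illustration, not the authors' settings.

```python
import random


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [rv - factor * cv for rv, cv in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_points(src, dst):
    """Homography (with h33 fixed to 1) mapping four src points to four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def random_homography(width, height, max_shift=0.15, rng=None):
    """Perturb each corner by up to max_shift of the image size (assumed value)."""
    rng = rng or random.Random(0)
    src = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    dst = [(x + rng.uniform(-max_shift, max_shift) * width,
            y + rng.uniform(-max_shift, max_shift) * height) for x, y in src]
    return homography_from_points(src, dst), src, dst


h, src, dst = random_homography(512.0, 512.0, rng=random.Random(0))
warped = [apply_homography(h, p) for p in src]
```

Warping an image with `h` yields a second view whose ground-truth correspondences are known exactly, which is what makes training without manual annotation possible.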

Fig. 8. COCO2017 dataset example (example image from the COCO2017 dataset29).

Training environment

All programs in this experiment are written in Python, and the network is implemented in the open-source framework PyTorch (version 2.1.2). To improve computing efficiency, CUDA 11.8 from NVIDIA is used to run the neural network computations on the GPU under the Ubuntu 22.04 Linux system. For the hardware, we rented an NVIDIA RTX 3090 graphics card with 24 GB of video memory on the AutoDL platform, together with an AMD EPYC 7642 48-core CPU. This hardware and software configuration meets the training requirements of the experimental model. To prevent exploding or vanishing gradients, we adjust the learning rate dynamically. For the first 20 epochs, the learning rate is set to 0.0001. From the 21st to the 40th epoch, the learning rate decays to 1/10 of the original value every 10 epochs. The specific training configuration is shown in Table 1:

Table 1. Specific training parameters.

| Training parameter | Setting |
| --- | --- |
| Epochs | 40 |
| Batch size | 16 |
| Image size | 640 × 480 or 512 × 512 |
| Learning rate | 1e-4 for epochs 1 to 20; decayed to 1/10 every 10 epochs for epochs 21 to 40 |
| Num_workers | 14 |
| Seed | 0 |
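The schedule can be written as a small function. The sketch below assumes that the first decay takes effect at epoch 21 and that the decays compound, giving 1e-5 for epochs 21 to 30 and 1e-6 for epochs 31 to 40. The text leaves both points open, so this is one possible reading, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, constant_epochs=20, step=10, factor=0.1):
    """Learning rate for a 1-indexed epoch under the assumed step schedule."""
    if epoch < 1:
        raise ValueError("epochs are 1-indexed")
    if epoch <= constant_epochs:
        return base_lr
    # Number of decays applied so far: 1 for epochs 21-30, 2 for epochs 31-40.
    decays = (epoch - constant_epochs - 1) // step + 1
    return base_lr * factor ** decays


schedule = {epoch: learning_rate(epoch) for epoch in (1, 20, 21, 30, 31, 40)}
```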

Experiments

Evaluation metrics

We evaluate the accuracy of homographies estimated from the correspondences using the robust solver RANSAC30 and the non-robust solver DLT31. For each image pair, we report the area under the cumulative error curve (AUC) at 5°, 10°, and 20°. The model is evaluated by pose accuracy (the percentage of poses whose error is within 5°, 10°, or 20°), where the poses are derived from the estimated correspondences using RANSAC, and the relative rotation error between the predicted and ground-truth rotation matrices is computed with Rodrigues' formula. Like most image matching work, we use the MegaDepth-1500 dataset32 for model evaluation. MegaDepth is a dataset for single-view depth prediction that includes 196 different locations reconstructed with COLMAP SfM/MVS33. It contains not only images but also the camera intrinsic parameters of each image and the positional relationship between matched pairs, which provides support for image matching.
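The two quantities above can be sketched in a few lines. The rotation error uses the trace identity that follows from Rodrigues' formula, and the AUC is computed with the trapezoidal rule over the cumulative recall curve. That convention is widely used in public evaluation code, but whether it matches the authors' exact protocol is an assumption, and the function names are hypothetical.

```python
import math


def rotation_error_deg(r_pred, r_gt):
    """Angle of the relative rotation R_pred^T R_gt, in degrees."""
    # trace(A^T B) equals the sum of element-wise products of A and B.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def pose_auc(errors, threshold):
    """Area under the recall-vs-error curve up to `threshold`, normalized to [0, 1]."""
    ordered = sorted(errors)
    xs, ys = [0.0], [0.0]
    for index, err in enumerate(ordered):
        if err > threshold:
            break
        xs.append(err)
        ys.append((index + 1) / len(ordered))
    xs.append(threshold)
    ys.append(ys[-1])
    area = sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0 for i in range(len(xs) - 1))
    return area / threshold


angle = math.radians(10.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z = [[math.cos(angle), -math.sin(angle), 0.0],
         [math.sin(angle), math.cos(angle), 0.0],
         [0.0, 0.0, 1.0]]
error = rotation_error_deg(identity, rot_z)
auc_single = pose_auc([2.5], threshold=5.0)
```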

Mean matching accuracy (MMA) is the percentage of correct matches among all predicted matches. The key points in the i-th query image are projected into the reference image using the homography matrix $H_i$, and matches with a projection error lower than a predetermined threshold $\varepsilon$ are considered correct. MMA is calculated as follows:

$$\mathrm{MMA} = \frac{1}{1500}\sum_{i=1}^{1500}\frac{1}{N}\sum_{j=1}^{N}\mathbb{1}\left(\varepsilon - \left\lVert H_i(p_{ij}) - \hat{p}_{ij}\right\rVert_2\right) \tag{4}$$

where 1500 is the number of image pairs in the MegaDepth-1500 dataset, $N$ is the number of predicted matches, $\mathbb{1}(\cdot)$ is the indicator function, which outputs 1 if its argument is non-negative and 0 otherwise, $\varepsilon$ is the threshold on the reprojection error in pixels, $H_i(\cdot)$ transforms the key points in the i-th query image to the reference image through the homography matrix, and $p_{ij}$ and $\hat{p}_{ij}$ are the pixel coordinates of the j-th match in the query and reference images of the i-th image pair. In our experiments, correspondences with a reprojection error lower than 3 pixels are deemed correct.
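The metric can be sketched directly from this description. The example below assumes each image pair is given as a homography plus a list of (query point, reference point) matches, and it counts a match as correct when the error does not exceed the threshold, following the indicator convention. The function names and the toy data are hypothetical.

```python
import math


def project(h, point):
    """Apply a 3 x 3 homography to a 2D point."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def matching_accuracy(h, matches, eps=3.0):
    """Fraction of matches whose reprojection error is within eps pixels."""
    if not matches:
        return 0.0
    correct = sum(1 for p, q in matches if math.dist(project(h, p), q) <= eps)
    return correct / len(matches)


def mean_matching_accuracy(pairs, eps=3.0):
    """Average the per-pair accuracy over all image pairs."""
    return sum(matching_accuracy(h, m, eps) for h, m in pairs) / len(pairs)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift_x = [[1.0, 0.0, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pairs = [
    (identity, [((0.0, 0.0), (0.0, 0.0)), ((10.0, 10.0), (20.0, 20.0))]),  # 1 of 2 correct
    (shift_x, [((1.0, 1.0), (6.0, 1.0)), ((2.0, 2.0), (8.0, 2.0))]),       # 2 of 2 correct
]
mma = mean_matching_accuracy(pairs)
```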

When evaluating image matching models, response speed is a critical metric, particularly in UAV applications where computing resources are limited and real-time performance is essential. The response time is defined as the interval from feeding an image pair into the DALGlue model, through image feature extraction, unmatched point pruning, and feature point matching, to the output of the matching result. Response times are averaged over a set of image pairs. When the matching accuracy is comparable among models, a shorter response time indicates better performance. Under GPU conditions, we measured response times using 512, 1024, 2048, and 4096 feature points, whereas under CPU conditions, we used 128, 256, 512, and 1024 feature points. All time measurements are reported in milliseconds.
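An averaging loop of this kind might look like the sketch below. `match_fn` is a hypothetical callable standing in for the full pipeline from image input to match output, and the warm-up run is an assumption rather than part of the stated protocol. On a GPU, asynchronous kernels would also have to be synchronized before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch).

```python
import time


def mean_response_ms(match_fn, image_pairs, warmup=1):
    """Average wall-clock time per image pair, in milliseconds."""
    for pair in image_pairs[:warmup]:
        match_fn(*pair)                 # untimed warm-up (caches, lazy init)
    start = time.perf_counter()
    for pair in image_pairs:
        match_fn(*pair)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(image_pairs)


calls = []


def dummy_matcher(image_a, image_b):
    """Placeholder for a real matcher; it only records that it was called."""
    calls.append((image_a, image_b))
    return []


latency = mean_response_ms(dummy_matcher, [("a0", "b0"), ("a1", "b1"), ("a2", "b2")])
```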

Comparative experiments

To verify the effectiveness of the proposed DALGlue method, we compare it with several common image feature extraction and matching algorithms: the traditional feature extraction algorithm SIFT, the semi-dense matching algorithm LoFTR, and the deep learning-based SuperGlue, OmniGlue34, and LightGlue. When evaluating SIFT, we use SuperGlue for matching. OmniGlue and LightGlue both use SuperPoint for feature extraction. The remaining experimental settings are identical. Table 2 shows the evaluation results on the MegaDepth-1500 dataset.

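Table 2. Relative pose estimation on MegaDepth-1500.

| Method | RANSAC AUC@5° (%) | RANSAC AUC@10° (%) | RANSAC AUC@20° (%) | Precision (%) | Recall (%) |
| --- | --- | --- | --- | --- | --- |
| LoFTR | 52.8 | 69.2 | 81.2 | - | - |
| Efficient LoFTR35 | 56.4 | 72.2 | 83.5 | - | - |
| SIFT | 23.68 | 36.44 | 49.44 | - | - |
| SuperGlue | 34.18 | 50.32 | 64.16 | 74.6 | 90.5 |
| OmniGlue34 | 47.4 | 65.0 | 77.8 | 82.1 | 95.3 |
| LightGlue | 47.83 | 65.48 | 79.04 | 86.8 | 96.3 |
| DALGlue | 57.01 | 73.0 | 84.11 | 87.2 | 97.5 |
| Relative gain over LightGlue | 19.1% | 11.4% | 6.4% | - | - |
| ROMA36 (Dense) | 62.6 | 76.7 | 86.3 | - | - |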

Table 2 shows that DALGlue improves on the other deep learning-based algorithms to varying degrees. To assess pose accuracy, we employ the RANSAC algorithm and report AUC values at 5°, 10°, and 20°. LoFTR, which uses self-attention and cross-attention layers to process dense local features extracted by convolutional networks, obviates the need to detect keypoints and descriptors in advance. Under RANSAC evaluation, LoFTR achieves AUC values of 52.8, 69.2, and 81.2 at 5°, 10°, and 20°, respectively. In contrast, SuperGlue, which relies on a neural network for feature extraction and matching, attains AUC values of 34.18, 50.32, and 64.16, respectively. LightGlue, which adds a pruning layer to SuperGlue to discard unmatched points for better computational efficiency, yields AUC values of 47.83, 65.48, and 79.04, respectively. Notably, DALGlue improves AUC@5°, AUC@10°, and AUC@20° by 19.1%, 11.4%, and 6.4% relative to LightGlue, and by 7.9%, 5.4%, and 3.5% relative to LoFTR, underscoring its competitiveness as an image matching algorithm. The dense matcher ROMA shows remarkable matching capability, with AUC values reaching 62.6, 76.7, and 86.3, respectively. Even when compared with ROMA, DALGlue remains highly competitive.
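The relative gains quoted above follow from the Table 2 values as (new − baseline) / baseline. The short check below reproduces them; the variable names are illustrative, and the reported percentages appear to be truncated rather than rounded.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


lightglue_auc = (47.83, 65.48, 79.04)   # AUC@5, @10, @20 from Table 2
dalglue_auc = (57.01, 73.00, 84.11)

gains = [relative_gain(new, base) for new, base in zip(dalglue_auc, lightglue_auc)]
# gains is approximately [19.19, 11.48, 6.41], reported as 19.1%, 11.4%, 6.4%
```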

Furthermore, we evaluate the accuracy of homography estimation derived from the correspondences using a non-robust solver, specifically the weighted DLT31. As detailed in Table 3, DALGlue achieves AUC values of 37.3 and 79.9 under the @1px and @5px conditions, representing improvements of 6.2% and 2.9% over LightGlue. Although DALGlue performs slightly lower than LoFTR under the @1px condition, it surpasses other algorithms in all remaining metrics. In summary, DALGlue enhances image matching accuracy and demonstrates robust performance across diverse scenarios, thereby effectively supporting UAV visual navigation applications.

Table 3. Homography estimation on HPatches.

| Method | AUC-DLT @1px | AUC-DLT @5px |
| --- | --- | --- |
| LoFTR | 38.5 | 70.6 |
| SuperGlue | 32.1 | 75.7 |
| LightGlue | 35.1 | 77.6 |
| DALGlue | 37.3 | 79.9 |
| Relative gain over LightGlue | 6.2% | 2.9% |

Significant values are in bold.

To evaluate model performance under limited computing resources, we measure the processing times of DALGlue and several comparative models on an NVIDIA RTX3080 GPU (see Fig. 9). Using an aerial photography dataset, we fix the number of extracted feature points at 512, 1024, 2048, and 4096 and compute the average processing time for matching. The SuperGlue algorithm requires 41.2, 40.7, 86.5, and 297.2 ms, respectively. In its “SuperGlue-fast” variant, the number of Sinkhorn iterations is limited to five, which accelerates the model at the expense of some matching accuracy; its response times are 25.1, 25.1, 45.5, and 144.8 ms. The LightGlue-full algorithm performs matching directly without prior pruning of unmatched points, yielding response times of 28.3, 28.6, 31.2, and 87.7 ms. In contrast, LightGlue-adaptive, which incorporates an adaptive pruning strategy, achieves response times of 13.3, 13.4, 13.4, and 21.9 ms. By pruning unmatched points in advance, the proposed DALGlue method further improves response speed, registering processing times of 8.7, 9.0, 9.7, and 23.3 ms. Notably, DALGlue outperforms LightGlue-adaptive at 512, 1024, and 2048 feature points, and at 4096 feature points its processing time exceeds that of LightGlue-adaptive by only 1.4 ms, demonstrating its excellent responsiveness.

Furthermore, we evaluate the performance of these models in a CPU environment (see Fig. 10). For a fixed set of 128, 256, 512, and 1024 feature points, the SuperGlue algorithm requires 201.7, 423.1, 1141.1, and 3590.2 ms, respectively. The SuperGlue-fast variant reduces these times to 187.1, 417.9, 963.4, and 2808.2 ms. The LightGlue-full model yields processing times of 208.1, 398.7, 937.1, and 2841.9 ms, while the LightGlue-adaptive model achieves 123.1, 177.2, 400.1, and 1087.1 ms. Owing to the linear attention mechanism, the proposed DALGlue method records CPU processing times of 113.0, 171.1, 381.9, and 1154.7 ms, which are lower than those of LightGlue-adaptive at 128, 256, and 512 feature points. Overall, DALGlue is substantially faster than SuperGlue and competitive with LightGlue, requiring about 6.2% more time than LightGlue-adaptive at 1024 feature points (1154.7 ms versus 1087.1 ms). In addition, visualizations of the image matching results for SuperGlue, LightGlue, and DALGlue are provided to illustrate the differences in matching performance among these algorithms. Selected matching results are presented in Fig. 11.

Fig. 9. Algorithm latency comparison on GPU.

Fig. 11. Qualitative image matching comparison. (a) SuperGlue. (b) LightGlue. (c) DALGlue (image sourced from the University-1652 dataset21).

In Fig. 11, (a) shows the SuperGlue algorithm, (b) the LightGlue algorithm, and (c) the proposed DALGlue method. The key parameters were kept identical across all methods during the experiment. Correct matches are drawn as green lines and mismatches as red lines. The visualization shows that DALGlue offers superior performance for aerial image matching in UAV applications. Specifically, compared with SuperGlue, DALGlue significantly reduces the number of incorrect matches, thereby enhancing the reliability of the correspondences. Moreover, compared with LightGlue, DALGlue achieves a higher number of accurate matches while employing an equal or even smaller number of network layers. This increased density of valid matches is instrumental in computing a more precise rotation matrix, which in turn improves the accuracy of the estimated UAV position.

Ablation experiments

To evaluate the contribution of each module to the performance of the DALGlue method, we conduct ablation experiments on each module. The SuperPoint + LightGlue model is used as the baseline; ablation experiments are then performed on each module, and the results are finally compared with the full DALGlue model. As before, we use the RANSAC algorithm on the MegaDepth-1500 dataset to evaluate the AUC and MMA (%) metrics. The experimental results are shown in Table 4. DTCWT denotes preprocessing the input image pair with the dual-tree complex wavelet transform using 3 decomposition levels, where the coefficient applied to the Value channel is 1.2×. ASFF denotes the use of adaptive spatial feature fusion for feature extraction, and LAM denotes the use of the linear attention mechanism to match feature points. As shown in Table 4, the different modules of the network are well integrated.

Table 4.

Ablation study on RANSAC homography estimation.

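| Model | DTCWT | ASFF | LAM | MMA% | RANSAC AUC@5° | RANSAC AUC@10° | RANSAC AUC@20° |
|---|---|---|---|---|---|---|---|
| LightGlue | | | | 64.1 | 47.83 | 65.48 | 79.04 |
| Model A | | ✓ | | 67.7 | 53.32 | 69.02 | 80.83 |
| Model B | | | ✓ | 65.4 | 49.22 | 66.83 | 80.02 |
| Model C | ✓ | | ✓ | 70.1 | 56.38 | 71.4 | 82.57 |
| Model D | ✓ | ✓ | | 70.4 | 55.9 | 72.28 | 83.18 |
| Model E | | ✓ | ✓ | 68.0 | 54.03 | 69.29 | 80.86 |
| Model F | ✓ | | | 66.8 | 51.43 | 68.34 | 80.64 |
| DALGlue (full) | ✓ | ✓ | ✓ | 71.7 | 57.01 | 73.0 | 84.11 |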

Significant values are in bold.

Model A employs the adaptive spatial feature fusion module to aggregate multi-scale feature map information, which improves overall performance. The module dynamically adjusts the fusion strategy at each spatial location, so that long-range dependencies and complex spatial structures are captured more accurately. In contrast, conventional convolutional neural networks use fixed-size convolution kernels with predetermined receptive fields, which limits their capacity to capture features across varying scales and positions, especially in challenging aerial scenes. Model A achieves an MMA of 67.7% and RANSAC AUC values of 53.32, 69.02, and 80.83 at the 5°, 10°, and 20° thresholds, respectively.

Model B introduces the linear attention mechanism, which is designed to capture long-term dependencies in extended sequences. The mechanism replaces the traditional Softmax [37] function with a learnable weight matrix, preserving the model’s expressiveness. Model B achieves an MMA of 65.4% and RANSAC AUC values of 49.22, 66.83, and 80.02 at 5°, 10°, and 20°, respectively.

Model C uses the dual-tree complex wavelet transform to preprocess the input image in conjunction with the linear attention mechanism for context aggregation. This raises the MMA to 70.1%, a 9.3% relative improvement over the baseline LightGlue model. Model D extends Model A by adding the dual-tree complex wavelet transform preprocessing step. It achieves an MMA of 70.4% and RANSAC AUC values of 55.9, 72.28, and 83.18 at the corresponding angular thresholds, which are relative improvements of 16%, 10.3%, and 5.2% over LightGlue. The close performance of Model C and Model D indicates that the two modules improve the system from different perspectives: LAM mainly optimizes the attention mechanism, while ASFF strengthens multi-scale feature representation.

Model E combines adaptive spatial feature fusion with the linear attention mechanism, yielding RANSAC AUC values of 54.03, 69.29, and 80.86 at 5°, 10°, and 20°, respectively. Overall, compared with the baseline LightGlue model, the proposed DALGlue method shows a notable improvement, achieving an MMA of 71.7% (an 11.8% relative increase) and RANSAC AUC values of 57.01, 73.00, and 84.11 at the 5°, 10°, and 20° thresholds, respectively. The effective integration of these modules underpins its robust performance on the MegaDepth-1500 image matching dataset.

Comparing Model B with Model E shows that the ASFF, which integrates information at different scales, greatly improves matching accuracy. Comparing Model A with Model D highlights the important role of the DTCWT in enhancing edge information. The LAM is designed to reduce matching time and improve matching efficiency. Consequently, using the LAM alone improves the overall performance only slightly (a 2% relative increase in MMA), but it yields significant gains when combined with the DTCWT and ASFF.
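The percentages quoted in this ablation are relative changes with respect to the LightGlue baseline, not differences in percentage points. The short Python sketch below makes that arithmetic explicit using only values from Table 4; the helper name `relative_gain` is ours and is not part of the DALGlue code.

```python
# Values copied from Table 4: (MMA%, AUC@5°, AUC@10°, AUC@20°).
TABLE4 = {
    "LightGlue": (64.1, 47.83, 65.48, 79.04),
    "Model B": (65.4, 49.22, 66.83, 80.02),
    "Model C": (70.1, 56.38, 71.40, 82.57),
    "DALGlue": (71.7, 57.01, 73.00, 84.11),
}


def relative_gain(value, baseline):
    """Relative change of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


base = TABLE4["LightGlue"]
for name in ("Model B", "Model C", "DALGlue"):
    mma = TABLE4[name][0]
    points = mma - base[0]  # difference in percentage points
    rel = relative_gain(mma, base[0])  # relative change in percent
    print(f"{name}: +{points:.1f} points MMA, {rel:.1f}% relative")
```

For DALGlue, the MMA difference of 7.6 percentage points corresponds to a relative gain of about 11.86%, which the text quotes as 11.8%.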

Figure 12 illustrates the average processing time required by different models to extract and match 512, 1024, 2048, and 4096 feature points on an NVIDIA RTX3080 GPU using an aerial photography dataset. Specifically, the LightGlue model requires 18.1, 17.9, 19.8, and 24.5 ms, respectively. Notably, although an increase in the number of feature points generally leads to longer processing times, the processing time for 512 feature points is marginally higher than that for 1024 feature points, which is attributed to the increased computational cost of pruning unmatched points.

Fig. 12.

Ablation study on latency.

Both Model A and Model D incorporate the adaptive spatial feature fusion module, which dynamically selects the most informative features at each spatial location. This mechanism lets the models prioritize feature levels based on local context and scale, thereby enhancing matching accuracy and reducing processing time. Model A requires 13.9, 13.9, 15.8, and 24.5 ms, whereas Model D requires 14.1, 13.8, 15.4, and 24.1 ms, respectively. The slightly higher processing time of Model D at 512 feature points is due to the additional preprocessing performed by the dual-tree complex wavelet transform module, which, despite its extra computational cost, yields a significant improvement in accuracy at relatively low latency.

Models B, C, and E use the linear attention mechanism module, which offers high computational efficiency and low resource consumption and therefore makes these models particularly suitable for UAV applications with limited computational resources. As shown in Fig. 10, the average matching time of models employing the linear attention mechanism is substantially lower than that of the other models. These models require no more than 10 ms to extract and match 2048 feature points and no more than 23.3 ms for 4096 feature points. The full DALGlue model integrates the dual-tree complex wavelet transform, adaptive spatial feature fusion, and linear attention mechanism modules, achieving average matching times of 8.7, 9.0, 9.7, and 23.3 ms, respectively, while providing higher matching accuracy and satisfying the low-latency requirements of UAV visual navigation.
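To illustrate why linear attention is cheaper than Softmax attention, the following pure-Python sketch implements a generic kernelized linear attention. It is not the authors' LAM: the feature map `elu(x) + 1` and all function names are assumptions chosen for illustration. The key-value summary is computed once and reused for every query, so the cost grows linearly with the number of keypoints, whereas the reference version builds an explicit N × N weight matrix.

```python
import math


def feature_map(x):
    """phi(x) = elu(x) + 1, a strictly positive feature map (assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernelized attention with cost linear in the number of keys."""
    d, dv = len(keys[0]), len(values[0])
    # Summaries shared by all queries:
    #   kv = sum_j phi(k_j) v_j^T  (d x dv),  z = sum_j phi(k_j)  (d)
    kv = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(keys, values):
        pk = feature_map(k)
        for a in range(d):
            z[a] += pk[a]
            for b in range(dv):
                kv[a][b] += pk[a] * v[b]
    out = []
    for q in queries:
        pq = feature_map(q)
        norm = sum(pq[a] * z[a] for a in range(d))
        out.append([sum(pq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(dv)])
    return out


def quadratic_reference(queries, keys, values):
    """Same result computed with an explicit weight for every query-key pair."""
    dv = len(values[0])
    phi_keys = [feature_map(k) for k in keys]
    out = []
    for q in queries:
        pq = feature_map(q)
        w = [sum(a * b for a, b in zip(pq, pk)) for pk in phi_keys]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, values)) / total
                    for b in range(dv)])
    return out
```

Both functions return the same values up to floating-point rounding; only the order of the multiplications differs, which is where the saving for large keypoint sets comes from.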

Fig. 10.

Algorithm latency comparison on CPU.

The modules also act synergistically. The three modules in DALGlue (DTCWT, ASFF, and LAM) form a complementary pipeline that improves both the accuracy and efficiency of UAV image matching. Specifically, the DTCWT module serves as a preprocessing stage that decomposes images into multi-directional representations, enhancing high-frequency edge details. The ASFF module then adaptively fuses features from multiple scales by assigning spatially varying weights, ensuring that both local textures and broader structural patterns are preserved. Finally, the LAM module efficiently models long-range dependencies among keypoints with reduced computational complexity, and its performance benefits from the high-quality features it receives.
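The spatially varying weighting performed by the ASFF module can be sketched as a per-pixel softmax over feature levels. The code below is a simplified, single-channel illustration under our own assumptions: the feature maps are taken to be already resized to a common resolution, and the fusion scores, which a network would learn, are passed in directly. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(feature_maps, weight_logits):
    """Fuse several same-sized H x W maps with per-pixel softmax weights.

    feature_maps[l][y][x]  : feature value of level l (already resized)
    weight_logits[l][y][x] : unnormalized fusion score of level l
                             (learned in practice, supplied here)
    """
    levels = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Weights sum to 1 at every pixel and may differ between pixels.
            w = softmax([weight_logits[l][y][x] for l in range(levels)])
            fused[y][x] = sum(w[l] * feature_maps[l][y][x]
                              for l in range(levels))
    return fused
```

With equal scores the result is the plain average of the levels; a large score at one level makes the fused value at that pixel follow that level almost exclusively.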

Homography estimation on University-1652

We used the University-1652 dataset to simulate UAV visual navigation scenarios. For each input image, we generated a random homography matrix that combines rotation, translation, scaling, and perspective transformation to simulate the viewpoint changes encountered in UAV flight. Figure 13 shows some of the samples used for homography estimation; in each pair, the left image is the original and the right image is the result of the rotation, translation, scaling, and perspective transformation.

Fig. 13.

Examples for homography estimation (original images come from the University-1652 dataset [21]).
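A random homography of the kind described above can be composed from elementary 3 × 3 matrices. The sketch below is one possible construction, not the authors' sampling procedure: the parameter ranges are assumptions picked for illustration, and the function names are ours.

```python
import math
import random


def matmul3(a, b):
    """Product of two 3 x 3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def random_homography(width, height, rng):
    """Compose rotation, scaling, translation and a mild perspective warp."""
    angle = math.radians(rng.uniform(-30.0, 30.0))  # assumed range
    scale = rng.uniform(0.8, 1.2)                   # assumed range
    tx = rng.uniform(-0.1, 0.1) * width             # assumed range
    ty = rng.uniform(-0.1, 0.1) * height            # assumed range
    px = rng.uniform(-1e-4, 1e-4)                   # assumed perspective terms
    py = rng.uniform(-1e-4, 1e-4)
    cx, cy = width / 2.0, height / 2.0
    # Rotate and scale about the image centre rather than the top-left corner.
    to_origin = [[1.0, 0.0, -cx], [0.0, 1.0, -cy], [0.0, 0.0, 1.0]]
    from_origin = [[1.0, 0.0, cx], [0.0, 1.0, cy], [0.0, 0.0, 1.0]]
    c, s = scale * math.cos(angle), scale * math.sin(angle)
    similarity = [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]
    perspective = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [px, py, 1.0]]
    h = matmul3(from_origin,
                matmul3(perspective, matmul3(similarity, to_origin)))
    # Normalize so that the bottom-right entry equals 1.
    return [[v / h[2][2] for v in row] for row in h]


def warp_point(h, x, y):
    """Apply homography h to the pixel (x, y)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Passing a seeded generator such as `random.Random(0)` makes the sampled transformation reproducible, which is convenient when the same warped pairs must be reused across the compared methods.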

We then used the point correspondences predicted by each algorithm to estimate the homography and evaluated its accuracy by computing the pixel-level reprojection error against the ground-truth homography. The cumulative matching accuracy at thresholds of 5°, 10°, and 20° was computed, and the area under this curve (AUC) is reported as the performance metric. All evaluation parameters are consistent with those in Part B. The evaluation results are shown in Table 5. The table shows that DALGlue consistently improves over both the SuperGlue and LightGlue baselines on the UAV vision dataset.

Table 5.

Relative pose estimation on University-1652.

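| Method | RANSAC AUC@5° (%) | RANSAC AUC@10° (%) | RANSAC AUC@20° (%) | Precision (%) | Recall (%) | Parameters (M) |
|---|---|---|---|---|---|---|
| SuperGlue | 23.42 | 33.67 | 46.69 | 76.8 | 89.9 | 11.8 |
| LightGlue | 28.93 | 47.38 | 69.08 | 83.1 | 92.6 | 10.6 |
| DALGlue | 34.85 | 53.94 | 73.69 | 84.66 | 94.07 | 10.2 |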
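The evaluation behind Table 5 combines a per-image error with an area-under-curve summary. The sketch below shows one common way to compute both: the mean corner reprojection error between an estimated and a ground-truth homography, and the AUC of the cumulative error curve up to each threshold. It is our assumption that the paper's metric follows this widely used recipe; the function names are hypothetical, and thresholds are treated simply as numbers in the same unit as the errors.

```python
import bisect
import math


def warp_point(h, x, y):
    """Apply a 3 x 3 homography (nested lists) to the pixel (x, y)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def corner_error(h_est, h_gt, width, height):
    """Mean distance between the image corners warped by both homographies."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    total = 0.0
    for x, y in corners:
        ex, ey = warp_point(h_est, x, y)
        gx, gy = warp_point(h_gt, x, y)
        total += math.hypot(ex - gx, ey - gy)
    return total / len(corners)


def error_auc(errors, thresholds):
    """Normalized area under the cumulative-accuracy curve per threshold."""
    xs = [0.0] + sorted(errors)
    n = len(errors)
    recall = [0.0] + [(i + 1) / n for i in range(n)]
    aucs = []
    for t in thresholds:  # each threshold must be positive
        last = bisect.bisect_left(xs, t)
        # Keep the curve below the threshold and extend it flat up to t.
        e = xs[:last] + [t]
        r = recall[:last] + [recall[last - 1]]
        area = sum((e[i + 1] - e[i]) * (r[i + 1] + r[i]) / 2.0
                   for i in range(len(e) - 1))
        aucs.append(area / t)
    return aucs
```

A value of 1.0 means every image pair has zero error, and 0.0 means no pair falls below the threshold; multiplying by 100 gives percentages on the same scale as the table.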

Discussion

By introducing the dual-tree complex wavelet transform, adaptive spatial feature fusion, and a linear attention mechanism, DALGlue achieves higher adaptability and better matching performance in challenging aerial scenes. Two critical limitations warrant further investigation:

  1. Dynamic object interference. The current framework assumes a static environment, which can lead to cumulative error amplification. When a large number of dynamic objects appear in the scene, the algorithm may mistake object motion for UAV motion, which increases the cumulative error and degrades visual navigation. The issue can be alleviated by dynamic object filtering based on semantic segmentation [38], and we are currently conducting the relevant experiments. For future work, a promising direction is to allow the model to explicitly distinguish between static structures and dynamic elements. By adaptively suppressing features from dynamic regions while preserving those from static backgrounds, DALGlue could achieve more robust performance in dynamic environments. Additionally, lightweight semantic segmentation could guide the algorithm to match feature points only within the same semantic region, which is theoretically expected to improve matching accuracy.

  2. Cross-domain generalization. Future research can also match non-homologous images, such as satellite and ground-view images. Furthermore, some studies have proposed Vision-and-Language Navigation [39] (VLN) to achieve advanced autonomy: these technologies enable UAVs to move beyond simple path planning and perform more complex, high-level tasks, such as intelligent inspection, disaster search and rescue, and urban monitoring [40]. Combining multi-source remote sensing data can further improve the accuracy of UAV visual navigation.
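The semantic filtering idea raised in the first limitation can be sketched in a few lines. The code below is a hypothetical illustration, not an implemented part of DALGlue: it assumes that a segmentation model has already produced an integer label map for each image, that keypoints lie inside the image, and that the set of dynamic class labels is supplied by the user.

```python
def label_at(label_map, point):
    """Semantic label of the pixel nearest to an (x, y) keypoint."""
    x, y = point
    return label_map[int(round(y))][int(round(x))]


def filter_static_keypoints(keypoints, label_map, dynamic_labels):
    """Drop keypoints that fall on a dynamic class (e.g. vehicles)."""
    return [p for p in keypoints
            if label_at(label_map, p) not in dynamic_labels]


def same_label_matches(matches, kpts0, kpts1, labels0, labels1):
    """Keep only matches whose two endpoints share a semantic label.

    matches: list of (i, j) index pairs into kpts0 and kpts1.
    """
    return [(i, j) for i, j in matches
            if label_at(labels0, kpts0[i]) == label_at(labels1, kpts1[j])]
```

The first function corresponds to suppressing features from dynamic regions before matching; the second corresponds to restricting matches to points with the same semantic label.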

Conclusion

This paper proposes DALGlue, an efficient image matching method that combines a convolutional neural network feature extraction algorithm with a feature matching network. First, the aerial images are preprocessed with the dual-tree complex wavelet transform to enhance their high-frequency components. Then, an adaptive spatial feature fusion module extracts image features and computes feature points and descriptors. Finally, a linear attention mechanism aggregates context information to match feature points, and unmatched points are pruned to improve computational efficiency. DALGlue achieves relative gains of +19.1% (AUC@5°), +11.4% (AUC@10°), and +6.4% (AUC@20°) over LightGlue on the MegaDepth-1500 dataset. At the same time, the average time for extracting and matching 512, 1024, 2048, and 4096 feature points is reduced to 8.7, 9.0, 9.7, and 23.3 ms, respectively. Overall, DALGlue improves both the matching accuracy and the response speed for UAV aerial images in challenging aerial scenes.

Author contributions

H.Z. and H.H. conceived the study; H.Z. developed the methodology and implemented the software; H.Z. and F.L. performed the validation; H.Z. conducted the formal analysis; H.H. and Y.Z. carried out the investigation and provided resources; H.Z. curated the data and prepared the original draft of the manuscript; H.Z., H.H., and Y.Z. reviewed and edited the manuscript; H.Z. created all visualizations; H.H., Y.Z., and F.L. supervised the project; H.Z. and H.H. administered the project; H.H. and Y.Z. acquired the funding.

Funding

This work was supported in part by National Natural Science Foundation of China under Grant 32200546, in part by Hebei Natural Science Foundation under Grant C2024202003, in part by the Science and Technology Cooperation Special Project of Shijiazhuang under Grant SJZZXA23005 and in part by Open Topics in Key Laboratory of Natural Resources Monitoring and Supervision in Southern Hilly Region, Ministry of Natural Resources under Grant NRMSSHR-2022-Y03.

Data availability

The data presented in this study are available on reasonable request from the corresponding author. The University-1652 and COCO2017 datasets analysed during the current study were obtained from https://paperswithcode.com/dataset/university-1652 and https://cocodataset.org/#download, respectively.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Hao He, Email: hehao@hebut.edu.cn.

Fan Lei, Email: xyp@img.net.

References

  • 1. Liu, Y., Ma, Y., Qi, Y., Qing, L. & Li, G. Boosting UAV detection via memory-enhanced attention and contrastive learning. IEEE Signal Process. Lett. 32, 3132–3136. 10.1109/lsp.2025.3592113 (2025).
  • 2. Xue, Y. et al. Consistent representation mining for multi-drone single object tracking. IEEE Trans. Circuits Syst. Video Technol. 34, 10845–10859. 10.1109/TCSVT.2024.3411301 (2024).
  • 3. Klemas, V. V. Coastal and environmental remote sensing from unmanned aerial vehicles: An overview. J. Coast. Res. 31, 1260–1267. 10.2112/jcoastres-d-15-00005.1 (2015).
  • 4. Xue, Y. et al. Handling occlusion in UAV visual tracking with query-guided redetection. IEEE Trans. Instrum. Meas. 73, 1–17. 10.1109/TIM.2024.3440378 (2024).
  • 5. Wang, Y., Tian, Y., Chen, J., Xu, K. & Ding, X. A survey of visual SLAM in dynamic environment: The evolution from geometric to semantic approaches. IEEE Trans. Instrum. Meas. 10.1109/tim.2024.3420374 (2024).
  • 6. Zeng, F., Wang, C. & Ge, S. S. A survey on visual navigation for artificial agents with deep reinforcement learning. IEEE Access 8, 135426–135442. 10.1109/access.2020.3011438 (2020).
  • 7. Xue, Y. et al. Target-distractor aware UAV tracking via global agent. IEEE Trans. Intell. Transp. Syst. 10.1109/TITS.2025.3581391 (2025).
  • 8. Al-Jarrah, O. Y., Shatnawi, A. S., Shurman, M. M., Ramadan, O. A. & Muhaidat, S. Exploring deep learning-based visual localization techniques for UAVs in GPS-denied environments. IEEE Access 12, 113049–113071. 10.1109/access.2024.3440064 (2024).
  • 9. Zhang, S., Liu, Z., Xu, B., Wang, J. & Li, Y. Fusion GNSS/INS/Vision with path planning prior for high precision navigation in complex environment. IEEE Sens. J. 25, 9045–9055. 10.1109/jsen.2025.3525842 (2025).
  • 10. Sun, J. et al. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8918–8927 (2021).
  • 11. DeTone, D., Malisiewicz, T. & Rabinovich, A. Toward Geometric Deep SLAM. Preprint at http://arxiv.org/abs/1707.07410 (2017).
  • 12. Nguyen, T., Chen, S. W., Shivakumar, S. S., Taylor, C. J. & Kumar, V. Unsupervised deep homography: A fast and robust homography estimation model. IEEE Robot. Autom. Lett. 3, 2346–2353. 10.1109/lra.2018.2809549 (2018).
  • 13. Shen, J. et al. An algorithm based on lightweight semantic features for ancient mural element object detection. Npj Herit. Sci. 10.1038/s40494-025-01565-6 (2025).
  • 14. Shen, J. et al. Finger vein recognition algorithm based on lightweight deep convolutional neural network. IEEE Trans. Instrum. Meas. 71, 1–13. 10.1109/TIM.2021.3132332 (2022).
  • 15. DeTone, D., Malisiewicz, T. & Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 337–349 (2018).
  • 16. Sarlin, P.-E., DeTone, D., Malisiewicz, T. & Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 4937–4946 (2020).
  • 17. Lindenberger, P., Sarlin, P.-E. & Pollefeys, M. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 17581–17592 (2023).
  • 18. Zhou, Y., Guo, Y., Lin, K.-P., Yang, F. & Li, L. USuperGlue: An unsupervised UAV image matching network based on local self-attention. Soft Comput. 10.1007/s00500-023-09088-7 (2023).
  • 19. Xue, Y. et al. SmallTrack: Wavelet pooling and graph enhanced classification for UAV small object tracking. IEEE Trans. Geosci. Remote Sens. 10.1109/tgrs.2023.3305728 (2023).
  • 20. Vaswani, A. et al. Attention is all you need. In 31st Annual Conference on Neural Information Processing Systems (NIPS) (2017).
  • 21. Zheng, Z., Wei, Y. & Yang, Y. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia (MM) 1395–1403 (2020).
  • 22. Sebai, H., Kourgli, A. & Serir, A. Dual-tree complex wavelet transform applied on color descriptors for remote-sensed images retrieval. J. Appl. Remote Sens. 10.1117/1.Jrs.9.095994 (2015).
  • 23. Dragomiretskiy, K. & Zosso, D. Variational mode decomposition. IEEE Trans. Signal Process. 62, 531–544. 10.1109/tsp.2013.2288675 (2014).
  • 24. Shen, J., Liu, N., Sun, H., Li, D. & Zhang, Y. An instrument indication acquisition algorithm based on lightweight deep convolutional neural network and hybrid attention fine-grained features. IEEE Trans. Instrum. Meas. 73, 1–16. 10.1109/TIM.2023.3346488 (2024).
  • 25. Xie, J., Pang, Y., Nie, J., Cao, J. & Han, J. Latent feature pyramid network for object detection. IEEE Trans. Multimed. 25, 2153–2163. 10.1109/tmm.2022.3143707 (2023).
  • 26. Li, X. et al. Enhanced blind face restoration with multi-exemplar images and adaptive spatial feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2703–2712 (2020).
  • 27. Ge, C., Song, Y., Ma, C., Qi, Y. & Luo, P. Rethinking attentive object detection via neural attention learning. IEEE Trans. Image Process. 33, 1726–1739. 10.1109/TIP.2023.3251693 (2024).
  • 28. Vu Minh Hieu, P. et al. Structural attention: Rethinking transformer for unpaired medical image synthesis. In Proceedings of the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 690–700 (2024).
  • 29. Lin, T.-Y. et al. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision (ECCV) 740–755 (2014).
  • 30. Larsson, V. PoseLib: Minimal Solvers for Camera Pose Estimation. https://github.com/vlarsson/PoseLib (2024).
  • 31. Hartley, R. & Zisserman, A. Multiple View Geometry in Computer Vision. 2nd edn (Cambridge University Press, 2004).
  • 32. Li, Z. & Snavely, N. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2041–2050 (2018).
  • 33. Harwin, S. & Lucieer, A. Assessing the accuracy of georeferenced point clouds produced via multi-view stereopsis from unmanned aerial vehicle (UAV) imagery. Remote Sens. 4, 1573–1599. 10.3390/rs4061573 (2012).
  • 34. Jiang, H. et al. OmniGlue: Generalizable feature matching with foundation model guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 19865–19875 (2024).
  • 35. Wang, Y. et al. Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 21666–21675 (2024).
  • 36. Edstedt, J. et al. RoMa: Robust dense feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 19790–19800 (2024).
  • 37. Liu, W., Wen, Y., Yu, Z. & Yang, M. Large-margin softmax loss for convolutional neural networks. In Proceedings of the 33rd International Conference on Machine Learning (2016).
  • 38. Chen, W. et al. Multi-attention network for compressed video referring object segmentation. In Proceedings of the 30th ACM International Conference on Multimedia (MM) 4416–4425 (2022).
  • 39. Liu, S. et al. AerialVLN: Vision-and-language navigation for UAVs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 15338–15348 (2023).
  • 40. Xue, Y. et al. AVLTrack: Dynamic sparse learning for aerial vision-language tracking. IEEE Trans. Circuits Syst. Video Technol. 35, 7554–7567. 10.1109/TCSVT.2025.3549953 (2025).


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
