Abstract
Deformable attention attends only to a small set of key sampling points around a reference point, so it can dynamically capture local features of the input feature map regardless of the feature map's size. Introducing it into point cloud registration therefore makes extracting local geometric features from a point cloud faster and easier than with standard attention. We propose a point cloud registration method based on a Spatial Deformable Transformer (SDT). SDT consists of a deformable self-attention module, which enhances local geometric feature representation, and a cross-attention module, which enhances the discriminative capability of features for spatial correspondences. Experimental results show that, compared with state-of-the-art registration methods, SDT achieves better feature matching recall, inlier ratio and registration recall on the 3DMatch and 3DLoMatch scenes, and better generalization ability and time efficiency on the ModelNet40 and ModelLoNet40 scenes.
Subject terms: Computer science, Information technology
Introduction
Point cloud registration is a significant task in computer vision and plays a crucial role in 3D reconstruction1,2, SLAM3,4, autonomous driving5,6 and other fields. The goal of 3D point cloud registration is to align multiple point clouds captured from different viewpoints or sensors into a common coordinate system. Noise, outliers and low overlap make point cloud registration a challenging problem, so a high-precision and robust registration algorithm is of great theoretical and practical significance.
The traditional Iterative Closest Point (ICP)7 algorithm is the most widely used rigid point cloud registration method. It minimizes point-to-point or point-to-plane distances in the overlapping area between point clouds, alternately updating the correspondences and the transformation matrix between the source and target point clouds. However, its main drawback is that it easily converges to local optima. To address this problem, J. Yang et al.8 proposed the globally optimal iterative closest point algorithm Go-ICP, which uses a branch-and-bound strategy to search for the globally optimal registration at the cost of longer computation time.
With the continuous improvement of computing power, deep learning-based methods have transformed traditional feature extraction. Choy et al.9 proposed FCGF, which uses a ResUNet10 architecture built on 3D sparse convolutions to extract features; however, FCGF is computationally expensive and implicitly decreases resolution11. Ao et al.12 proposed SpinNet to extract rotation-invariant point cloud features. It consists of a spatial point transformer and a feature extractor, which enable the network to learn local spatial features that are robust enough for fine registration. X. Bai et al.13 proposed D3Feat, which includes a KPConv14 feature extraction network that can be extended to deformable convolutions that learn to adapt kernel points to local geometry. D3Feat also introduced a novel keypoint selection strategy and a self-supervised detector loss to eliminate the impact of point cloud density on keypoints. Building on D3Feat and the attention mechanism15, S. Huang et al.16 proposed PREDATOR, which alternates self-attention and cross-attention to aggregate local and global information of point clouds and shows higher registration accuracy on the 3DMatch scene dataset. Li et al.17 proposed Lepard, which can register point clouds in deformable scenes; Lepard builds its network on a Transformer with self- and cross-attention and on differentiable matching. In rigid cases, Lepard combined with RANSAC18 and ICP demonstrates state-of-the-art registration recall; in deformable cases, it also achieves higher non-rigid feature matching recall than the state of the art.
With the development of Transformers with self/cross attention in point cloud registration, registration accuracy and precision have improved. However, self/cross attention in the Transformer is a global mechanism that attends to all positions of the input features, which makes it difficult for the model to capture local features of the point cloud and thereby reduces registration robustness to noise in low overlap scenes. Moreover, self/cross attention computes a weight for every position of the input features, which increases computational complexity and affects registration efficiency. In recent years, researchers have proposed deformable attention19,20 for 2D image detection, recognition and classification to overcome these limitations of self/cross attention. Deformable self/cross attention is a more flexible attention mechanism whose core idea is to introduce deformability into the traditional self/cross attention mechanism, allowing the model to adaptively adjust its attention focus according to the input feature map. Deformable self/cross attention attends only to a small set of key sampling points around the reference point, regardless of the size of the feature map, and dynamically adjusts the weights of different positions of the input features. In this way it captures local features and improves the efficiency of feature extraction.
In this paper, we introduce deformable self/cross attention into point cloud registration and use spatial local positional relationships as the local position embeddings of deformable self-attention. On this basis, we propose the Spatial Deformable Transformer (SDT) for point cloud registration. This approach enhances the ability to learn local geometric features through the SDT module, and effectively reduces the impact of mismatches on registration robustness by constructing the correspondence matrix with the Sinkhorn and Hungarian algorithms. Our main contributions are as follows.
We propose a novel 3D point cloud registration network based on SDT to address point cloud registration under low overlap scenes.
We construct a deformable self-attention module to exchange local geometric spatial information within a point cloud, enhancing feature representation and making features easier to match.
We construct a deformable cross-attention module to transfer features between point clouds, enhancing the discriminative capability of features for spatial correspondences.
We design a balanced weighted loss function that uses a focal loss between the soft correspondence confidence matrix and the ground-truth correspondence matrix as supervision, to obtain more accurate hard matching correspondences between pairs of point clouds.
Related Work
Traditional point cloud registration
ICP7 is the classical traditional point cloud registration method: it finds the closest target point for each source point to generate 3D-3D correspondences and performs a least-squares optimization to compute the rigid transformation between a pair of point clouds. These two steps are iterated until a termination condition is satisfied. Many variants, such as Go-ICP8, Generalized-ICP21 and Sparse ICP22, have been proposed to increase its efficiency or improve its robustness to noise and mismatches. However, the main drawback of these methods is that they require proper initialization to converge to a good solution23. Another issue of ICP and its variants is poor robustness to outliers and partial overlap, which often occur in real-world data. Therefore, some traditional methods register point clouds by matching local shape descriptors combined with the RANSAC algorithm. Representative shape descriptors include PFH24, FPFH25, SHOT26, RoPS27 and GASD28. Nevertheless, the quality of such hand-crafted descriptors is affected by point density and outliers29, and they rely heavily on low-level geometric attributes to compute orientations30.
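To make the alternation between closest-point correspondence search and least-squares transformation estimation concrete, the following is a minimal, illustrative ICP sketch in Python (NumPy/SciPy). It is not the reference implementation of ICP7; the function names, parameters and convergence criterion are our own illustrative choices.

```python
# Minimal point-to-point ICP sketch: nearest-neighbour correspondences followed by
# a closed-form least-squares (Kabsch/SVD) update, iterated until convergence.
import numpy as np
from scipy.spatial import cKDTree

def kabsch(src, tgt):
    """Best-fit rotation R and translation t mapping src onto tgt (least squares)."""
    c_src, c_tgt = src.mean(0), tgt.mean(0)
    H = (src - c_src).T @ (tgt - c_tgt)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, c_tgt - R @ c_src

def icp(source, target, iters=50, tol=1e-6):
    tree = cKDTree(target)
    R_total, t_total = np.eye(3), np.zeros(3)
    src, prev_err = source.copy(), np.inf
    for _ in range(iters):
        dist, idx = tree.query(src)        # closest-point correspondences
        R, t = kabsch(src, target[idx])    # least-squares rigid update
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
        err = dist.mean()
        if abs(prev_err - err) < tol:      # termination condition
            break
        prev_err = err
    return R_total, t_total
```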
Learning-based point cloud registration
Recently, various deep learning31–34 approaches have been proposed for registration, such as PREDATOR16, REGTR35 and PCRNet36. Learning-based registration can be divided into two categories: feature learning-based methods and end-to-end learning-based methods. Unlike traditional point cloud registration methods, feature learning-based methods use a deep neural network to learn a robust feature correspondence search, after which the transformation matrix is determined by a one-step estimator (e.g. RANSAC) without iteration. PREDATOR employs an attention mechanism to extract contextual information, learning more distinctive feature descriptors and finding soft correspondences in the overlap between a pair of point clouds. REGTR utilizes self-attention and cross-attention to directly predict a consistent set of final point correspondences. All these methods use deep learning as a feature extraction tool and aim to estimate robust correspondences from the learned distinctive features. End-to-end learning-based methods solve the registration problem with a single neural network whose input is a pair of point clouds and whose output is the transformation matrix that aligns them; the network both extracts point cloud features and estimates the transformation. In contrast, the network of a feature learning-based method is separate from the transformation estimation and focuses on feature learning. PCRNet uses PointNet to extract global features, concatenates them, and feeds them into an MLP to regress the transformation parameters. DeepGMR37 leverages a neural network to learn pose-invariant point-to-distribution parameter correspondences, which are then fed into a GMM optimization module to estimate the transformation matrix. DGR38 puts forward a 6-dimensional convolutional network architecture for inlier likelihood prediction and estimates the transformation through a weighted Procrustes module.
Problem definition
Consider a pair of point clouds $P$ and $Q$, which we denote as the source point cloud and the target point cloud, respectively. $N$ and $M$ denote the number of points in the source point cloud P and the target point cloud Q, respectively. The objective of point cloud registration is to estimate an unknown rigid transformation $T=\{R, t\}$, consisting of a rotation $R \in SO(3)$ and a translation $t \in \mathbb{R}^3$, which aligns P to Q.
Methodology
Figure 1 illustrates our overall framework, which consists of three main modules: a feature extraction and embedding module, an SDT module and an overlap correspondence prediction module. The feature extraction and embedding module extracts features of a pair of point clouds with a weight-shared feature extraction network, and also computes local spatial relationships between points of each downsampled point cloud as local position embeddings. The SDT module first receives the extracted features and local position embeddings from the feature extraction and embedding module, and then iteratively performs deformable self-attention and cross-attention, simulating the process of a human browsing back and forth during matching. Deformable self-attention makes features more expressive for matching by enhancing the local geometric feature representation of a point cloud, and deformable cross-attention compares the similarity between a pair of point clouds by enhancing the discriminative capability of features for spatial correspondences. In the overlap correspondence prediction module, we first obtain a similarity matrix by matrix operations on the high-dimensional feature maps from the previous module, then add an edge slack block to the similarity matrix and apply the Sinkhorn39 algorithm to obtain a soft correspondence confidence matrix, and finally transform the soft feature correspondences into one-to-one point correspondences with the Hungarian algorithm40. The RANSAC algorithm is then employed to estimate the final transformation between the source point cloud P and the target point cloud Q.
Figure 1.
Main framework of our proposed point cloud registration. Feature extraction and embedding downsamples the source point cloud P and the target point cloud Q, learns features at multiple resolution levels, and extracts local position relationships from these point clouds as their local position embeddings. The spatial deformable transformer enhances feature representation with deformable self-attention and compares the similarity between two feature sets with deformable cross-attention. Overlap correspondence prediction estimates correspondences between the point clouds in the overlapping region with the Sinkhorn and Hungarian algorithms.
Feature extraction and local position embedding
Feature extraction
To make full use of the information in the original point clouds, feature extraction adopts position adaptive convolution (PAConv)41 and the residual network ResNet42 for multilevel-resolution feature extraction and fusion. Unlike general convolutional networks, PAConv builds convolution kernels by dynamically assembling basic weight matrices stored in a weight bank, which better handles irregular and unordered point cloud data and thus improves registration performance. The backbone network architecture is illustrated in Fig. 2. The input point clouds can be expressed as $(N, 3)$ and $(M, 3)$, where $N$ and $M$ are the numbers of points in the source point cloud P and the target point cloud Q, respectively, and 3 is the coordinate dimension of each point. An original source/target point cloud is fed into the feature extraction network and passes through multiple layers of ResBlockA and ResBlockB. ResBlockA consists of a Conv1D convolution layer, a PAConv convolution layer, a LayerNorm normalization layer, a Leaky-ReLU activation layer and a shortcut Conv1D convolution layer. In ResBlockB, if the parameter strided is set to true, PAConv downsamples the number of points to 1/4 of the number in the upper-level structure, and a max-pooling operation is executed on the shortcut to keep the dimensions consistent. We combine residual connections in the feature extraction backbone and add the multilevel-resolution feature maps to the convolutional results to achieve multilevel feature fusion. The feature maps $F^P$ and $F^Q$ of the point clouds P and Q are finally obtained, with dimensions $(N', 1024)$ and $(M', 1024)$ respectively, where $N'$ is 1/64 of $N$ and $M'$ is 1/64 of $M$.
Figure 2.
Feature extraction network structure. The original source point cloud has dimension $(N, 3)$, which becomes $(N, 64)$ after PAConv(3, 64). Each of the groups of ResBlocks marked in green, purple and pink downsamples the number of points to 1/4 of the previous level while increasing the feature dimension; the group of ResBlocks marked in red does not downsample the points, since the parameter strided in its ResBlockB is not set to true, and yields the final feature map of dimension $(N/64, 1024)$.
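The sketch below illustrates the residual block structure described above using standard PyTorch layers. It is a structural illustration only: the real PAConv41 operator is replaced by a plain Conv1d placeholder, the channel sizes are examples, and ResBlockB's strided downsampling and max-pooled shortcut are only indicated in comments.

```python
import torch
import torch.nn as nn

class ResBlockA(nn.Module):
    """Illustrative sketch of ResBlockA: Conv1d -> (PAConv) -> LayerNorm -> LeakyReLU
    with a Conv1d shortcut.  In ResBlockB, a strided variant would additionally
    downsample the points to N/4 and apply max pooling on the shortcut."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, 1)
        self.paconv = nn.Conv1d(c_out, c_out, 1)    # placeholder standing in for PAConv
        self.norm = nn.LayerNorm(c_out)
        self.act = nn.LeakyReLU(0.2)
        self.shortcut = nn.Conv1d(c_in, c_out, 1)

    def forward(self, x):                           # x: (B, C_in, N)
        y = self.paconv(self.conv(x))
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        return self.act(y + self.shortcut(x))       # residual connection

block = ResBlockA(3, 64)
out = block(torch.randn(2, 3, 1024))                # (2, 64, 1024)
```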
Local position embedding
The input of the local position embedding (LPE) is the downsampled source point cloud P and target point cloud Q. Following43, the spatial position relations within a single point cloud are explicitly computed and taken as the LPE of deformable self-attention in SDT. The spatial position relations of the point clouds are shown in Fig. 3.
Figure 3.
Local Position Relation.
In Fig. 3, $p_i$ and $p_j$ represent two points within a single point cloud, with $p_i^n$ representing the n-th nearest neighbor of $p_i$ and $p_j^n$ representing the n-th nearest neighbor of $p_j$. We define a function $f_d$ to describe the distance relation between two points, where $d_{i,j}$ denotes the Euclidean distance between $p_i$ and $p_j$, $d_i^n$ denotes the distance between $p_i$ and its n-th nearest neighbor, $d_j^n$ denotes the distance between $p_j$ and its n-th nearest neighbor, $\bar{d}_i$ denotes the average distance between $p_i$ and its n neighbor points (n is set to 3 in this paper), $\bar{d}_j$ denotes the average distance between $p_j$ and its n neighbor points, $\sigma_d$ denotes a constant used to control the sensitivity to distance change, and $d_e$ denotes the dimension of the embedding vector. The function $f_d$ is defined as the sinusoidal embedding

$$f_d(p_i,p_j)_{2k}=\sin\!\left(\frac{d_{i,j}}{\sigma_d\cdot 10000^{2k/d_e}}\right),\qquad f_d(p_i,p_j)_{2k+1}=\cos\!\left(\frac{d_{i,j}}{\sigma_d\cdot 10000^{2k/d_e}}\right) \tag{1}$$
We define a function $f_a$ to describe the angle relation between three points, where $\alpha_{i,j}$ denotes the angle between the vectors $\overrightarrow{p_i p_i^n}$ and $\overrightarrow{p_i p_j}$, $\alpha_{j,i}$ denotes the angle between the vectors $\overrightarrow{p_j p_j^n}$ and $\overrightarrow{p_j p_i}$, $\sigma_a$ is a constant that controls the sensitivity to angle change, and $d_e$ is the dimension of the embedding vector. The function $f_a$ is defined as

$$f_a(p_i,p_j)_{2k}=\sin\!\left(\frac{\alpha_{i,j}}{\sigma_a\cdot 10000^{2k/d_e}}\right),\qquad f_a(p_i,p_j)_{2k+1}=\cos\!\left(\frac{\alpha_{i,j}}{\sigma_a\cdot 10000^{2k/d_e}}\right) \tag{2}$$
Finally, the spatial position relation between $p_i$ and $p_j$ is defined as

$$r_{i,j}=f_d(p_i,p_j)\,W^{D}+f_a(p_i,p_j)\,W^{A} \tag{3}$$

where $f_d$ is the distance relation between two points, $f_a$ is the angle relation between three points, $W^{D}$ and $W^{A}$ are the projection matrices of the distance and angle relations, respectively, and the dimension of $r$ is $(N', N', 255)$.
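The sketch below illustrates how such a local position embedding could be computed, assuming the sinusoidal embedding form of43 for Eqs. (1) and (2). The learnable projections $W^{D}$ and $W^{A}$ of Eq. (3) are omitted, and all function names, defaults and the choice of the reference neighbor are illustrative assumptions rather than the paper's implementation.

```python
import math
import torch

def sinusoidal_embedding(x, sigma, d_e):
    """Sinusoidal embedding of a scalar relation x (distance or angle), with
    sensitivity sigma and embedding dimension d_e (assumed form of Eqs. (1)-(2))."""
    k = torch.arange(d_e // 2, dtype=x.dtype, device=x.device)
    freqs = sigma * torch.pow(torch.tensor(10000.0, dtype=x.dtype), 2.0 * k / d_e)
    angles = x.unsqueeze(-1) / freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)        # (..., d_e)

def local_position_embedding(points, sigma_d=0.2, sigma_a=math.radians(10.0), n=3, d_e=128):
    """Sketch of the LPE: distance and angle relations within a single point cloud,
    embedded sinusoidally and concatenated (projections W^D, W^A omitted)."""
    d_ij = torch.cdist(points, points)                            # (N, N) pairwise distances
    knn = d_ij.topk(n + 1, largest=False).indices[:, 1:]          # n nearest neighbours of each point
    v_ref = (points[knn[:, 0]] - points).unsqueeze(1)             # (N, 1, 3): p_i -> its nearest neighbour
    v_ij = points.unsqueeze(0) - points.unsqueeze(1)              # (N, N, 3): p_i -> p_j
    cos = (v_ref * v_ij).sum(-1) / (v_ref.norm(dim=-1) * v_ij.norm(dim=-1) + 1e-8)
    alpha = torch.acos(cos.clamp(-1.0, 1.0))                      # (N, N) angle relation
    return torch.cat([sinusoidal_embedding(d_ij, sigma_d, d_e),
                      sinusoidal_embedding(alpha, sigma_a, d_e)], dim=-1)   # (N, N, 2*d_e)
```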
Spatial deformable transformer
SDT consists of a deformable self-attention module, which enhances local geometric feature expression, and a deformable cross-attention module, which transfers features between point clouds in order to compare the similarity between a pair of point clouds. It explicitly receives the local position embeddings and the high-dimensional features, and performs weighted aggregation of the features. To improve computational efficiency, we reduce the dimension of the feature maps extracted by the feature extraction module from 1024 to 256 by linear projection. In the SDT module, the two deformable attention modules are executed iteratively n times; extensive experiments show that setting n to 4 aggregates the local features of point clouds better and faster. The outputs of SDT are $\hat{F}^{P}$ and $\hat{F}^{Q}$, corresponding to $(P, F^{P})$ and $(Q, F^{Q})$ respectively, and their dimensions are $(N', 256)$ and $(M', 256)$.
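A minimal sketch of this iteration is given below. It assumes the DeformableSelfAttention and DeformableCrossAttention modules sketched in (A) and (B) that follow, with weights shared between the two point clouds; all names, signatures and the weight-sharing choice are illustrative assumptions.

```python
import torch.nn as nn

class SDT(nn.Module):
    """Sketch of the SDT stack: project 1024-d backbone features to 256-d, then
    alternate deformable self-attention and deformable cross-attention n = 4 times."""
    def __init__(self, d_in=1024, d_model=256, n_iter=4):
        super().__init__()
        self.proj = nn.Linear(d_in, d_model)
        self.self_attn = nn.ModuleList([DeformableSelfAttention(d_model) for _ in range(n_iter)])
        self.cross_attn = nn.ModuleList([DeformableCrossAttention(d_model) for _ in range(n_iter)])

    def forward(self, feat_p, feat_q, lpe_p, lpe_q, rel_p, rel_q):
        x_p, x_q = self.proj(feat_p), self.proj(feat_q)                 # (N', 256), (M', 256)
        for sa, ca in zip(self.self_attn, self.cross_attn):
            x_p, x_q = sa(x_p, lpe_p, rel_p), sa(x_q, lpe_q, rel_q)     # intra-cloud interaction
            x_p, x_q = ca(x_p, x_q), ca(x_q, x_p)                       # inter-cloud feature exchange
        return x_p, x_q
```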
(A) Deformable self-attention module
The original attention mechanism describes the degree of autocorrelation of the input information and is represented by the attention weight matrix computed from the query vector (Query, Q), key vector (Key, K) and value vector (Value, V). Usually, V is weighted according to the relative importance of Q and K to obtain the attention matrix, which can be expressed as

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{4}$$

where $d_k$ is the dimension of the key vector. Attention is called self-attention if Q, K and V come from the same feature map.
Different from self-attention, deformable self-attention20,44 predicts k position offsets according to the query vector Q, and computes attention scores from Q and those k sampled positions of K and V. In this paper, we use the continuous position bias (CPB) method proposed in Swin Transformer V245 to generate the spatial deformation offset $\Delta_{i,j}$, which improves the model's ability to capture local geometric information. $\Delta_{i,j}$ is defined by the following formula

$$\Delta_{i,j}=\mathcal{G}(\delta_{i,j}) \tag{5}$$

where $\mathcal{G}$ is by default a small network consisting of two MLP layers with one ReLU activation in between, and $\delta_{i,j}$ is the relative position offset between the query vector Q at position $p_i$ and the key vector K at position $p_j$.
In the following, we describe the computation for $(P, F^{P})$; the same applies to $(Q, F^{Q})$. Deformable self-attention performs a grouping strategy46,47 on the high-dimensional features to obtain Q, K and V, and performs grouped grid sampling48 on the local position embeddings to obtain G. By taking the dot products of Q with K and of Q with G and adding the spatial deformation offset $\Delta_{i,j}$, we obtain the attention score of deformable attention

$$e_{i,j}=\frac{(x_i W^{Q})(x_j W^{K})^{T}+(x_i W^{Q})(r_{i,j} W^{G})^{T}}{\sqrt{d_t}}+\Delta_{i,j} \tag{6}$$

where $r_{i,j}$ denotes the local position embedding between $p_i$ and $p_j$; $W^{Q}$, $W^{K}$ and $W^{G}$ are the projection matrices of Q, K and G respectively; and $d_t$ is the dimension of the input vector. Based on the obtained attention scores, the output feature matrix of deformable self-attention is the weighted sum of all projected input features

$$z_i=\sum_{j} a_{i,j}\,(x_j W^{V}) \tag{7}$$

where $a_{i,j}$ denotes the weight coefficient computed by a row-wise softmax on the attention scores $e_{i,j}$, and $W^{V}$ denotes the projection matrix of V. Figure 4 shows the deformable self-attention module; the left part is the construction of deformable self-attention and the right part is its computation graph.
Figure 4.
Deformable Self-Attention Module.
The deformable self-attention module transforms the global feature interaction of traditional self-attention into a local feature interaction that adapts to different geometric structures of the point cloud. With the deformable self-attention module, the model can adaptively learn the local geometric spatial information within a point cloud, enhancing feature representation and hence improving the accuracy of point-to-point matching.
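The following single-head sketch illustrates Eqs. (5)–(7). For clarity it computes dense attention over all points and omits the multi-head, grouping and grid-sampling details, so it is an assumption-laden illustration rather than the actual module; the class and parameter names are our own.

```python
import torch
import torch.nn as nn

class DeformableSelfAttention(nn.Module):
    """Single-head sketch of Eqs. (5)-(7): scores combine feature similarity (QK^T),
    a geometric term against the local position embedding (QG^T), and a CPB-style
    offset produced by a small two-layer MLP (Swin Transformer V2 [45])."""
    def __init__(self, d_model=256):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_g = nn.Linear(d_model, d_model, bias=False)       # projects the LPE
        self.cpb = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.scale = d_model ** 0.5

    def forward(self, x, lpe, rel_pos):
        # x: (N, d) features, lpe: (N, N, d) local position embedding, rel_pos: (N, N, 3) offsets
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        g = self.w_g(lpe)                                         # (N, N, d)
        scores = (q @ k.T + torch.einsum('id,ijd->ij', q, g)) / self.scale
        scores = scores + self.cpb(rel_pos).squeeze(-1)           # Eq. (5): deformation offset
        attn = scores.softmax(dim=-1)                             # row-wise softmax
        return attn @ v                                           # Eq. (7): weighted sum
```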
(B) Deformable cross-attention module
A typical step in the point cloud registration task is the deformable cross-attention module, which exchanges features between the two point clouds to obtain the similarity between them. Given the deformable self-attention feature matrices $\hat{F}^{P}$ and $\hat{F}^{Q}$ of the source point cloud P and the target point cloud Q, the feature correlation of $\hat{F}^{P}$ relative to $\hat{F}^{Q}$ can be expressed as

$$e_{i,j}=\frac{(x_i^{P} W^{Q})(x_j^{Q} W^{K})^{T}}{\sqrt{d_t}} \tag{8}$$

where $W^{Q}$ and $W^{K}$ are the projection matrices of the query vector Q and the key vector K respectively, and $d_t$ is the dimension of the input vector. Then, the deformable cross-attention feature matrix of $\hat{F}^{P}$ relative to $\hat{F}^{Q}$ can be denoted as

$$z_i^{P}=\sum_{j} a_{i,j}\,(x_j^{Q} W^{V}) \tag{9}$$

in which $a_{i,j}$ is computed by a row-wise softmax on the attention scores $e_{i,j}$, and $W^{V}$ denotes the projection matrix of V. Figure 5 shows the deformable cross-attention module; the left part is the construction of deformable cross-attention and the right part is its computation graph. The deformable cross-attention feature matrix of $\hat{F}^{Q}$ relative to $\hat{F}^{P}$ is computed in the same way, resulting in a more robust and discriminative feature representation after feature interaction.
Figure 5.
Deformable Cross-Attention Module.
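A corresponding single-head sketch of Eqs. (8)–(9) is shown below; as before, the omission of multi-head and grouping details and all names are illustrative assumptions.

```python
import torch.nn as nn

class DeformableCrossAttention(nn.Module):
    """Single-head sketch of Eqs. (8)-(9): queries come from one point cloud and
    keys/values from the other, so features are exchanged between P and Q."""
    def __init__(self, d_model=256):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** 0.5

    def forward(self, x_src, x_tgt):
        # x_src: (N, d) features of one cloud, x_tgt: (M, d) features of the other
        q, k, v = self.w_q(x_src), self.w_k(x_tgt), self.w_v(x_tgt)
        attn = ((q @ k.T) / self.scale).softmax(dim=-1)   # Eq. (8) with row-wise softmax
        return attn @ v                                    # Eq. (9)
```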
Overlap correspondence prediction
The overlap correspondence prediction module receives the SDT outputs $\hat{F}^{P}$ and $\hat{F}^{Q}$ and unifies their dimensions to $(\max(N', M'), 256)$ by bilinear interpolation49. The cosine similarity matrix S can then be defined as

$$S_{i,j}=\frac{\langle \hat{f}_i^{P}, \hat{f}_j^{Q}\rangle}{\lVert \hat{f}_i^{P}\rVert\,\lVert \hat{f}_j^{Q}\rVert} \tag{10}$$

where $\hat{f}_i^{P}$ and $\hat{f}_j^{Q}$ denote a feature in $\hat{F}^{P}$ and $\hat{F}^{Q}$, respectively. Before generating the cosine similarity matrix S, we can normalize each feature in $\hat{F}^{P}$ and $\hat{F}^{Q}$, so that $\lVert \hat{f}_i^{P}\rVert$ and $\lVert \hat{f}_j^{Q}\rVert$ both equal 1, and S can also be defined as

$$S_{i,j}=\langle \hat{f}_i^{P}, \hat{f}_j^{Q}\rangle \tag{11}$$

We can initially obtain correspondences between the source point cloud P and the target point cloud Q from the cosine similarity matrix S according to a simple rule, for example treating $p_i$ and $q_j$ as a pair of points if the value of $S_{i,j}$ is greater than a certain threshold. However, this approach lets a feature point in $\hat{F}^{P}$ correspond to multiple feature points in $\hat{F}^{Q}$ and produces many mismatched pairs, which decreases registration accuracy and robustness. To address this issue, based on the dustbin mechanism of SuperGlue50, we add an Edge Slack Block to the normalized cosine similarity matrix S and apply the Sinkhorn algorithm to S to compute a soft correspondence confidence matrix. Finally, we use the Hungarian algorithm on the soft correspondence confidence matrix to obtain a hard one-to-one correspondence matrix. The process of overlap correspondence prediction is shown in Fig. 6.
Figure 6.
Process of overlap correspondence prediction.
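The sketch below illustrates the Sinkhorn normalization and Hungarian assignment steps described above (NumPy/SciPy). The edge slack (dustbin) row and column of SuperGlue50 are omitted for brevity, and the function names and iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.special import logsumexp

def sinkhorn(scores, n_iters=100):
    """Alternating row/column normalization of a similarity matrix into a soft
    correspondence confidence matrix (Sinkhorn-Knopp [39]), in the log domain."""
    log_p = scores.copy()
    for _ in range(n_iters):
        log_p = log_p - logsumexp(log_p, axis=1, keepdims=True)   # row normalization
        log_p = log_p - logsumexp(log_p, axis=0, keepdims=True)   # column normalization
    return np.exp(log_p)

def hard_correspondences(similarity):
    """Soft confidences -> one-to-one matches via the Hungarian algorithm [40]."""
    conf = sinkhorn(similarity)
    row, col = linear_sum_assignment(-conf)      # maximize total confidence
    return list(zip(row, col)), conf

# illustrative usage on a random square similarity matrix (dimensions unified beforehand)
S = np.random.rand(6, 6)
matches, conf = hard_correspondences(S)
```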
Loss
Inspired by UTOPIC51, we construct a supervised loss function based on ground-truth correspondences using the $\alpha$-balanced cross-entropy loss20. The $\alpha$-balanced cross-entropy loss is defined as

$$L_{ce}=-\alpha\, y_{i,j}\log c_{i,j}-(1-\alpha)(1-y_{i,j})\log(1-c_{i,j}) \tag{12}$$

where $c_{i,j}\in C$ is the soft correspondence confidence, $y_{i,j}$ denotes the ground-truth point-to-point correspondence, and $\alpha$ is the balancing factor that resolves the imbalance of correspondences: $\alpha$ is used on the correct correspondences and $1-\alpha$ on the incorrect ones. Based on the $\alpha$-balanced cross-entropy loss, we add the modulation factor $\gamma$ to obtain the correct correspondence loss

$$L_{pos}=-\alpha\,(1-c_{i,j})^{\gamma}\log c_{i,j} \tag{13}$$

Similarly, the incorrect correspondence loss is defined as

$$L_{neg}=-(1-\alpha)\,c_{i,j}^{\gamma}\log(1-c_{i,j}) \tag{14}$$

Finally, we obtain the total loss of the model as

$$L=\frac{1}{N'M'}\sum_{i=1}^{N'}\sum_{j=1}^{M'}\bigl[y_{i,j}\,L_{pos}+(1-y_{i,j})\,L_{neg}\bigr] \tag{15}$$

where $N'$ and $M'$ represent the numbers of points after downsampling of the source point cloud P and the target point cloud Q, respectively, $L_{pos}$ is the correct correspondence loss and $L_{neg}$ is the incorrect correspondence loss. For registration on the 3DMatch and 3DLoMatch scenes, we set $\alpha$ to 0.25 and $\gamma$ to 2 according to52. For registration on the ModelNet40 scenes, we set $\alpha$ to 0.45 and $\gamma$ to 2.5.
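A compact sketch of this loss is given below, assuming the focal-loss form reconstructed in Eqs. (12)–(15); the function name and clamping constant are illustrative.

```python
import torch

def correspondence_focal_loss(conf, gt, alpha=0.25, gamma=2.0, eps=1e-8):
    """Sketch of Eqs. (12)-(15): alpha-balanced focal loss between the soft
    correspondence confidence matrix `conf` and the binary ground-truth
    correspondence matrix `gt`, averaged over all N' x M' entries."""
    conf = conf.clamp(eps, 1.0 - eps)
    pos = -alpha * (1.0 - conf) ** gamma * torch.log(conf)          # Eq. (13): correct pairs
    neg = -(1.0 - alpha) * conf ** gamma * torch.log(1.0 - conf)    # Eq. (14): incorrect pairs
    return (gt * pos + (1.0 - gt) * neg).mean()                     # Eq. (15)
```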
Figure 7 shows the loss curves over 40 epochs on ModelNet40 and 3DMatch with learning rate 0.0001 and decay 0.005. The loss decreases continuously on both 3DMatch and ModelNet40 as the number of epochs increases: the loss on ModelNet40 converges to about 0.5 after 20 training epochs, and the loss on 3DMatch converges to about 0.9 after 16 training epochs.
Figure 7.

Curve of loss function.
Experiments
Experimental dataset and parameter setting
We evaluate SDT on the publicly available 3DMatch and ModelNet40 datasets. The 3DMatch dataset contains 62 scenes, of which 46 are used for training, 8 for validation and 8 for testing. The ModelNet40 dataset contains CAD models from 40 classes, with the first 20 classes used for training and validation and the remaining 20 classes for testing. In our experiments, the 3DMatch pairs with overlap between 10% and 30% are used as 3DLoMatch, and the ModelNet40 pairs whose average overlap is reduced to 53.6% are used as ModelLoNet40; these data are used to test the effect of our proposed method in low overlap scenarios. During training, the AdamW optimizer is used with an initial learning rate of 0.0001, and learning rate decay is applied to ensure good model convergence after 40 epochs. The model was trained and tested with the PyTorch framework on a server equipped with an NVIDIA GeForce RTX 3090 GPU.
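The reported optimizer settings can be reproduced with a setup such as the following. The paper does not specify the exact learning-rate schedule, nor whether the decay of 0.005 mentioned above is a weight decay or a learning-rate decay factor, so both choices below are assumptions made purely for illustration.

```python
import torch

model = torch.nn.Linear(256, 256)   # stand-in for the SDT network (illustrative)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.005)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # assumed schedule

for epoch in range(40):
    # ... training loop over 3DMatch / ModelNet40 batches ...
    scheduler.step()
```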
Model evaluation metrics
We evaluate our method on the 3DMatch and ModelNet40 datasets, directly generalize the trained models to the low-overlap 3DLoMatch and ModelLoNet40 datasets, and compare our method against state-of-the-art registration methods. Three metrics from12 are used to evaluate performance on the 3DMatch dataset: (1) FMR (Feature Matching Recall), the fraction of point cloud pairs whose inlier ratio exceeds a certain threshold; (2) IR (Inlier Ratio), the fraction of estimated correspondences whose residuals are below a certain threshold under the ground-truth transformation; (3) RR (Registration Recall), the fraction of point cloud pairs whose transformation error is smaller than a certain threshold. On the ModelNet40 dataset, we evaluate performance with (1) RRE (Relative Rotation Error), the geodesic distance between the estimated and ground-truth rotation matrices; (2) RTE (Relative Translation Error), the Euclidean distance between the estimated and ground-truth translation vectors; (3) CD (Chamfer Distance), the Chamfer distance between the pair of aligned point clouds.
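For reference, the three ModelNet40 metrics can be computed as in the following sketch; one common symmetric Chamfer distance variant is used, and the exact thresholds and variants of the paper are not reproduced here.

```python
import numpy as np

def relative_rotation_error(R_est, R_gt):
    """RRE: geodesic distance (degrees) between estimated and ground-truth rotations."""
    cos = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def relative_translation_error(t_est, t_gt):
    """RTE: Euclidean distance between estimated and ground-truth translations."""
    return np.linalg.norm(t_est - t_gt)

def chamfer_distance(P, Q):
    """CD: symmetric Chamfer distance between two aligned point clouds (one common variant)."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)   # pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```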
Comparison of the experiments
Table 1 compares the performance of SDT with state-of-the-art registration methods, namely FCGF, D3Feat, Predator and CoFiNet53. Table 1 shows that (1) on the 3DMatch dataset, the FMR of our SDT is only slightly lower than that of CoFiNet, the IR of our SDT is lower than that of Predator, and the RR of our SDT outperforms all other methods; (2) on the 3DLoMatch dataset, all metrics of our SDT outperform the other methods: the FMR of SDT is 3.7% higher than that of CoFiNet, the IR is 2.2% higher than that of Predator, and the RR is 3.9% higher than that of CoFiNet. These results show that our SDT registers point clouds effectively and is more robust and accurate on point clouds with low overlap. The registration results of our SDT and Predator on the 3DMatch and 3DLoMatch datasets are shown in Fig. 8, where the diagrams in the 1st and 2nd rows display raw data from 3DMatch with 48.3% and 73.4% overlap and their registration results, and the diagrams in the 3rd and 4th rows display raw data from 3DLoMatch with 29.0% and 21.5% overlap and their registration results. Our SDT can distinguish similar objects at different positions (see the comparison of Predator and SDT in the 3rd and 4th columns) and recognize small overlapping regions in complex environments, thanks to the locally salient features obtained from the deformable self-attention and cross-attention.
Table 1.
Comparison of FMR, IR and RR (%) of different methods.
| Method | FMR | | IR | | RR | |
|---|---|---|---|---|---|---|
| | 3DMatch | 3DLoMatch | 3DMatch | 3DLoMatch | 3DMatch | 3DLoMatch |
| FCGF | 97.4 | 75.9 | 56.9 | 22.0 | 87.3 | 41.7 |
| D3Feat | 95.8 | 67.4 | 40.7 | 15.5 | 84.9 | 46.9 |
| Predator | 96.6 | 77.9 | 73.8 | 37.8 | 90.6 | 62.4 |
| CoFiNet | 98.1 | 83.1 | 52.2 | 26.9 | 89.3 | 67.5 |
| SDT (ours) | 97.5 | 86.8 | 67.3 | 40.0 | 91.0 | 71.4 |
Significant values are in bold.
Figure 8.
Registration results on 3DMatch and 3DLoMatch.
Because the 8 test scenes of 3DMatch and 3DLoMatch differ in structure, the features obtained by different methods also differ considerably across these scenes. Tables 2 and 3 compare the RR of different methods on the 8 test scenes of 3DMatch and 3DLoMatch, respectively. The experimental results show that our SDT outperforms the other methods. In detail, on 3DMatch the RR of our SDT is best in most scenes, especially hard scenes such as Home_2 and Lab, and SDT has the highest mean RR; on 3DLoMatch the RR of our SDT is lower than that of CoFiNet only on Home_1 and Study, and SDT again has the highest mean RR. These results further show that our SDT together with overlap correspondence prediction achieves not only robust but also accurate registration.
Table 2.
Comparison of RR (%) of different methods on 3DMatch.
| Method | 3DMatch (RR) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | Kitchen | Home_1 | Home_2 | Hotel_1 | Hotel_2 | Hotel_3 | Study | Lab | Mean |
| FCGF | 98.0 | 94.3 | 68.6 | 96.7 | 91.0 | 84.6 | 76.1 | 71.1 | 85.1 |
| D3Feat | 96.0 | 86.8 | 67.3 | 90.7 | 88.5 | 80.8 | 78.2 | 64.4 | 81.6 |
| Predator | 97.6 | 97.2 | 74.8 | 98.9 | 96.2 | 88.5 | 85.9 | 73.3 | 89.0 |
| CoFiNet | 96.2 | 99.1 | 73.2 | 95.8 | 91.2 | 84.6 | 89.9 | 84.4 | 89.3 |
| SDT (ours) | 97.6 | 96.8 | 81.2 | 98.4 | 89.1 | 89.1 | 90.7 | 87.3 | 91.3 |
Significant values are in bold.
Table 3.
Comparison of RR (%) of different method on 3DLoMatch.
| Method | 3DLoMatch (RR) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | Kitchen | Home_1 | Home_2 | Hotel_1 | Hotel_2 | Hotel_3 | Study | Lab | Mean |
| FCGF | 60.8 | 42.2 | 53.6 | 53.1 | 38.0 | 26.8 | 16.1 | 30.4 | 40.1 |
| D3Feat | 49.7 | 37.2 | 47.3 | 47.8 | 36.5 | 31.7 | 15.7 | 31.9 | 37.2 |
| Predator | 71.5 | 58.2 | 60.8 | 77.5 | 64.2 | 61.0 | 45.8 | 39.1 | 59.8 |
| CoFiNet | 74.1 | 67.5 | 64.4 | 81.7 | 65.5 | 63.1 | 54.8 | 68.1 | 67.4 |
| SDT (ours) | 85.5 | 64.0 | 71.6 | 87.6 | 71.7 | 66.7 | 54.0 | 70.0 | 71.4 |
Significant values are in bold.
To verify the robustness of our SDT with different numbers of sample points, the number of sampling points provided to the network is gradually reduced in our experiments, and the RR of different methods is shown in Table 4. Only when the number of sample points on 3DMatch is 1000 is the RR of Predator slightly higher than that of our SDT; in all other cases, and whatever the number of sampling points on 3DLoMatch, the RR of our SDT outperforms all other methods. The results also show that our SDT is robust to the number of sampling points, even when only 250 points are sampled.
Table 4.
Comparison of RR (%) of different algorithms at different corresponding sampling points.
| Method | 3DMatch (RR) | | | | | | 3DLoMatch (RR) | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 5000 | 2500 | 1000 | 500 | 250 | Mean | 5000 | 2500 | 1000 | 500 | 250 | Mean |
| FCGF | 85.1 | 84.7 | 83.3 | 81.6 | 71.4 | 81.2 | 40.1 | 41.7 | 38.2 | 35.4 | 26.8 | 36.4 |
| D3Feat | 81.6 | 84.5 | 83.4 | 82.4 | 77.9 | 82.0 | 37.2 | 42.7 | 46.9 | 43.8 | 39.1 | 42.0 |
| Predator | 89.0 | 89.9 | 90.6 | 88.5 | 86.6 | 88.9 | 59.8 | 61.2 | 62.4 | 60.8 | 58.1 | 60.5 |
| CoFiNet | 89.3 | 88.8 | 88.7 | 87.8 | 87.0 | 88.3 | 67.3 | 66.9 | 64.5 | 63.1 | 62.0 | 64.8 |
| SDT (ours) | 91.0 | 90.4 | 90.3 | 90.5 | 90.1 | 90.5 | 71.3 | 71.0 | 70.7 | 71.4 | 70.1 | 70.9 |
Significant values are in bold.
To further verify the generalization ability of our proposed method, we train the model on the first 20 categories of ModelNet40 and test it on the remaining 20 unseen categories. Table 5 shows the RRE, RTE and CD of our SDT and other methods on the unseen categories. The performance of our SDT is on par with that of REGTR and better than that of DCP-v254, RPM-Net55 and Predator. The results also show that our SDT has strong generalization ability and registers better in low overlap scenarios. The registration results of our SDT and Predator on ModelNet40 and ModelLoNet40 are shown in Fig. 9, where the diagrams in the 1st and 2nd rows display raw data from ModelNet40 and their registration results, and the diagrams in the 3rd and 4th rows display raw data from ModelLoNet40 and their registration results. Our SDT clearly outperforms Predator on ModelLoNet40 (see the comparison of Predator and SDT in the 3rd and 4th rows), thanks to the locally salient features obtained from the deformable self-attention and cross-attention.
Table 5.
Point cloud registration experiment with unknown object category.
| Method | ModelNet40 | | | ModelLoNet40 | | |
|---|---|---|---|---|---|---|
| | RRE | RTE | CD | RRE | RTE | CD |
| DCP-v2 | 11.975 | 0.171 | 0.0117 | 16.501 | 0.300 | 0.0268 |
| RPM-Net | 1.712 | 0.018 | 0.00085 | 7.342 | 0.124 | 0.0050 |
| Predator | 1.739 | 0.019 | 0.00089 | 5.235 | 0.132 | 0.0083 |
| REGTR | 1.476 | 0.014 | 0.00079 | 3.934 | 0.088 | 0.0038 |
| SDT (ours) | 1.614 | 0.013 | 0.00085 | 3.915 | 0.078 | 0.0041 |
Significant values are in bold.
Figure 9.
Registration Result on ModelNet40 and ModelLoNet40.
Ablation experiments
We perform ablation experiments on the 3DMatch dataset to explore the impact of different components of our SDT on the registration results. Specifically, we test the effect of our network with the following variants of its modules.
DGCNN / no Sinkhorn. DGCNN replaces our feature extraction (FE) network in the feature extraction and embedding module, and the Sinkhorn algorithm is removed from the overlap correspondence prediction module.
DGCNN / Sinkhorn. DGCNN replaces our FE network in the feature extraction and embedding module, and the Sinkhorn algorithm is added to the overlap correspondence prediction module.
Graph neural network / no Sinkhorn. A graph neural network replaces the SDT module, and the Sinkhorn algorithm is removed from the overlap correspondence prediction module.
Graph neural network / Sinkhorn. A graph neural network replaces the SDT module, and the Sinkhorn algorithm is added to the overlap correspondence prediction module.
Self/cross attention based / no Sinkhorn. The original self-attention and cross-attention replace the deformable self-attention and deformable cross-attention in the SDT module, and the Sinkhorn algorithm is removed from the overlap correspondence prediction module.
Self/cross attention based / Sinkhorn. The original self-attention and cross-attention replace the deformable self-attention and deformable cross-attention in the SDT module, and the Sinkhorn algorithm is added to the overlap correspondence prediction module.
Deformable self/cross attention based / no Sinkhorn. The deformable self-attention and cross-attention are kept in the SDT module, and the Sinkhorn algorithm is removed from the overlap correspondence prediction module.
Our model (SDT). The deformable self-attention and cross-attention are kept in the SDT module, and the Sinkhorn algorithm is added to the overlap correspondence prediction module.
The results in Table 6 demonstrate that our SDT is essential for solving rigid transformations in low overlap scenarios. Our proposed SDT has the best registration performance; its FMR and RR on 3DLoMatch are raised by more than 6% over the DGCNN baseline. The results also show that deformable attention and standard attention improve FMR, IR and RR in point cloud registration more than the graph neural network and DGCNN. Finally, it is worth noting that adding the Sinkhorn algorithm improves the registration results of the DGCNN, graph neural network, self/cross attention and deformable self/cross attention variants.
Table 6.
Comparison of FMR, IR and RR on network model.
| Model | FMR | | IR | | RR | |
|---|---|---|---|---|---|---|
| | 3DMatch | 3DLoMatch | 3DMatch | 3DLoMatch | 3DMatch | 3DLoMatch |
| DGCNN/no Sinkhorn | 93.2 | 80.7 | 47.5 | 25.2 | 82.1 | 64.2 |
| DGCNN/Sinkhorn | 94.0 | 81.6 | 54.7 | 30.3 | 84.3 | 67.1 |
| Graph neural network/no Sinkhorn | 95.1 | 82.1 | 48.2 | 26.6 | 84.3 | 64.7 |
| Graph neural network/Sinkhorn | 95.9 | 82.9 | 54.5 | 29.2 | 87.5 | 67.8 |
| Self/Cross attention based/no Sinkhorn | 96.1 | 83.3 | 48.8 | 30.5 | 86.7 | 66.9 |
| Self/Cross attention based/Sinkhorn | 97.0 | 83.8 | 64.3 | 35.7 | 88.7 | 68.2 |
| Deformable self/cross attention Based/no Sinkhorn | 97.2 | 85.9 | 65.1 | 37.1 | 90.4 | 70.1 |
| SDT (ours) | 97.5 | 86.8 | 67.3 | 40.0 | 91.0 | 71.4 |
Significant values are in bold.
Following that, we investigate the design of geometric structure embedding from the following aspects.
Number of neighbor points. We change the number of nearest neighbors used to compute the triplet-wise distance/angle embedding of $p_i$ or $p_j$ in Fig. 3.
$\sigma_d$. This constant controls the sensitivity to distance change in formula (1); we vary this hyper-parameter to test its impact on registration performance.
$\sigma_a$. This constant controls the sensitivity to angle change in formula (2); we vary this hyper-parameter to test its impact on registration performance.
The results in Table 7 show the impact of these hyper-parameters on registration performance. The model with both distance and angle embeddings outperforms the model with only the distance embedding by a significant margin, which aligns with our goal. Increasing the number of neighbors improves registration performance by providing more exact structural information when the number of neighbor points is less than or equal to 3, but performance decreases when the number of neighbor points is greater than 3, which indicates that the geometric structure formed by a reference point and its closest three neighbors is the most robust to noise and the most invariant to rigid transformation. Meanwhile, Table 7 shows that the best results are obtained around 0.2 for $\sigma_d$ and 10° for $\sigma_a$. A too small $\sigma_d$ (where the embedding is overly sensitive to distance changes) or a too large $\sigma_d$ (where the embedding neglects small distance variations) can harm performance, but the differences are not significant, and similar observations hold for the angle sensitivity $\sigma_a$. Nevertheless, all of these models outperform previous methods by a large margin, indicating that our proposed SDT is robust to the distance/angle hyper-parameters.
Table 7.
Comparison of FMR, IR and RR on hyper-parameters.
| Model | | FMR | | IR | | RR | |
|---|---|---|---|---|---|---|---|
| Name | Value | 3DMatch | 3DLoMatch | 3DMatch | 3DLoMatch | 3DMatch | 3DLoMatch |
| Number of neighbor points | 0 | 92.0 | 82.1 | 63.1 | 35.3 | 84.7 | 66.1 |
| | 1 | 95.9 | 84.1 | 62.6 | 33.7 | 87.0 | 68.5 |
| | 3 | 97.0 | 83.8 | 64.3 | 35.7 | 88.7 | 68.2 |
| | 5 | 95.9 | 84.2 | 63.4 | 33.5 | 88.3 | 67.5 |
| | 7 | 96.1 | 83.8 | 60.0 | 33.8 | 85.8 | 66.5 |
| $\sigma_d$ | 0.1 | 95.6 | 83.7 | 61.4 | 33.8 | 90.7 | 69.2 |
| | 0.2 | 96.4 | 83.8 | 63.0 | 37.4 | 91.8 | 69.5 |
| | 0.3 | 97.5 | 86.8 | 67.3 | 40.0 | 91.0 | 71.4 |
| | 0.4 | 96.1 | 87.0 | 65.3 | 36.0 | 87.8 | 70.0 |
| | 0.5 | 95.8 | 86.3 | 63.3 | 34.1 | 85.0 | 70.2 |
| $\sigma_a$ | 5° | 95.9 | 80.4 | 66.9 | 39.4 | 90.3 | 70.6 |
| | 10° | 97.5 | 86.8 | 67.3 | 40.0 | 91.0 | 71.4 |
| | 15° | 96.0 | 86.4 | 68.7 | 39.6 | 91.8 | 73.2 |
| | 20° | 95.9 | 86.0 | 67.3 | 39.0 | 91.4 | 70.6 |
| | 25° | 96.1 | 85.7 | 66.6 | 39.3 | 91.2 | 70.1 |
Significant values are in bold.
Efficiency
We compare the inference time of several methods on a desktop computer equipped with an Intel i7-12700 CPU, an NVIDIA GTX 3060 GPU and 32 GB of memory. Computational time is measured in seconds and averaged over 100 runs. As shown in Table 8, FCGF is the fastest method, and RPM-Net and D3Feat are also faster than our proposed SDT, because their networks are relatively simple and none of them adopts a Transformer structure. Our proposed SDT, which leverages deformable self/cross attention, is faster than DCP-v2, Predator, REGTR and CoFiNet, which use Transformer structures with standard self/cross attention. The results in Table 8 clearly show that a Transformer with deformable self/cross attention (such as our proposed SDT) has higher time efficiency than a Transformer with standard self/cross attention.
Table 8.
Inference time (in seconds).
| Model | ModelNet | 3DMatch |
|---|---|---|
| FCGF | 0.16 | 0.17 |
| RPM-Net | 0.19 | 0.22 |
| D3Feat | 0.21 | 0.28 |
| DCP-v2 | 0.74 | 0.85 |
| Predator | 0.30 | 0.38 |
| REGTR | 0.42 | 0.51 |
| CoFiNet | 0.83 | 0.94 |
| SDT (ours) | 0.24 | 0.29 |
Significant values are in bold.
Conclusion
We propose a 3D point cloud registration method based on SDT. First, we propose a feature extraction and embedding module that extracts basic features of the point clouds and computes local spatial relationships between points as local position embeddings of the basic features, and we formulate an SDT module that fuses and enhances these two kinds of information into new point cloud features through deformable self-attention and cross-attention. Second, we develop an overlap correspondence prediction module that obtains correspondences between the pairwise point clouds through a series of operations on the new features of a pair of point clouds. Finally, we construct an $\alpha$-balanced cross-entropy loss based on ground-truth correspondences of pairs of point clouds to train our network, and we use the network outputs to generate the transformation matrices of pairs of point clouds via the RANSAC algorithm. Extensive experimental results on 3DMatch/3DLoMatch and ModelNet40/ModelLoNet40 demonstrate that our proposed method has high accuracy and strong robustness in solving point cloud registration problems in low overlap scenarios. Unfortunately, the simultaneous use of the SDT module and the RANSAC approach leads to longer training and computation times. Improving model performance, developing more efficient feature extraction and aggregation techniques, and extending our approach to more complicated scenarios remain to be studied in future work.
Ethical and informed consent
Data used in our study are publicly available, and ethical approval and informed consent were obtained in each original study.
Acknowledgements
The authors thank the anonymous reviewers for their detailed comments and suggestions, which resulted in the improvement of this paper. This work was supported in part by the National Natural Science Foundation of China under Grant 62272426, in part by the Shanxi Province Science and Technology Major Special Plan "Unveiling and Leading" Project under Grant 202201150401021, and in part by the Shanxi Provincial Natural Science Foundation under Grant 202203021212138.
Author contributions
Xiong Fengguang: Methodology providing and implementing, original draft writing. Kong Yu: Manuscript reviewing and reediting. Xie Shuaikang: Experiment testing. Liqun Kuang: Manuscript reviewing and funding acquisition. Xie Han: Funding acquisition and resources.
Data availability
The datasets generated during and/or analyzed during our study are available from the corresponding author on reasonable request.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Rodriguez-Lozano FJ, Gámez-Granados JC, Martínez H, et al. 3D reconstruction system and multiobject local tracking algorithm designed for billiards. Appl. Intell. 2023;53:21543–21575. doi: 10.1007/s10489-023-04542-3. [DOI] [Google Scholar]
- 2.Sun C, Miao L, Wang M, et al. Research on point cloud hole filling and 3D reconstruction in reflective area. Sci. Rep. 2023;13:18524. doi: 10.1038/s41598-023-45648-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Liu J, Gao Y, Jiang X, et al. Online object-level SLAM with dual bundle adjustment. Appl. Intell. 2023;53:25092–25105. doi: 10.1007/s10489-023-04854-4. [DOI] [Google Scholar]
- 4.Zhu J, Li H, Zhang T. Camera, LiDAR, and IMU based multi-sensor fusion SLAM: A survey. Tsinghua Sci. Technol. 2024;29(2):415–429. doi: 10.26599/TST.2023.9010010. [DOI] [Google Scholar]
- 5.Tao C, Bian W, Wang C, et al. 3D object detection algorithm based on multi-sensor segmental fusion of frustum association for autonomous driving. Springer; 2023. [Google Scholar]
- 6.He X, et al. Fear-neuro-inspired reinforcement learning for safe autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell. 2024;46(1):267–279. doi: 10.1109/TPAMI.2023.3322426. [DOI] [PubMed] [Google Scholar]
- 7.Besl PJ, McKay ND. Method for registration of 3-D shapes[C]//Sensor fusion IV: control paradigms and data structures. Spie. 1992;1611:586–606. [Google Scholar]
- 8.Yang J, Li H, Jia Y. Go-ICP: Solving 3d registration efficiently and globally optimally[C]//Proc. of the IEEE International Conference on Computer Vision 1457–1464 (2013).
- 9.Choy C, Park J, Koltun V. Fully convolutional geometric features[C]//Proc. of the IEEE/CVF International Conference on Computer Vision 8958–8966 (2019).
- 10.Zhang Z, Liu Q, Wang Y. Road extraction by deep residual u-net. IEEE Geosci. Remote Sens. Lett. 2018;15(5):749–753. doi: 10.1109/LGRS.2018.2802944. [DOI] [Google Scholar]
- 11.Mateus A, Ranade S, Ramalingam S, et al. Fast and accurate 3D registration from line intersection constraints. Int. J. Comput. Vis. 2023;131:2044–2069. doi: 10.1007/s11263-023-01774-1. [DOI] [Google Scholar]
- 12.Ao S, Hu Q, Yang B, et al. Spinnet: Learning a general surface descriptor for 3d point cloud registration[C]// Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11753–11762 (2021).
- 13.X. Bai, Z. Luo, L. Zhou, et al. D3feat: Joint learning of dense detection and description of 3d local features[C]// Proc. CVPR, 2020: 6359–6367.
- 14.Thomas H, Qi C R, Deschaud J E, et al. Kpconv: Flexible and deformable convolution for point clouds[C]//Proc. of the IEEE/CVF International Conference on Computer Vision 6411–6420 (2019).
- 15.Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need [J]. Advances in neural information processing systems, 2017, 30.
- 16.Huang S, Gojcic Z, Usvyatsov M, et al. Predator: Registration of 3d point clouds with low overlap[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 4267–4276.
- 17.Li Y, Harada T. Lepard: Learning partial point cloud matching in rigid and deformable scenes[C]//Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 5554–5564 (2022).
- 18.Fischler MA, Bolles RC. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography [J] Commun. ACM. 1981;24(6):381–395. doi: 10.1145/358669.358692. [DOI] [Google Scholar]
- 19.Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION.ICLR 2021 - 9th International Conference on Learning Representations.
- 20.Xia Z, Pan X, Song S, et al. Vision transformer with deformable attention[C]//Proc. of the IEEE/CVF conference on computer vision and pattern recognition 4794–4803 (2022).
- 21.Segal A, Haehnel D, Thrun S. Generalized-icp[C]//Robotics: Science and Systems. Seattle; 2009. [Google Scholar]
- 22.Bouaziz S, Tagliasacchi A, Pauly M. Sparse iterative closest point[C]//computer graphics forum. Oxford UK Blackwell Publ. Ltd. 2013;32(5):113–123. [Google Scholar]
- 23.Pomerleau F, Colas F, Siegwart R. A review of point cloud registration algorithms for mobile robotics. Found. Trends® Robot. 2015;4(1):1–104. doi: 10.1561/2300000035. [DOI] [Google Scholar]
- 24.Rusu, R B, Blodow, N, Marton, Z C, et al. Aligning point cloud views using persistent feature histograms[C]// IEEE/RSJ international conference on intelligent robots and systems IEEE 3384–3391 (2008).
- 25.Rusu R B, Blodow N, Beetz M. Fast point feature histograms (FPFH) for 3D registration[C]//2009 IEEE international conference on robotics and automation. IEEE 3212–3217 (2009).
- 26.Salti S, Tombari F, Di Stefano L. SHOT: Unique signatures of histograms for surface and texture description [J] Comput. Vision Imag. Underst. 2014;125:251–264. doi: 10.1016/j.cviu.2014.04.011. [DOI] [Google Scholar]
- 27.Guo Y, Sohel F A, Bennamoun M, et al. RoPS: A local feature descriptor for 3D rigid objects based on rotational projection statistics[C]//2013 1st International Conference on Communications, Signal Processing, and their Applications (ICCSPA) IEEE 1–6 (2013).
- 28.do Monte Lima J P S, Teichrieb V. An efficient global point cloud descriptor for object recognition and pose estimation[C]//2016 29th SIBGRAPI conference on graphics, patterns and images (SIBGRAPI). IEEE 56–63 (2016).
- 29.Deng Z, Yao Y, Deng B, et al. A robust loss for point cloud registration[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision 6138–6147 (2021).
- 30.Ao S, Hu Q, Wang H, et al. BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 1255–1264 (2023).
- 31.MR-DCAE: Manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification, International Journal of Intelligent Systems vol. 36, no. 12, pp. 7204–7238 (2021).
- 32.Fine-grained modulation classification using multi-scale radio transformer with dual-channel representation, IEEE Communications Letters vol 26 no. 6, pp 1298–1302 (2022).
- 33.Application of wavelet-packet transform driven deep learning method in PM2. 5 concentration prediction: A case study of Qingdao, China Sustainable Cities and Society 92 104486 (2023).
- 34.DL-PR: Generalized automatic modulation classification method based on deep learning with priori regularization, Engineering Applications of Artificial Intelligence 122 106082 (2023).
- 35.Yew Z J, Lee G H. Regtr: End-to-end point cloud correspondences with transformers[C]//Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 6677–6686 (2022).
- 36.Sarode V, Li X, Goforth H, et al. Pcrnet: Point cloud registration network using pointnet encoding[J]. arXiv preprint arXiv:1908.07906, 2019.
- 37.Yuan W, Eckart B, Kim K, et al. Deepgmr: Learning latent gaussian mixture models for registration[C]// In: Yuan W, Eckart B, Kim K, et al., editors. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Part V 16. Springer International Publishing; 2020. pp. 733–750. [Google Scholar]
- 38.Choy C, Dong W, Koltun V. Deep global registration[C]//Proc. of the IEEE/CVF conference on computer vision and pattern recognition 2514–2523 (2020)
- 39.Sinkhorn R, Knopp P. Concerning nonnegative matrices and doubly stochastic matrices [J] Pac. J. Math. 1967;21(2):343–348. doi: 10.2140/pjm.1967.21.343. [DOI] [Google Scholar]
- 40.Kuhn HW. The Hungarian method for the assignment problem[J] Nav. Res. Logist. Q. 1955;2(1–2):83–97. doi: 10.1002/nav.3800020109. [DOI] [Google Scholar]
- 41.Xu M, Ding R, Zhao H, et al. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds[C]//Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 3173–3182 (2021).
- 42.He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proc. of the IEEE conference on Computer Vision and Pattern Recognition 770–778 (2016).
- 43.Qin Z, Yu H, Wang C, et al. Geometric transformer for fast and robust point cloud registration[C]//Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11143–11152 (2022).
- 44.Zhu X, Su W, Lu L, et al. Deformable detr: Deformable transformers for end-to-end object detection [J]. Preprint @ http://arXiv.org/2010.04159 (2020).
- 45.Liu Z, Hu H, Lin Y, et al. Swin transformer v2: Scaling up capacity and resolution[C]//Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 12009–12019 (2022).
- 46.Li J, Lee G H. Usip: Unsupervised stable interest point detection from 3d point clouds[C]//Proc. of the IEEE/CVF International Conference on Computer Vision 361–370 (2019).
- 47.Li J, Chen B M, Lee G H. So-net: Self-organizing network for point cloud analysis[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition 9397–9406 (2018).
- 48.Yang Y, Feng C, Shen Y, et al. Foldingnet: Point cloud auto-encoder via deep grid deformation[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition 206–215 (2018).
- 49.Wang N. The use of bilinear interpolation filter to remove image noise[C]//journal of physics: Conference series. IOP Publ. 2022;2303(1):012089. [Google Scholar]
- 50.Sarlin P E, DeTone D, Malisiewicz T, et al. Superglue: Learning feature matching with graph neural networks[C]//Proc. of the IEEE/CVF conference on computer vision and pattern recognition 4938–4947 (2020).
- 51.Zhilei C, et al. UTOPIC: Uncertainty-aware overlap prediction network for partial point cloud registration [J] Comput. Gr. Forum. 2023;41(7):87–98. [Google Scholar]
- 52.Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proc. of the IEEE International Conference on Computer Vision 2980–2988 (2017).
- 53.Yu H, Li F, Saleh M, et al. Cofinet: Reliable coarse-to-fine correspondences for robust pointcloud registration [J] Adv. Neural Inf. Process. Syst. 2021;34:23872–23884. [Google Scholar]
- 54.Wang Y, Solomon J M. Deep closest point: Learning representations for point cloud registration[C]//Proc. of the IEEE/CVF international conference on computer vision 3523–3532 (2019).
- 55.Yew, Z J, Lee, G H. Rpm-net: Robust point matching using learned features[C]//Proc. of the IEEE/CVF Conference onComputer Vision and Pattern Recognition 11824–11833 (2020).