Abstract
Point set is a major type of 3D structure representation format characterized by its data availability and compactness. Most former deep learning-based point set models pay equal attention to different point set regions and channels, thus having limited ability in focusing on small regions and specific channels that are important for characterizing the object of interest. In this paper, we introduce a novel model named Attention-based Point Network (AttPNet). It uses attention mechanism for both global feature masking and channel weighting to focus on characteristic regions and channels. There are two branches in our model. The first branch calculates an attention mask for every point. The second branch uses convolution layers to abstract global features from point sets, where channel attention block is adapted to focus on important channels. Evaluations on the ModelNet40 benchmark dataset show that our model outperforms the existing best model in classification tasks by 0.7% without voting. In addition, experiments on augmented data demonstrate that our model is robust to rotational perturbations and missing points. We also design a Electron Cryo-Tomography (ECT) point cloud dataset and further demonstrate our model’s ability in dealing with fine-grained structures on the ECT dataset.
Keywords: point cloud, attention mechanism, deep neural network
1. Introduction
Point cloud is a main type of geometric data representation of 3D structures. In addition to techniques such as photogrammetry, the rapid development of sensors such as Velodyne spinning Light Detection and Ranging (LIDAR) and tilting laser scanner also makes it drastically easy to collect structural information of the real world using point clouds. This results in broad applications of the combination between photogrammetry and laser scanning techniques. For example, the authors of [1] integrate photogrammetry and laser scanning techniques to model digital 3D dinosaur footprints. Point clouds are easy to learn from because of their expressive and compact representation [2]. Furthermore, compared with volumetric image representations, point cloud takes up significantly less storage when representing the same structure.
In recent years, deep neural network has become a major tool for image analysis. Deep learning is also increasingly popular for analyzing point set data due to its large scale learning capacity. Since the invention of PointNet [3], which directly handles point sets, most recent works extract the global features of a point set by grouping and aggregating features of all the individual points. However, they are limited to detecting the structural differences between distinct objects. Therefore, when confronting similar and complicated structures, the above models may not perform well on classification and segmentation tasks.
For this reason, we aim at specifically handling fine-grained structures by proposing a novel deep learning model named AttPNet (Figure 1). The model is characterized by attention-based global feature masking and channel weighting which correspond to the global attention module and CW-EdgeConv (see Figure 2). The whole end-to-end model (Figure 2) takes N points as an input and learns a global feature for classification and segmentation tasks. There are two main branches in the model. The first branch focuses on the effect of each local point, therefore outputting a global mask at the end of the branch, weighting the contribution of each point to the analysis task. In order to focus on the most discriminative regions of the input structure, we multiply the global feature by the mask to obtain our final attention-based feature. Another branch outputs global geometric information in the form of a two-dimensional tensor by concatenating each point’s feature. We use channel weighting in this branch to focus on informative and distinct channels.
Experiments show that our model outperforms existing models on the most widely used ModelNet40 benchmark dataset. Note that on the ModelNet40 leaderboard, the 93.6% result of RSCNN trains multiple models to vote for the final decision. For a fair comparison, following the practice of most of deep learning papers, we compared our method with other models on ModelNet40 without voting. The key reason why our work outperforms other models is that we innovatively introduce the attention mechanism to point cloud feature extraction. Former models like PointNet [3] and PointNet++ [4] do not distinguish the importance of each point. However, every point plays a unique role in characterizing the overall structure. Therefore, we let our model assign every single point its own weight in the feature integration phase. Moreover, the squeeze-and-excitation operation [5] used for the channel attention in every convolutional layer also makes the model focus on the important channels of features that representing the internal geometric information in high dimensional space.
Our main contributions are summarized as follows.
We propose a novel model named AttPNet which uses attention mechanism for both global feature masking and channel weighting to focus on characteristic regions and channels.
Our model achieved 93.6% accuracy of overall instances on ModelNet40 benchmark dataset without voting and outperforms the existing best point set model by 0.7%. Given that the performance improvement is slow in recent years, the performance improvement of our model is significant.
Experiments show that our model generalizes better on test data with random translation, rotation, and missing points perturbations Table 1.
Table 1.
Method | Translation | R | R | R |
---|---|---|---|---|
Ours | 93.4 | 93.2 | 92.1 | 86.3 |
PointNet++ | 90.6 | 90.3 | 88.6 | 83.8 |
2. Related Work
2.1. Point Cloud Networks
2.1.1. Projection and Voxelization
Before the invention of PointNet [3], the deep learning methods for point clouds can be divided into several types. The most important two techniques are projection and voxelization. The authors of [6,7,8] project 3D point clouds into 2D images from multiple angles of view and feed 2D images into traditional 2D convolutional layers. These approaches dominate for a long period due to efficiency but they are limited by the problem of occluded objects. The authors of [9] propose a method that partially solve object occlusion by aggregating different views from sensors. Voxelization [6,10,11,12] is also a popular type of approach that subsample point clouds into volumetric grids so as to utilize 3D convolutional layers. Such methods are mainly constrained by the inflexible resolution and high computational and storage cost. The authors of [13] propose a novel solution similar to voxelization which projects the point clouds into high-dimension lattice and applies bilateral convolution layers [14]. Splatnet achieves competitive outcomes on several data sets compared with pointnet++ [4]. Octnet [15] use unbalanced octrees to hierarchically partition the space through exploring the sparsity in 3D volumetric data. Each leaf node of the unbalanced octree stores a pooled feature representation.
2.1.2. PointNet & PointNet++
PointNet [3] is a pioneering work that directly consumes point clouds and utilizes symmetric functions such as max pooling to respect the permutation invariance of points. It is highly efficient and achieved better results than previous work. PointNet++ [4] is an improved version of PointNet. By applying hierarchical abstraction layers, it is capable of learning local features with increasing contextual scales and has significantly better results on several benchmark datasets than PointNet.
2.1.3. Graph Networks
The authors of [16] propose a new module (EdgeConv) which acts on graphs dynamically computed in each layer of the network. The design of the dynamic graph module can also learn both local neighborhood information and global shape properties. The architecture of AttPNet model is mainly based on dynamic graph network (DGCNN). Key differences between AttPNet and DGCNN include the extension of original EdgeConv and global attention module. Apart from these two distinct differences, we made some minor structural adjustments to the network such as feature dimension, number of module and the selection of activation function. The authors of [17] use recursive feature aggregation on a nearest-neighbor graph computed from 3D positions to generate local high-dimensional features and also defines a point-set kernel in analogy to 2D convolution kernels for images. The authors of [18] propose a model named GS-Net to deal with data rotation and translation. It adopts Eigen-Graph to collect geometric information from points in a distance. For points in neighbors, this algorithm combines both Euclidean space and Eigenvalue space to generate features.
2.1.4. Point Convolution
Recently, there is an increasing interest in designing convolutions that directly operates on point clouds, inspired by the great performance of CNN on 2D images. To design a point convolution network, the authors of [19,20,21] attempt to construct continuous kernel functions to convolve on local points. PointConv [19] uses a Multi-Layer Perceptron (MLP) to fit a kernel due to its ability to approximate an arbitrary continuous function. It also consumes the inverse density as a feature to convolve with the kernel function. SpiderCNN [20] found that the MLP did not work well on approximating the kernels, so the authors propose the order-3 Taylor term which is a family of polynomial functions applied with different weights to enrich the complexity of the filters. Flex convolution [21] utilizes linear functions to act as a kernel which is actually an order-1 Taylor term of SpiderCNN. Structure-aware Convolution (SAC) [22] matches neighbor points in the point cloud through 3D convolution to extract geometric features. These convolution works all have significant improvements in several data sets but the training and inference time are much longer than PointNet++ (usually double).
Although networks like SpiderCNN [20] and DGCNN [16] incorporate local neighborhood information, these extraction steps are region-wise. Strategies mentioned above work well on classification tasks between distinctive categories. However, they only consider the global and neighbor effect between groups of points but ignore the location and other hidden information of a single point. Our approach, AttPNet, has a point-wise branch to solve this problem.
2.1.5. Sequence Network
The authors of [23] employ a sequence model to capture the correlations by aggregating multi-scale areas of each local region with attention. Point2Sequence utilized LSTM [24] as the main module of the encoder and decoder to highlight the importance of different area scales. However, due to the introduction of LSTM, the model is hard to train and needs more time to converge.
2.2. Attention-Based Methods in Computer Vision
The attention mechanism has been well studied in computer vision and help achieve great improvement in scene analysis [25]. From the era of deep learning, attention mechanism was widely known because of a sequence model [26] in translation to focus on key words in natural language processing. In recent years, it also demonstrated useful in extracting the core information in images. Such approaches can be divided into hard attention and soft attention. The work in [27] uses a classical method of hard attention. The author proposes an APM module to focus and crop the distinct area in fine-grained classification tasks. This hard-attention module only acts in looking again (in comparison to look once in YOLO model) at the crucial area and ignoring all other pixels. To resolve the problem of non-differentiable cropping, many researchers attempt to utilize the soft attention mechanism which learns an alignment weight and place it on all pixels such as [28,29]. From another perspective, the attention mechanism can also be separated into two parts: spatial-domain attention and channel-domain attention. SENet [5], as a championship winner in 2017 ILSVR, utilized SE block, which can be regarded as a channel-domain attention module that adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. SENet produces significant performance improvements at little computational cost and initiates the methods on channel-wise recalibration and attention.
There are also few works using the idea of attention mechanism to improve the results on classification and segmentation. The work in [30] includes a simple contextual modeling mechanism to automatically select contextual region and aggregate features. The work in [31] uses a parameter-efficient Group Shuffle Attention (GSA) and develops Point Attention Transformers (PATs) to construct an end-to-end learnable model. The work in [32] introduces a geometry-attentional network which combines features from geometry-aware convolution, attention module and other hierarchical architectures. The work in [33] proposes an local relation learning module based on the attention mechanism in order to extract local features. However, the improvement of these works on point cloud datasets such as ShapeNet and ModelNet40 are limited. The best classification result of the work in [30] on ModelNet40 is 90.0% and best part segmentation result is 84.6% (mean class accuracy) which are both lower than the results of PointNet++ [4]. For the model [31], the classification result on ModelNet40 is 91.7%. The authors of [34] use a Graph Attention Convolution (GAC) to solve semantic segmentation tasks, but their attention mechanism is based on the subgraph of a point cloud and only accept neighbor feature as input. In contrast, our model applies a different design of attention mechanism which combines global feature and channel feature during training process and gains significant improvement.
3. Method
In this section, we first describe the CW-EdgeConv and the global attention module. Then, we overview the whole model for classification and segmentation. Finally, we compare several structures of attention modules.
3.1. Channel Weighting Edge Convolution (CW-EdgeConv) Module
Our CW-EdgeConv module is an extension of EdgeConv and it consists of four steps: (1) calculate k nearest neighbors using kNN query, (2) map low-dimensional geometrical features to high-dimensional features using Multilayer Perceptron (MLP) [35], (3) channel weighting, and (4) aggregate features of nearest neighbor points into features of a single point. The original EdgeConv will be described in the last of this subsection.
The first step is kNN query, which inputs a set of points and calculates the k nearest neighbors for each point. Specifically, consider an point set input , where N is the total number of points and C is the dimension of geometrical features of a point, such as 3D location and normal. Given that our model does not resample points before each CW-EdgeConv layer, the number of points considered remains N. For each , we define a subset centered at it and choose nearest points except the center . Therefore, a kNN query of can be calculated as
(1) |
where is the k th nearest point from , calculated using the kNN query. Therefore, the grouped input can be represented by
(2) |
We apply the kNN query method to group the point set in each layer due to simplicity and less inference time.
The second step is using MLP to map low dimensional geometrical features to high-dimensional features. These low dimensional geometrical features include the edge feature in form of and the original input points , where . The choice of such features strictly follows the best option in EdgeConv [16]. Given such features, we use MLP to calculate high-dimensional features. Specifically, we apply a 2D convolutional layer followed by a batch normalization layer [36] and a ReLU activation function [37]. We use the following notation to represent this convolutional operation of one group.
Note that is shared through all groups in that it works as a nonlinear function to discover the intrinsic features of each group in high dimensional space such as density, mean distance, etc. This is achieved by extracting the correlation of the input geometric features .
For the third step, given the middle features outputted from convolutional layers , we apply channel weighting on these middle features by adapting a squeeze and excitation block (SE-Block2d) [5] layer. The architecture of SE-Block is shown in Figure 3. Here, we simply abbreviate the SE-Block as .
(3) |
where is the number of the output channels of .
We made two modifications to the SE-Block [5]: (1) We adapt a 1d channel weighting model to fit the dimension of the concatenated feature; (2) We keep the original channel size of a layer in the block because the reduction of layer parameters limits the performance of channel weighting.
In the fourth step, we aggregate features of k nearest neighbor points into features of a single point . This is similar to 2D convolution networks that each pixel value should be aggregated from several values of a kernel. Here, we follow the convention of PointNet, PointNet++, and EdgeConv; the aggregation function is instead of ∑. The output for a group centered at is calculated as follows.
(4) |
Finally, the output of the whole CW-EdgeConv is calculated as follows.
(5) |
We denote the output of the l th CW-EdgeConv layer as .
After the output of the last CW-EdgeConv layer, we further utilize an shared MLP and a SE-1d block to obtain the global feature .
(6) |
Remarks: The only difference between first CW-EdgeConv++ layer and following CW-EdgeConv layers is that there are additional geometric features for the global attention module (see more detail in the next subsection). The form of this additional output is represented as
(7) |
where , , denotes the euclidean distance, and k specifies the number of points in a group.
Remarks: The original EdgeConv module only contains step 1, 2, and 4 of CW-EdgeConv. Compared with Equation (4), the output for a group centered at in EdgeConv can be calculated as
(8) |
3.2. Global Attention Module
The input of this module is the output of the CW-EdgeConv++ module (see Figure 2). Similar to the channel attention in SENet [5], we utilize two 2D convolutional layers to reduce the dimensions of grouped features (the input of this module) and one sigmoid function to generate the soft attention mask (Figure 2). For a specific point group centered at , the importance is calculated as
(9) |
where the number of output channels of is one and Sigmoid denotes the sigmoid activation function which is . Finally, the module outputs a learned soft mask .
The motivation of this design is simple: We consider the classification task as an example. Each object class has its characteristic patterns that make it distinct from other classes. Examples of such characteristic patterns include the string of guitars, the wings of airplanes, etc. Such characteristic patterns may be neglected due to excessive amount of features extracted during the pooling aggregation process. Therefore, it is necessary to measure the importance of each group and use such to weight the global feature by our learned soft mask .
The reason why we feed more pivotal geometric information (i.e., in Equation (7)) into the global attention module is to accelerate and improve the learning of the global soft mask . Though MLP can approximate any nonlinear functions theoretically such as high-order information like the square of the euclidean distance (2-order: ) from a group, experiments show that the model with high-order convolutional filters such as can achieve higher classification accuracy in several benchmarks [20]. To resolve this same problem in our model, inspired by this idea, we here feed additional pivotal geometric information (i.e., in Equation (7)) to assist the shared MLP to efficiently find features of characteristic patterns and determine the importance of every input point .
In summary, this module aims at automatically discovering the characteristic patterns of point clouds and generate a point-wise soft attention mask to multiply the global feature .
3.3. Architecture for Classification and Segmentation
After obtaining the mask from the global attention module and the global feature , we operate an element-wise multiplication between them and utilize the ReLU activation function to generate the new global feature denoting the after being masked.
For classification (Figure 2), we use both max-pooling and average-pooling to aggregate all points in global feature and concatenate them. Finally, we use a 3-layer MLP to output the classification scores. C, C/R, and C are dimensions of three neural layer of the MLP, respectively, where R is the reduction factor to reduce the amount of parameters.
For segmentation, similar to other approaches, we first tile the one-hot category label and concatenate it with the global feature and the output of ReLU and max-pooling on (Figure 2). The following 4-layer MLP eventually outputs the point-wise segmentation scores.
The selection of aggregation function through all points was actually discussed in a few researches [16]. Most models use the max-pooling other than average-pooling layer due to the convention inherited from PointNet. Intuitively, the max-pooling ought to be better than avg-pooling because the strongest activation is probably the most prominent feature of one class. However, the outcome of avg-pooling can also reflect an important trait of a class; otherwise, the models using avg-pooling will not have a reasonable result. In order to gather more valuable information, we choose to concatenate both results from avg-pooling and max-pooling layers into a complete vector for classification whose dimension is 2048.
3.4. Alternative Attention Modules
Inspired by convolution neural network models on 2D images and sequence models, we propose two other modules of attention mechanism (Figure 4) on point clouds as follows.
3.4.1. Global Hard-Attention Module
From Figure 4a, we can see that the global hard-attention module is similar to the one in Figure 2. The input is still represented by as mentioned above. First, we use shared MLP and sigmoid function to condense the high-dimensional geometric features. Then, we apply an average-pooling layer to extract the mean response of all features in a group.
(10) |
In order to suppress all redundant points, we construct the boolean mask by comparing the and 0.5 as below.
(11) |
Before operating the element-wise multiplication, we expand the channel dimension of the boolean mask as the same size of the global feature .
3.4.2. Spatial-Attention Edgeconv
As the global attention module only takes into account the low-dimensional geometric information, the performance may be limited by the lack of high-dimensional intrinsic features in all groups. Therefore, we propose a Spatial-Attention EdgeConv module by further integrating point-wise attention with EdgeConv, as shown in Figure 4b.
Specifically, following the notation in the CW-EgdeConv subsection, we have the input point set . Furthermore, we denote the as the group centered at by kNN query method. The output of the upper branch of Figure 4b is written as following, which is identical to EdgeConv.
(12) |
(13) |
In the lower branch of Figure 4b, we first concatenate geometric information represented by . Unlike the first CW-EdgeConv++ layer in Figure 2, does not include the euclidean distance due to the problem of gradient explosion. In practice, we found that the loss would become NaN after several mini-batches because of the numerical instability when computing the gradient of high-dimensional distance. We calculate the point-wise soft mask as follows.
(14) |
(15) |
In our implementation, contains a batch norm layer after the shared MLP.
Finally, combining the output of the two branches, the output of this whole module is calculated as
(16) |
where represents the soft mask being expanded as the same size of . Though the amount of parameters of Spatial-Attention EdgeConv seems to be less than CW-EdgeConv, the computational cost is more expensive than the operation and the performance is also inferior, as shown from the experiments (see Table 2).
Table 2.
Method | Input | OA (%) |
---|---|---|
PointNet [3] | 1024 | 89.2 |
PointNet++ [4] | 1024 | 90.7 |
PointNet++ [4] | 5000 + n | 91.9 |
PointCNN [38] | 1024 | 92.2 |
DGCNN [16] | 1024 | 92.2 |
PCNN [39] | 1024 | 92.3 |
SpiderCNN [20] | 1024 + n | 92.2 |
SpiderCNN-4 [20] | 1024 + n | 92.4 |
PointConv [19] | 1024 + n | 92.5 |
Point2seq [23] | 1024 | 92.6 |
RS-CNN [40] | 1024 | 92.9 |
SO-Net-2 [41] | 2048 | 90.9 |
SO-Net-3 [41] | 5000 + n | 93.4 |
Ours (BASELINE) | 1024 | 92.5 |
Ours (global attention) | 1024 | 92.9 |
Ours (hard attention) | 1024 | 92.5 |
Ours (multi-attention) | 1024 | 93.3 |
Ours (global + channel) | 1024 | 93.6 |
4. Experiment
4.1. Implementation Details
Our models are implemented in Pytorch. All training and testing experiments run on a single GPU (GTX 1080 Ti). We utilize the SGD optimizer with initial learning rate and cosine annealing scheduler ( training epochs & minimum learning rate = 0) [42]. Our models often reach within three hours and approximately take 16∼18 h on ModelNet40 to converge and achieve best results. The channel size of four EdgeConv layers (1 CW-EdgeConv++ and 3 CW-EdgeConv) for classification are sequentially.
4.2. Classification Results
Datasets
In the task of classification, we evaluate on several datasets ModelNet40 [10] and Electron Cryo-Tomography (ECT) [43]. ModelNet40 is a dataset made up of 40 common object categories with 100 CAD models per category, among which all the point sets are augmented by scaling, translation, and shuffling. The single-particle ECT [43] dataset consists of 3D images of seven classes of macrocellular structures. We apply constant sampling to generate 400 point cloud data for each class. Compared with other general point cloud dataset, the structures between different classes in ECT dataset are more similar to each other.
4.3. ModelNet40
In Table 2, we compare our model with existing state-of-the-art models on ModelNet40 datasets. For a fair comparison, we strictly follow the technique of training and data augmentation in DGCNN (translation, scale and shuffle). Besides, we forsake the voting test because decision by multiple models will largely increase the cost of time and space and conceal the real capability of a single model. Results showed that our (global+ channel attention) model achieves state-of-the-art (93.6%) when the input is 1024 points without majority voting. Other models with different attention mechanism also achieved improvement compared with existing state-of-the-art models with 1024 input points. represents the baseline model which contains only EdgeConv (no CW-EdgeConv and global attention module). The architecture of this baseline model is slightly different from the DGCNN [16]. (multi-attention) means that every Spatial-Attention EdgeConv layer is substituted for CW-EdgeConv and Global-Attention is removed. (hard-attention) represents the model only using the Hard-Attention module.
As shown in Table 2, the overall accuracy (OA) of our baseline model reaches 92.5%. Only with the extension of global attention module which increasing very few parameters, the model can achieve 92.9% OA. Replacing all EdgeConv with CW-EdgeConv and retaining the global attention module, the model performs 1.1% better than our baseline.
Actually, Spatial-Attention EdgeConv shares the same idea with Global-Attention module except for the number and location of masking. In order to carry out the ablation study of such attention mechanism, we first remove all SE-Blocks and the Global-Attention Module in Figure 2. Then, we replace different number of common EdgeConv with Spatial-Attention EdgeConv (from left to right in Figure 2) and compare the results in Table 3. From the outcomes, when the number of replacements is 0, which means only common EdgeConv in our model, the result reach 92.8%. Moreover, we find the model numbers 1 and 4 generate better results than the others. This inspire us that probably the fundamental geometric information extracted right after the first layer and the masking on the global feature are more important, thus prompting us to design the Global-Attention Module which can achieve a best trade-off between accuracy and complexity of time and space.
Table 3.
Number | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
OA (%) | 92.8 | 93.1 | 92.6 | 92.9 | 93.3 |
4.4. ECT
In the test on ECT dataset (Table 4), our model achieved accuracy with global attention and channel weighting. By contrast, PointNet poorly classified fine-grained structures with only accuracy probably because it only has global feature aggregation and does not extract local features. PointNet++ achieved when integrated the hierarchical local feature extraction. Thanks to the attention mechanism to focus on distinct parts of macro-molecules, our model outperforms existing methods and achieves 96.28%.
Table 4.
4.5. Part Segmentation Results
4.5.1. Dataset
We evaluate our models on ShapeNet for part segmentation. The ShapeNet dataset contains 16 categories of objects and consists of 50 different parts in total. Each category has been annotated with two to six parts unequally. The training and testing 3D point sets are 14,006 and 2874, respectively. The aim is to assign every point a part label from 0 to 49. The two evaluation metrics we used are the mean IoU of 16 classes and all instances same as in [3,4,13,19].
4.5.2. Shapenet
The sizes of 4-layer MLP channels in our segmentation model are 256, 256, 128, and 50. The number of the last channel is the amount of part labels in ShapeNet. In the training phase, we used 16 batch size and consumed approximately 10 G memory on GPU. The total training time is slightly more than that in classification. From Table 5, we can see that our model achieves competitive results comparing to the models with additional input (normal vectors). The mean IoU per class is 82.8% and per instance is 85.2%.
Table 5.
Considering the slight difference of architecture between AttPNet and DGCNN, it is evident that our CW-EdgeConv and global attention module have impact on the performance of whole model. Compared with minor improvement between previous state-of-the-art models (see Table 5), AttPNet achieves 0.5% improvement in class mIou than DGCNN.
4.6. Visualization of Attention
We visualize the global-attention mask on point clouds of ModelNet40 dataset. In Figure 5, the color from dark to light represents the soft-attention weight from high to low. Therefore, the darker area is the focus of our global-attention model. We can see that our model underlines the corner and boundary of objects such as airplanes, desks, chairs probably because the point groups of these regions have unique geometric information. Furthermore, our model automatically focuses on such characteristic regions as the strap of bags and the flowers of a vase which make them distinct from other classes.
4.7. Robustness
4.7.1. Missing Points
We study the robustness of our model to random input dropout compared with PointNet++ without retraining. Figure 6 showed that our approach still can achieve more than on both overall and average accuracy of ModelNet40 dataset with only a half of the original number of points. Moreover, the accuracy of our model is significantly better than PointNet++ when the number of input points is between 384 to 768.
4.7.2. Rotation and Translation Perturbations
We also compare the robustness to rotation and translation invariance between our model and PointNet++. The results (Table 1) demonstrate that our model is completely translation invariant and highly robust to the small-range rotation difference with training dataset. There is no point set rotation in data augmentation when training.
5. Additional Visualization of Rotated Attention Mask
We exploit the influence of rotation of point sets on the generation of our global attention mask. Figure 7 demonstrates that our global attention mask is robust to rotation of point sets. Take the first figure of the plane as an example, although there are some minor difference between two attention masks (such as the the head of the plane), most weight of points in the generated mask remains the same when the input rotates. It still pays more attention to margin areas such as the tail and wings than inner parts of the plane.
6. Additional Ablation Study on the ModelNet40 Dataset
Attention Mechanism. Addition to the examination of our attention mechanism in Section 4.2 and Table 3 of main text, we compare the accuracy of five different models with respect to the number of epochs to demonstrate the general trends. In Figure 8, the model numbers 1 and 4 always achieve high overall accuracy after epoch . Except for model number 2, all other models with Spatial-Attention EdgeConvs attain better outcomes than number 0. It demonstrate that the results between models with and without our attention modules have distinct gap after training about 200 epochs and the accuracy does improve by our attention mechanism. In Figure 9, we can see that the accuracy of the model with only Global-Attention Module is always higher than the other which also proves the effectiveness of our design.
Other Layers. In Table 6, we demonstrate the model performance in different situations. We consider factors including number of points, batch norm layers [36], dropout layers, activation layers, and aggregation functions. From the results in Table 6, we can see all these factors do improve performance except for increasing the number of input points. The phenomena that the increase in the number of points does not increase the performance has also been observed in other state-of-the-arts work [40].
Table 6.
Model | #Points | BN | DP | Act. | Max | Max&Avg. | Acc |
---|---|---|---|---|---|---|---|
A1 | 1 k | LR | √ | 90.8 | |||
A2 | 1 k | √ | LR | √ | 91.6 | ||
A3 | 1 k | √ | LR | √ | 93.2 | ||
A4 | 1 k | √ | √ | LR | √ | 93.6 | |
A5 | 1 k | √ | √ | R | √ | 92.8 | |
A6 | 2 k | √ | √ | LR | √ | 93.6 | |
A7 | 1 k | √ | √ | LR | √ | 93.3 |
Number of points k of a group. In Table 7, we also evaluate the effects of different number k of a neighboring point group. Experiments show that small k () achieves similar accuracy both with 1024 and 2048 input points. By contrast, the performance of large k () will decrease quickly when having 1024 input points probably because it is hard to find discriminative patterns in large groups. However, more input points (2048 points) will make point sets dense so that groups containing large number of neighboring points may cover regions of same volumes as in sparse point sets with less input points (1024 points), thus maintaining a high accuracy. Besides, despite fed on different input points, achieves best result in both models B2 and B6 in Table 7.
Table 7.
7. Conclusions
In this paper, we propose a novel model named AttPNet that combines a global point-wise attention mechanism and channel weighting to improve performance of point set analysis. AttPNet outperforms the best model in ModelNet40 classification benchmark by 0.7%, which is a significant improvement. In addition, AttPNet is robust to rotational perturbations and missing points. Further experiments also demonstrate that our model performs well on the classification of fine-grained point sets such as the ECT dataset. Furthermore, we provide the visualization of our attention masks on the objects in ModelNet40 and the results of part segmentation in ShapeNet (see Figure 10).
In the future, we will continue optimizing the AttPNet and apply it to other fields such as semantic segmentation. In addition, experiments indicate that there is still considerable potential for improvement in recognizing data with large-angle rotation. Therefore, we will keep working on the robustness of our model. Besides, in many data sets, points are always distributed in an unequal spatial distribution. We will attempt to adapt our model to such attributes in point set and make it focus more on dense areas to attain greater performance.
Author Contributions
Conceptualization, M.X., Y.Y. and Y.M.; methodology, Y.Y.; validation, Y.Y. and Y.M.; formal analysis, Y.Y.; visualization, Y.M. and Y.Y.; paper writing, Y.Y., Y.M., J.Z., X.G. and M.X.; supervision, M.X.; project administration, M.X. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by U.S. National Institutes of Health (NIH) grant P41GM103712, R01GM134020, and K01MH123896, U.S. National Science Foundation (NSF) grant DBI-1949629 and IIS-2007595. Mark Foundation for Cancer Research grant 19-044-ASP, King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/2602-01 and URF/1/3007-01.
Conflicts of Interest
The authors declare no conflict of interest.
References
- 1.Petti F.M., Avanzini M., Belvedere M., De Gasperi M., Ferretti P., Girardi S., Remondino F., Tomasoni R. Digital 3D modelling of dinosaur footprints by photogrammetry and laser scanning techniques: Integrated approach at the Coste dell’Anglone tracksite (Lower Jurassic, Southern Alps, Northern Italy) Acta Geol. 2008;83:303–315. [Google Scholar]
- 2.Ahmed E., Saint A., Shabayek A.E.R., Cherenkova K., Das R., Gusev G., Aouada D., Ottersten B. Deep learning advances on different 3D data representations: A survey. arXiv. 20181808.01462 [Google Scholar]
- 3.Qi C.R., Su H., Mo K., Guibas L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 652–660. [Google Scholar]
- 4.Qi C.R., Yi L., Su H., Guibas L.J. Advances in Neural Information Processing Systems. Curran Associates, Inc.; Long Beach, CA, USA: 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space; pp. 5099–5108. [Google Scholar]
- 5.Hu J., Shen L., Sun G. Squeeze-and-excitation networks; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- 6.Su H., Maji S., Kalogerakis E., Learned-Miller E. Multi-view convolutional neural networks for 3D shape recognition; Proceedings of the IEEE International Conference on Computer Vision; Santiago, Chile. 7–13 December 2015; pp. 945–953. [Google Scholar]
- 7.Qi C.R., Su H., Nießner M., Dai A., Yan M., Guibas L.J. Volumetric and multi-view cnns for object classification on 3D data; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016; pp. 5648–5656. [Google Scholar]
- 8.Lawin F.J., Danelljan M., Tosteberg P., Bhat G., Khan F.S., Felsberg M. International Conference on Computer Analysis of Images and Patterns. Springer; Berlin/Heidelberg, Germany: 2017. Deep projective 3D semantic segmentation; pp. 95–107. [Google Scholar]
- 9.Schneider S., Himmelsbach M., Luettel T., Wuensche H.J. Fusing vision and lidar-synchronization, correction and occlusion reasoning; Proceedings of the 2010 IEEE Intelligent Vehicles Symposium; San Diego, CA, USA. 21–24 June 2010; pp. 388–393. [Google Scholar]
- 10.Wu Z., Song S., Khosla A., Yu F., Zhang L., Tang X., Xiao J. 3D shapenets: A deep representation for volumetric shapes; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA. 7–12 June 2015; pp. 1912–1920. [Google Scholar]
- 11.Maturana D., Scherer S. Voxnet: A 3D convolutional neural network for real-time object recognition; Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); Hamburg, Germany. 28 September–2 October 2015; pp. 922–928. [Google Scholar]
- 12.Brock A., Lim T., Ritchie J.M., Weston N.J. Generative and Discriminative Voxel Modeling with Convolutional Neural Networks; Proceedings of the Neural Inofrmation Processing Conference: 3D Deep Learning; Barcelona, Spain. 5–10 December 2016. [Google Scholar]
- 13.Su H., Jampani V., Sun D., Maji S., Kalogerakis E., Yang M.H., Kautz J. Splatnet: Sparse lattice networks for point cloud processing; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–22 June 2018; pp. 2530–2539. [Google Scholar]
- 14.Jampani V., Kiefel M., Gehler P.V. Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016; pp. 4452–4461. [Google Scholar]
- 15.Riegler G., Osman Ulusoy A., Geiger A. Octnet: Learning deep 3D representations at high resolutions; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 3577–3586. [Google Scholar]
- 16.Wang Y., Sun Y., Liu Z., Sarma S.E., Bronstein M.M., Solomon J.M. Dynamic graph cnn for learning on point clouds. arXiv. 2018 doi: 10.1145/3326362.1801.07829 [DOI] [Google Scholar]
- 17.Shen Y., Feng C., Yang Y., Tian D. Mining Point Cloud Local Structures by Kernel Correlation and Graph Pooling; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–22 June 2018. [Google Scholar]
- 18.Xu M., Zhou Z., Qiao Y. Geometry Sharing Network for 3D Point Cloud Classification and Segmentation; Proceedings of the National Conference on Artificial Intelligence (AAAI); New York, NY, USA. 7–12 February 2020; pp. 12500–12507. [Google Scholar]
- 19.Wu W., Qi Z., Fuxin L. Pointconv: Deep convolutional networks on 3D point clouds; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA. 16–20 June 2019; pp. 9621–9630. [Google Scholar]
- 20.Xu Y., Fan T., Xu M., Zeng L., Qiao Y. Spidercnn: Deep learning on point sets with parameterized convolutional filters; Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany. 8–14 September 2018; pp. 87–102. [Google Scholar]
- 21.Groh F., Wieschollek P., Lensch H.P. Flex-Convolution; Proceedings of the Asian Conference on Computer Vision; Perth, Australia. 2–6 December 2018; pp. 105–122. [Google Scholar]
- 22.Wang L., Liu Y., Zhang S., Yan J., Tao P. Structure-Aware Convolution for 3D Point Cloud Classification and Segmentation. Remote Sens. 2020;12:634. doi: 10.3390/rs12040634. [DOI] [Google Scholar]
- 23.Liu X., Han Z., Liu Y.S., Zwicker M. Point2Sequence: Learning the shape representation of 3D point clouds with an attention-based sequence to sequence network; Proceedings of the AAAI Conference on Artificial Intelligence; Honolulu, HI, USA. 27 January–1 February 2019; pp. 8778–8785. [Google Scholar]
- 24.Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
- 25.Itti L. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. 1998;20:1254–1259. doi: 10.1109/34.730558. [DOI] [Google Scholar]
- 26.Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv. 20141409.0473 [Google Scholar]
- 27.Fu J., Zheng H., Mei T. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 4438–4446. [Google Scholar]
- 28.Wang F., Jiang M., Qian C., Yang S., Li C., Zhang H., Wang X., Tang X. Residual attention network for image classification; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 3156–3164. [Google Scholar]
- 29.Chen L.C., Yang Y., Wang J., Xu W., Yuille A.L. Attention to scale: Scale-aware semantic image segmentation; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016; pp. 3640–3649. [Google Scholar]
- 30.Xie S., Liu S., Chen Z., Tu Z. Attentional ShapeContextNet for Point Cloud Recognition; Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–22 June 2018; pp. 4606–4615. [DOI] [Google Scholar]
- 31.Yang J., Zhang Q., Ni B., Li L., Liu J., Zhou M., Tian Q. Modeling Point Clouds with Self-Attention and Gumbel Subset Sampling; Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA. 16–20 June 2019; pp. 3323–3332. [Google Scholar]
- 32.Li W., Wang F.D., Xia G.S. A geometry-attentional network for ALS point cloud classification. ISPRS J. Photogramm. Remote Sens. 2020;164:26–40. doi: 10.1016/j.isprsjprs.2020.03.016. [DOI] [Google Scholar]
- 33.Hu Z., Zhang D., Li S., Qin H. Attention-based relation and context modeling for point cloud semantic segmentation. Comput. Graph. 2020 doi: 10.1016/j.cag.2020.06.001. [DOI] [Google Scholar]
- 34.Wang L., Huang Y., Hou Y., Zhang S., Shan J. Graph Attention Convolution for Point Cloud Semantic Segmentation; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA. 16–20 June 2019; pp. 10296–10305. [Google Scholar]
- 35.Hastie T., Tibshirani R., Friedman J., Franklin J. The elements of statistical learning: Data mining, inference and prediction. Math. Intell. 2005;27:83–85. [Google Scholar]
- 36.Ioffe S., Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; Proceedings of the International Conference on Machine Learning; Lille, France. 6 July 2015; pp. 448–456. [Google Scholar]
- 37.Nair V., Hinton G.E. Rectified linear units improve restricted boltzmann machines; Proceedings of the 27th International Conference on Machine Learning (ICML-10); Haifa, Israel. 21–24 June 2010; pp. 807–814. [Google Scholar]
- 38.Li Y., Bu R., Sun M., Wu W., Di X., Chen B. Pointcnn: Convolution on x-transformed points; Proceedings of the Advances in Neural Information Processing Systems; Montréal, QC, Canada. 3–8 December 2018; pp. 820–830. [Google Scholar]
- 39.Atzmon M., Maron H., Lipman Y. Point convolutional neural networks by extension operators. arXiv. 2018 doi: 10.1145/3197517.3201301.1803.10091 [DOI] [Google Scholar]
- 40.Liu Y., Fan B., Xiang S., Pan C. Relation-Shape Convolutional Neural Network for Point Cloud Analysis; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA. 16–20 June 2019; pp. 8895–8904. [Google Scholar]
- 41.Li J., Chen B.M., Hee Lee G. So-net: Self-organizing network for point cloud analysis; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–22 June 2018; pp. 9397–9406. [Google Scholar]
- 42.Loshchilov I., Hutter F. Sgdr: Stochastic gradient descent with warm restarts. arXiv. 20161608.03983 [Google Scholar]
- 43.Noble A.J., Dandey V.P., Wei H., Brasch J., Chase J., Acharya P., Tan Y.Z., Zhang Z., Kim L.Y., Scapin G., et al. Routine single particle CryoEM sample and grid characterization by tomography. Elife. 2018;7:e34257. doi: 10.7554/eLife.34257. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Yi L., Su H., Guo X., Guibas L. SyncSpecCNN: Synchronized Spectral CNN for 3D Shape Segmentation; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 9397–9406. [Google Scholar]