Abstract
Graph convolutional networks (GCNs) have shown promising results in processing graph data by extracting structure-aware features. This gave rise to extensive work in geometric deep learning, focusing on designing network architectures that ensure neuron activations conform to regularity patterns within the input graph. However, in most cases the graph structure is only accounted for by considering the similarity of activations between adjacent nodes, which limits the capabilities of such methods to discriminate between nodes in a graph. Here, we propose to augment conventional GCNs with geometric scattering transforms and residual convolutions. The former enables band-pass filtering of graph signals, thus alleviating the so-called oversmoothing often encountered in GCNs, while the latter is introduced to clear the resulting features of high-frequency noise. We establish the advantages of the presented Scattering GCN both with theoretical results demonstrating the complementary benefits of scattering and GCN features, and with experimental results showing the benefits of our method compared to leading graph neural networks for semi-supervised node classification, including the recently proposed GAT network that typically alleviates oversmoothing using graph attention mechanisms.
1. Introduction
Deep learning approaches are at the forefront of modern machine learning. While they are effective in a multitude of applications, their most impressive results are typically achieved when processing data with inherent structure that can be used to inform the network architecture or the neuron connectivity design. For example, image processing tasks gave rise to convolutional neural networks that rely on spatial organization of pixels, while time series analysis gave rise to recurrent neural networks that leverage temporal organization in their information processing via feedback loops and memory mechanisms. The success of neural networks in such applications, traditionally associated with signal processing, has motivated the emergence of geometric deep learning, with the goal of generalizing the design of structure-aware network architectures from Euclidean spatiotemporal structures to a wide range of non-Euclidean geometries that often underlie modern data.
Geometric deep learning approaches typically use graphs as a model for data geometries, either by constructing them from input data (e.g., via similarity kernels) or directly given as quantified interactions between data points [1]. Using such models, recent works have shown that graph neural networks (GNNs) perform well in multiple application fields, including biology, chemistry and social networks [2–4]. It should be noted that most GNNs consider each graph together with given node features, as a generalization of images or audio signals, and thus aim to compute whole-graph representations. These, in turn, can be applied to graph classification, for example, when each graph represents the molecular structure of proteins or enzymes classified by their chemical properties [5–7].
On the other hand, methods such as graph convolutional networks (GCNs) presented by [4] consider node-level tasks and in particular node classification. As explained in [4], such tasks are often considered in the context of semi-supervised learning, as typically only a small portion of nodes of the graph possesses labels. In these settings, the entire dataset is considered as one graph and the network is tasked with learning node representations that infer information from node features as well as the graph structure. However, most state-of-the-art approaches for incorporating graph structure information in neural network operations aim to enforce similarity between representations of adjacent (or neighboring) nodes, which essentially implements local smoothing of neuron activations over the graph [8]. While such smoothing operations may be sufficiently effective in whole-graph settings, they often cause degradation of results in node processing tasks due to oversmoothing [8, 9], as nodes become indistinguishable with deeper and increasingly complex network architectures. Graph attention networks [10] have shown promising results in overcoming such limitations by introducing adaptive weights for graph smoothing via message passing operations, using attention mechanisms computed from node features and masked by graph edges. However, these networks still essentially rely on enforcing similarity (albeit adaptive) between neighboring nodes, while also requiring more intricate training as their attention mechanism requires gradient computations driven not only by graph nodes, but also by graph edges. We refer the reader to the supplement for further discussion of related work and recent advances in node processing with GNNs.
In this paper, we propose a new approach for node-level processing in GNNs by introducing neural pathways that encode higher-order forms of regularity in graphs. Our construction is inspired by recently proposed geometric scattering networks [11–13], which have proven effective for whole-graph representation and classification. These networks generalize the Euclidean scattering transform, which was originally presented by [14] as a mathematical model for convolutional neural networks. In graph settings, the scattering construction leverages deep cascades of graph wavelets [15, 16] and pointwise nonlinearities to capture multiple modes of variation from node features or labels. Using the terminology of graph signal processing, these can be considered as generalized band-pass filtering operations, while GCNs (and many other GNNs) can be considered as relying on low-pass filters only. Our approach combines together the merits of GCNs on node-level tasks with those of scattering networks known from whole-graph tasks, by enabling learned node-level features to encode geometric information beyond smoothed activation signals, thus alleviating oversmoothing concerns often raised in GCN approaches. We discuss the benefits of our approach and demonstrate its advantages over GCNs and other popular graph processing approaches for semi-supervised node classification, including significant improvements on the DBLP graph dataset from [17].
Notations:
We denote matrices and vectors with bold letters, with uppercase letters representing matrices and lowercase letters representing vectors. In particular, $\mathbf{I}_n$ is used for the $n \times n$ identity matrix and $\mathbf{1}_n$ denotes the vector with ones in every component. We write $\langle \cdot, \cdot \rangle$ for the standard scalar product in $\mathbb{R}^n$. We will interchangeably consider functions of graph nodes as vectors indexed by the nodes, implicitly assuming a correspondence between a node and a specific index. This carries over to matrices, where we relate nodes to column or row indices. We further use the abbreviation $[n] := \{1, \dots, n\}$ for $n \in \mathbb{N}$.
2. Graph Signal Processing
Let $G = (V, E, w)$ be a weighted graph with $V := \{v_1, \dots, v_n\}$ the set of nodes, $E$ the set of (undirected) edges and $w : E \to (0, \infty)$ assigning (positive) edge weights to the graph edges. We note that $w$ can equivalently be considered as a function of $V \times V$, where we set the weights of non-adjacent node pairs to zero. We define a graph signal as a function $x : V \to \mathbb{R}$ on the nodes of $G$ and aggregate them in a signal vector $\mathbf{x} \in \mathbb{R}^n$ with the $i$-th entry being $x(v_i)$.
We define the (combinatorial) graph Laplacian matrix $\mathbf{L} := \mathbf{D} - \mathbf{W}$, where $\mathbf{W}$ is the weighted adjacency matrix of the graph given by

$$\mathbf{W}[i, j] := \begin{cases} w(v_i, v_j) & \text{if } \{v_i, v_j\} \in E, \\ 0 & \text{otherwise,} \end{cases}$$
and $\mathbf{D}$ is the degree matrix of $\mathbf{W}$ defined by $\mathbf{D} := \operatorname{diag}(d_1, \dots, d_n)$ with $d_i := \deg(v_i) := \sum_{j=1}^n \mathbf{W}[i, j]$ being the degree of the node $v_i$. In practice, we work with the (symmetric) normalized Laplacian matrix $\boldsymbol{\mathcal{L}} := \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2} = \mathbf{I}_n - \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}$. It can be verified that $\boldsymbol{\mathcal{L}}$ is symmetric and positive semi-definite and can thus be orthogonally diagonalized as $\boldsymbol{\mathcal{L}} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^T$, where $\boldsymbol{\Lambda}$ is a diagonal matrix with the eigenvalues $\lambda_1, \dots, \lambda_n$ on the main diagonal and $\mathbf{Q}$ is an orthogonal matrix containing the corresponding normalized eigenvectors $\mathbf{q}_1, \dots, \mathbf{q}_n$ as its columns.
A detailed study (see, e.g., [18]) of the eigenvalues reveals that $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n \le 2$. We can interpret the $\lambda_i$ as the frequency magnitudes and the $\mathbf{q}_i$ as the corresponding Fourier modes. We accordingly define the Fourier transform of a signal vector $\mathbf{x}$ by $\hat{x}(\lambda_i) := \langle \mathbf{x}, \mathbf{q}_i \rangle$ for $i \in [n]$. The corresponding inverse Fourier transform is given by $\mathbf{x} = \sum_{i=1}^n \hat{x}(\lambda_i)\, \mathbf{q}_i$. Note that this can be written compactly as $\hat{\mathbf{x}} = \mathbf{Q}^T \mathbf{x}$ and $\mathbf{x} = \mathbf{Q} \hat{\mathbf{x}}$. Finally, we introduce the concept of graph convolutions. We consider a filter $g : V \to \mathbb{R}$ defined on the set of nodes and want to convolve the corresponding filter vector $\mathbf{g}$ with a signal vector $\mathbf{x}$, i.e. compute $\mathbf{g} \star \mathbf{x}$. To explicitly compute this convolution, we recall that in the Euclidean setting, the Fourier transform of the convolution of two signals equals the product of their Fourier transforms. This property generalizes to graphs [19] in the sense that $\widehat{(\mathbf{g} \star \mathbf{x})}(\lambda_i) = \hat{g}(\lambda_i)\, \hat{x}(\lambda_i)$ for $i \in [n]$. Applying the inverse Fourier transform yields

$$\mathbf{g} \star \mathbf{x} = \sum_{i=1}^n \hat{g}(\lambda_i)\, \hat{x}(\lambda_i)\, \mathbf{q}_i = \mathbf{Q} \hat{\mathbf{G}} \mathbf{Q}^T \mathbf{x},$$

where $\hat{\mathbf{G}} := \operatorname{diag}\big(\hat{g}(\lambda_1), \dots, \hat{g}(\lambda_n)\big)$. Hence, convolutional graph filters can be parameterized by considering the Fourier coefficients in $\hat{\mathbf{G}}$.
Furthermore, it can be verified [20] that when these coefficients are defined as polynomials $\hat{g}(\lambda_i) = \sum_{k=0}^K \theta_k \lambda_i^k$ of the Laplacian eigenvalues (i.e., $\hat{\mathbf{G}} = \sum_{k=0}^K \theta_k \boldsymbol{\Lambda}^k$), the resulting filter convolutions are localized in space and can be written in terms of $\boldsymbol{\mathcal{L}}$ as $\mathbf{g} \star \mathbf{x} = \sum_{k=0}^K \theta_k \boldsymbol{\mathcal{L}}^k \mathbf{x}$, without requiring spectral decomposition of the normalized Laplacian. This motivates the standard practice [4, 20–22] of using filters that have polynomial forms, which we follow here as well.
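To make the polynomial-filter identity above concrete, the following minimal numpy sketch (our illustration, not code from the paper) applies a filter of the form $\sum_k \theta_k \boldsymbol{\mathcal{L}}^k$ by iterated matrix–vector products, avoiding any eigendecomposition; the toy graph, signal, and coefficients are illustrative placeholders.

```python
import numpy as np

def normalized_laplacian(W):
    """Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}."""
    d_inv_sqrt = W.sum(axis=1) ** -0.5
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def polynomial_filter(W, x, coeffs):
    """Apply sum_k coeffs[k] * L^k to the signal x without diagonalizing L."""
    L = normalized_laplacian(W)
    out = np.zeros(len(x))
    Lk_x = np.asarray(x, dtype=float)     # holds L^k x, starting with k = 0
    for c in coeffs:
        out += c * Lk_x
        Lk_x = L @ Lk_x                   # advance to the next power of L
    return out

# Toy example: path graph on 4 nodes, filter theta_0 I + theta_1 L.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
print(polynomial_filter(W, [1.0, 0.0, 0.0, 0.0], coeffs=[0.5, -0.5]))
```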
For completeness, we note there exist alternative frameworks that generalize signal processing notions to graph domains, such as [23], which emphasizes the construction of complex-valued filters requiring a notion of signal phase on graphs. However, extensive study of such alternatives is out of scope for the current work, which thus relies on the well-established (see, e.g., [24]) framework described here.
3. Graph Convolutional Network
Graph convolutional networks (GCNs), introduced in [4], consider semi-supervised settings where only a small portion of the nodes is labeled. They leverage intrinsic geometric information encoded in the adjacency matrix together with node labels by constructing a convolutional filter parametrized by $\hat{g}(\lambda_i) := \theta (2 - \lambda_i)$, where the choice of a single learnable parameter $\theta$ is made to avoid overfitting. This parametrization yields a convolutional filtering operation given by
$$\mathbf{g}_\theta \star \mathbf{x} = \theta \big( \mathbf{I}_n + \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2} \big)\, \mathbf{x}. \tag{1}$$
The matrix $\mathbf{I}_n + \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}$ has eigenvalues in $[0, 2]$, which could lead to vanishing or exploding gradients when such filters are stacked. This issue is addressed by the following renormalization trick [4]: $\mathbf{I}_n + \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2} \;\to\; \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{W}} \tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{W}} := \mathbf{W} + \mathbf{I}_n$ and $\tilde{\mathbf{D}}$ is a diagonal matrix with $\tilde{\mathbf{D}}[i, i] := \sum_{j=1}^n \tilde{\mathbf{W}}[i, j]$ for $i \in [n]$. This operation replaces the features of each node by a weighted average of its own features and those of its neighbors. Note that the repeated execution of graph convolutions enforces similarity throughout higher-order neighborhoods, with order equal to the number of stacked layers. Setting $\mathbf{A} := \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{W}} \tilde{\mathbf{D}}^{-1/2}$, the complete layer-wise propagation rule takes the form $\mathbf{h}^{(\ell+1)}_k = \sigma\big( \sum_{j=1}^{n_\ell} \theta^{(\ell)}_{j,k}\, \mathbf{A}\, \mathbf{h}^{(\ell)}_j \big)$, where $\ell$ indicates the layer with $n_\ell$ neurons, $\mathbf{h}^{(\ell+1)}_k$ the activation vector of the $k$-th neuron, $\theta^{(\ell)}_{j,k}$ the learned parameter of the convolution with the incoming activation vector $\mathbf{h}^{(\ell)}_j$ from the preceding layer, and $\sigma$ an element-wise applied activation function. Written in matrix notation, this gives
$$\mathbf{H}^{(\ell+1)} = \sigma\big( \mathbf{A}\, \mathbf{H}^{(\ell)}\, \boldsymbol{\Theta}^{(\ell)} \big), \tag{2}$$
where $\boldsymbol{\Theta}^{(\ell)} := \big[\theta^{(\ell)}_{j,k}\big]_{j,k}$ is the weight matrix of the $\ell$-th layer and $\mathbf{H}^{(\ell+1)}$ contains the activations output by that layer.
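As a minimal sketch of Eq. 2 (a numpy illustration with a toy graph; not the authors' implementation), the renormalized propagation matrix and a single GCN layer can be written as follows.

```python
import numpy as np

def gcn_propagation_matrix(W):
    """Renormalization trick: A = D~^{-1/2} W~ D~^{-1/2} with W~ = W + I."""
    W_tilde = W + np.eye(W.shape[0])
    d_inv_sqrt = W_tilde.sum(axis=1) ** -0.5
    return d_inv_sqrt[:, None] * W_tilde * d_inv_sqrt[None, :]

def gcn_layer(A, H, Theta, sigma=lambda z: np.maximum(z, 0.0)):
    """One GCN layer (Eq. 2): H_out = sigma(A H Theta)."""
    return sigma(A @ H @ Theta)

# Toy usage: 4 nodes, 3 input features, 2 hidden units.
rng = np.random.default_rng(0)
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
H0, Theta0 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
H1 = gcn_layer(gcn_propagation_matrix(W), H0, Theta0)
print(H1.shape)  # (4, 2)
```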
We remark that the GCN model explained above can be interpreted as a low-pass operation. For the sake of simplicity, let us consider the convolutional operation (Eq. 1) before the renormalization trick. Writing the convolution as the summation $\mathbf{g}_\theta \star \mathbf{x} = \theta \sum_{i=1}^n (2 - \lambda_i)\, \hat{x}(\lambda_i)\, \mathbf{q}_i$, we clearly see that higher weights are put on the low-frequency harmonics, while high-frequency harmonics are progressively suppressed as $\lambda_i$ approaches $2$. This indicates that the model can only access a diminishing portion of the original information contained in the input signal as more graph convolutions are stacked. This observation is in line with the well-known oversmoothing problem [8] related to GCN models: the repeated application of graph convolutions successively smooths the signals of the graph such that nodes cannot be distinguished anymore.
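The oversmoothing effect described above can be checked numerically; the following self-contained sketch (our illustration, on an arbitrary 4-cycle) applies increasing powers of the renormalized propagation matrix to a signal and tracks how the spread of node values collapses.

```python
import numpy as np

def gcn_propagation_matrix(W):
    """Renormalized propagation matrix A = D~^{-1/2} (W + I) D~^{-1/2}."""
    W_tilde = W + np.eye(W.shape[0])
    d_inv_sqrt = W_tilde.sum(axis=1) ** -0.5
    return d_inv_sqrt[:, None] * W_tilde * d_inv_sqrt[None, :]

# 4-cycle graph and an arbitrary node signal.
W = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
A = gcn_propagation_matrix(W)
x = np.array([1.0, -1.0, 2.0, 0.0])

for k in [1, 2, 4, 8, 16]:
    xk = np.linalg.matrix_power(A, k) @ x
    # As k grows, all node values approach the same constant:
    print(k, np.round(xk, 4), "spread:", round(float(xk.max() - xk.min()), 6))
```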
4. Geometric Scattering
In this section, we recall the construction of geometric scattering on graphs. This construction is based on the lazy random walk matrix

$$\mathbf{P} := \tfrac{1}{2}\big( \mathbf{I}_n + \mathbf{W} \mathbf{D}^{-1} \big),$$

which is closely related to the graph random walk defined as a Markov process with transition matrix $\mathbf{W} \mathbf{D}^{-1}$. The matrix $\mathbf{P}$, however, allows self-loops while normalizing by a factor of two in order to retain a Markov process. Therefore, considering a distribution $\boldsymbol{\mu}_0$ of the initial position of the lazy random walk, its positional distribution after $t$ steps is encoded by $\boldsymbol{\mu}_t = \mathbf{P}^t \boldsymbol{\mu}_0$.
As discussed in [12], the propagation of a graph signal vector $\mathbf{x}$ by $\mathbf{P}^t$ performs a low-pass operation that preserves the zero-frequencies of the signal while suppressing high frequencies. In geometric scattering, this low-pass information is augmented by introducing the wavelet matrices of scale $2^k$,

$$\boldsymbol{\Psi}_0 := \mathbf{I}_n - \mathbf{P}, \qquad \boldsymbol{\Psi}_k := \mathbf{P}^{2^{k-1}} - \mathbf{P}^{2^k}, \quad k \ge 1. \tag{3}$$
This leverages the fact that high frequencies can be recovered with multiscale wavelet transforms, e.g., by decomposing nonzero frequencies into dyadic frequency bands. The operation $\boldsymbol{\Psi}_k \mathbf{x}$ collects signals from a neighborhood of order $2^k$, but extracts multiscale differences rather than averaging over them. The wavelets in Eq. 3 can be organized in a filter bank $\mathcal{W}_J := \{\boldsymbol{\Psi}_0, \dots, \boldsymbol{\Psi}_J, \boldsymbol{\Phi}_J\}$, where $\boldsymbol{\Phi}_J := \mathbf{P}^{2^J}$ is a pure low-pass filter. The telescoping sum of the matrices in this filter bank constitutes the identity matrix, $\sum_{k=0}^J \boldsymbol{\Psi}_k + \boldsymbol{\Phi}_J = \mathbf{I}_n$, thus enabling the reconstruction of processed signals from their filter responses. Further studies of this construction and its properties (e.g., energy preservation) appear in [25] and related work.
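A minimal numpy sketch of the lazy random walk and the dyadic wavelet filter bank in Eq. 3 is given below (our illustration, with an arbitrary small graph; not the released code); the final check verifies the telescoping-sum property.

```python
import numpy as np

def lazy_random_walk(W):
    """Lazy random walk matrix P = 1/2 (I + W D^{-1})."""
    d_inv = 1.0 / W.sum(axis=0)
    return 0.5 * (np.eye(W.shape[0]) + W * d_inv[None, :])

def wavelet_filter_bank(W, J):
    """Filter bank {Psi_0, ..., Psi_J, P^{2^J}} from Eq. 3, with Psi_0 = I - P
    and Psi_k = P^{2^{k-1}} - P^{2^k} computed by repeated squaring."""
    P = lazy_random_walk(W)
    wavelets = [np.eye(W.shape[0]) - P]        # Psi_0
    P_pow = P                                  # holds P^{2^{k-1}}
    for _ in range(1, J + 1):
        P_next = P_pow @ P_pow                 # P^{2^k}
        wavelets.append(P_pow - P_next)        # Psi_k
        P_pow = P_next
    return wavelets, P_pow                     # band-pass wavelets and low-pass P^{2^J}

# Sanity check on a 4-cycle: the filter bank telescopes to the identity.
W = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
wavelets, low_pass = wavelet_filter_bank(W, J=2)
print(np.allclose(sum(wavelets) + low_pass, np.eye(4)))  # True
```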
Geometric scattering was originally introduced in the context of whole-graph classification and consisted of aggregating scattering features. These are stacked wavelet transforms (see Fig. 1), parameterized via tuples $p := (k_1, \dots, k_m)$ containing the bandwidth scale parameters, which are separated by element-wise absolute value nonlinearities² according to
$$\mathbf{U}_p \mathbf{x} := \boldsymbol{\Psi}_{k_m} \big| \boldsymbol{\Psi}_{k_{m-1}} \cdots \big| \boldsymbol{\Psi}_{k_2} \big| \boldsymbol{\Psi}_{k_1} \mathbf{x} \big| \big| \cdots \big|, \tag{4}$$
where $m$ corresponds to the length of the tuple $p$. The scattering features are aggregated over the whole graph by taking $q$-th order moments over the set of nodes,
$$\mathbf{S}_{p,q}\, \mathbf{x} := \sum_{i=1}^n \big| \mathbf{U}_p \mathbf{x}[v_i] \big|^q. \tag{5}$$
Figure 1:
Illustration of geometric scattering at the node level ($\mathbf{U}_p \mathbf{x}$) and at the graph level ($\mathbf{S}_{p,q} \mathbf{x}$), extracted according to the wavelet cascade in Eqs. 3–5. While only the first few scattering orders are illustrated here, more can be used in general.
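The cascade in Eq. 4 and the aggregation in Eq. 5 can be sketched as follows (our numpy illustration, building on the wavelet_filter_bank helper sketched above; following footnote 2, no absolute value is applied after the last wavelet).

```python
import numpy as np

def scattering_transform(wavelets, x, path):
    """U_p x = Psi_{k_m} | Psi_{k_{m-1}} ... | Psi_{k_1} x | ... |  (Eq. 4)
    for a tuple of scale indices path = (k_1, ..., k_m)."""
    out = np.asarray(x, dtype=float)
    for i, k in enumerate(path):
        out = wavelets[k] @ out
        if i < len(path) - 1:          # nonlinearity between wavelets only
            out = np.abs(out)
    return out

def graph_level_scattering(wavelets, x, path, qs=(1, 2, 3, 4)):
    """Whole-graph features S_{p,q} x: q-th absolute moments over the nodes (Eq. 5)."""
    u = scattering_transform(wavelets, x, path)
    return np.array([np.sum(np.abs(u) ** q) for q in qs])
```

For the node-level use introduced next, the aggregation step is simply dropped, and `scattering_transform` can be applied directly to a node-feature matrix (it acts column-wise).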
As our work is devoted to the study of node-based classification, we adapt this approach to a new context, keeping the scattering transforms at the node level by omitting the aggregation step in Eq. 5. For each tuple $p = (k_1, \dots, k_m)$, we define the following scattering propagation rule, which mirrors the GCN rule (Eq. 2) but replaces the low-pass filter by the geometric scattering operation $\mathbf{U}_p$, resulting in
$$\mathbf{H}^{(\ell+1)}_p := \sigma\big( \mathbf{U}_p\, \mathbf{H}^{(\ell)}\, \boldsymbol{\Theta}^{(\ell)} \big). \tag{6}$$
We note that in practice only a subset of tuples $p$ is used, selected as part of the network design explained in the following section.
5. Combining GCN and Scattering Models
To combine the benefits of GCN models and geometric scattering adapted to the node level, we now propose a hybrid network architecture as shown in Fig. 2. It combines low-pass operations based on GCN models with band-pass operations based on geometric scattering. To define the layer-wise propagation rule, we introduce
$$\mathbf{H}^{(\ell+1)}_{\mathrm{gcn}} := \big[ \mathbf{H}^{(\ell+1)}_{\mathrm{gcn},1} \,\big\|\, \cdots \,\big\|\, \mathbf{H}^{(\ell+1)}_{\mathrm{gcn},C_{\mathrm{gcn}}} \big] \quad \text{and} \quad \mathbf{H}^{(\ell+1)}_{\mathrm{sct}} := \big[ \mathbf{H}^{(\ell+1)}_{\mathrm{sct},1} \,\big\|\, \cdots \,\big\|\, \mathbf{H}^{(\ell+1)}_{\mathrm{sct},C_{\mathrm{sct}}} \big],$$

which are the concatenations of $C_{\mathrm{gcn}}$ GCN channels and $C_{\mathrm{sct}}$ scattering channels, respectively. Every $\mathbf{H}^{(\ell+1)}_{\mathrm{gcn},r}$ is defined according to Eq. 2 with the slight modification of added biases and powers of $\mathbf{A}$,

$$\mathbf{H}^{(\ell+1)}_{\mathrm{gcn},r} := \sigma\big( \mathbf{A}^r\, \mathbf{H}^{(\ell)}\, \boldsymbol{\Theta}^{(\ell)}_{\mathrm{gcn},r} + \mathbf{B}^{(\ell)}_{\mathrm{gcn},r} \big).$$
Figure 2:
(a,b) Comparison between GCN and our network: we add band-pass channels to collect different frequency components; (c) Graph residual convolution layer; (d) Band-pass layers; (e) Schematic depiction in the frequency domain.
Note that every GCN filter uses a different power $\mathbf{A}^r$ of the propagation matrix and therefore aggregates information from $r$-step neighborhoods. Similarly, we proceed with the scattering channels according to Eq. 6 and calculate

$$\mathbf{H}^{(\ell+1)}_{\mathrm{sct},p} := \sigma\big( \mathbf{U}_p\, \mathbf{H}^{(\ell)}\, \boldsymbol{\Theta}^{(\ell)}_{\mathrm{sct},p} + \mathbf{B}^{(\ell)}_{\mathrm{sct},p} \big),$$

where the choice of the tuples $p$ enables scatterings of different orders and scales. Finally, the GCN components and scattering components are concatenated to give
$$\mathbf{H}^{(\ell+1)} := \big[ \mathbf{H}^{(\ell+1)}_{\mathrm{gcn}} \,\big\|\, \mathbf{H}^{(\ell+1)}_{\mathrm{sct}} \big]. \tag{7}$$
The learned parameters are the weight matrices $\boldsymbol{\Theta}^{(\ell)}_{\mathrm{gcn},r}$ and $\boldsymbol{\Theta}^{(\ell)}_{\mathrm{sct},p}$ coming from the convolutional and scattering layers. These are complemented by the bias vectors $\mathbf{b}^{(\ell)}_{\mathrm{gcn},r}$ and $\mathbf{b}^{(\ell)}_{\mathrm{sct},p}$, which are transposed and vertically concatenated $n$ times to form the matrices $\mathbf{B}^{(\ell)}_{\mathrm{gcn},r}$ and $\mathbf{B}^{(\ell)}_{\mathrm{sct},p}$. To simplify notation, we assume here that all channels use the same number of neurons. Waiving this assumption would slightly complicate the notation but works perfectly fine in practice.
In this work, for simplicity, and because it is sufficient to establish our claim, we limit our architecture to three GCN channels and two scattering channels as illustrated in Fig. 2 (b). Inspired by the aggregation step in classical geometric scattering, we use $\sigma(\cdot) := |\cdot|^q$ as our nonlinearity. However, unlike the powers in Eq. 5, the power $q$ is applied at the node level here instead of being aggregated as moments over the entire graph, thus retaining the distinction between node-wise activations.
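Putting Eqs. 2, 6 and 7 together, a single hybrid layer can be sketched as below (our numpy illustration; the channel counts, the exponent q, and the callable form of the scattering operators are design assumptions consistent with the description above, not the released implementation).

```python
import numpy as np

def scattering_gcn_layer(A, scattering_ops, H, Thetas_gcn, Thetas_sct,
                         biases_gcn, biases_sct, gcn_orders=(1, 2, 3), q=4):
    """Hybrid layer: concatenate low-pass GCN channels sigma(A^r H Theta + b) with
    band-pass scattering channels sigma(U_p H Theta + b), using sigma = |.|^q."""
    channels = []
    for r, Theta, b in zip(gcn_orders, Thetas_gcn, biases_gcn):
        Z = np.linalg.matrix_power(A, r) @ H @ Theta + b
        channels.append(np.abs(Z) ** q)
    for U_p, Theta, b in zip(scattering_ops, Thetas_sct, biases_sct):
        Z = U_p(H) @ Theta + b        # U_p: callable applying a scattering cascade
        channels.append(np.abs(Z) ** q)
    return np.concatenate(channels, axis=1)   # Eq. 7: column-wise concatenation
```

Here `scattering_ops` could contain, e.g., `lambda H: scattering_transform(wavelets, H, (1,))` and `lambda H: scattering_transform(wavelets, H, (2,))`, built from the sketches in Sec. 4; the specific tuples are a hypothetical choice for illustration.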
We set the input of the first layer, $\mathbf{H}^{(0)}$, to the original node features, treated as graph signals. Each subchannel (GCN or scattering) transforms the original feature space to a new hidden space, with the dimension determined by the number of neurons encoded in the columns of the corresponding submatrix of the concatenated output in Eq. 7. These transformations are learned by the network via the weights and biases. Larger submatrices (i.e., with more columns, since the number of nodes in the graph is fixed) indicate that the corresponding weight matrices have more parameters to learn. Thus, the information in these channels can be propagated well and will be sufficiently represented.
In general, the width of a channel reflects the importance attributed to the regularities it captures. A wider channel suggests that these frequency components are more critical and need to be learned in sufficient detail, while reducing the width of a channel suppresses the amount of information that can be learned from a particular frequency window. For more details and analysis of specific design choices in our architecture we refer the reader to the ablation study provided in the supplement.
6. Graph Residual Convolution
Using the combination of GCN and scattering architectures, we collect multiscale information at the node level. This information is aggregated from different localized neighborhoods, which may exhibit vastly different frequency spectra. This arises, for example, from varying label rates in different graph substructures. In particular, very sparse graph sections can cause problems when the scattering features actually learn the difference between labeled and unlabeled nodes, creating high-frequency noise. In the classical geometric scattering used for whole-graph representation, geometric moments were used to aggregate the node-based information, serving at the same time as a low-pass filter. As we want to keep the information localized at the node level, we choose a different approach inspired by skip connections in residual neural networks [26]. Conceptually, this low-pass filter, which we call graph residual convolution, reduces the captured frequency spectrum up to a cutoff frequency as depicted in Fig. 2 (e).
The graph residual convolution matrix, governed by the hyperparameter $\alpha$, is given by $\mathbf{R}_\alpha := \frac{1}{\alpha + 1}\big( \mathbf{I}_n + \alpha\, \mathbf{W} \mathbf{D}^{-1} \big)$, and we apply it after the hybrid layer of GCN and scattering filters. For $\alpha = 0$ we get the identity (no cutoff), while $\alpha \to \infty$ results in $\mathbf{W} \mathbf{D}^{-1}$. This can be interpreted as an interpolation between the completely lazy (i.e., stationary) random walk $\mathbf{I}_n$ and the non-resting (i.e., with no self-loops) random walk $\mathbf{W} \mathbf{D}^{-1}$. We apply the graph residual layer to the output of the Scattering GCN layer (Eq. 7). The update rule for this step, illustrated in Fig. 2 (c), is then expressed by $\mathbf{X}^{(\mathrm{out})} := \mathbf{R}_\alpha\, \mathbf{H}^{(\ell+1)}\, \boldsymbol{\Theta}_{\mathrm{res}} + \mathbf{B}_{\mathrm{res}}$, where $\boldsymbol{\Theta}_{\mathrm{res}}$ are learned weights, $\mathbf{B}_{\mathrm{res}}$ are learned biases (similar to the notations used previously), and the number of rows of $\boldsymbol{\Theta}_{\mathrm{res}}$ equals the number of features of the concatenated layer $\mathbf{H}^{(\ell+1)}$. If this is the final layer, we choose the output dimension equal to the number of classes.
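A sketch of the graph residual convolution follows (our numpy illustration; the exact form of $\mathbf{R}_\alpha$ is reconstructed from the interpolation described above, and the absence of a nonlinearity in this step is an assumption).

```python
import numpy as np

def graph_residual_matrix(W, alpha):
    """R_alpha = (I + alpha * W D^{-1}) / (alpha + 1): alpha = 0 gives the identity
    (no cutoff); large alpha approaches the non-resting random walk W D^{-1}."""
    d_inv = 1.0 / W.sum(axis=0)
    return (np.eye(W.shape[0]) + alpha * (W * d_inv[None, :])) / (alpha + 1.0)

def graph_residual_layer(W, H, Theta, bias, alpha=0.5):
    """Low-pass cleanup applied to the hybrid layer output (Eq. 7)."""
    return graph_residual_matrix(W, alpha) @ H @ Theta + bias
```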
7. Additional Information Introduced by Node-level Scattering Features
Before empirically verifying the viability of the proposed architecture in node classification tasks, we first discuss and demonstrate the additional information provided by scattering channels beyond that provided by traditional GCN channels. We first consider information carried by node features, treated as graph signals, and in particular their regularity over the graph. As discussed in Sec. 3, such regularity is traditionally considered only via smoothness of signals over the graph, as only low frequencies are retained by (local) smoothing operations. Band-pass filtering, on the other hand, can retain other forms of regularity such as periodic or harmonic patterns. The following lemma demonstrates this difference between GCN and scattering channels.
Lemma 1.
Consider a cyclic graph $G$ on $2n$ nodes and let $\mathbf{x}$ be a 2-periodic signal on it (i.e., $\mathbf{x}[v_{2i-1}] = a$ and $\mathbf{x}[v_{2i}] = b$ for $i \in [n]$, for some $a \neq b$). Then, for any choice of filter parameters, the GCN filtering from Eq. 1 yields a constant signal, while the scattering filter from Eq. 3 still produces a 2-periodic signal. Further, this result extends to any finite linear cascade of such filters (i.e., $m$ consecutive applications of the GCN filter or of the scattering filter, respectively).
While this is only a simple example, it already indicates a fundamental difference between the regularity patterns considered in graph convolutions compared to our approach. Indeed, it implies that if a smoothing convolutional filter encounters alternating signals on isolated cyclic substructures within a graph, their node features become indistinguishable, while scattering channels (with appropriate scales, weights and bias terms) will be able to make this distinction. Moreover, this difference can be generalized further beyond cyclic structures to consider features encoding two-coloring information on constant-degree bipartite graphs, as shown in the following lemma. We refer the reader to the supplement for a proof of this lemma, which also covers the previous one as a particular case, as well as numerical examples illustrating the results in these two lemmas.
Lemma 2.
Consider a bipartite graph $G$ on $n$ nodes with constant node degree $d$. Let $\mathbf{x}$ be a 2-coloring signal (i.e., with one part assigned the constant value $a$ and the other $b$, for some $a \neq b$). Then, for any choice of filter parameters, the GCN filtering from Eq. 1 yields a constant signal, while the scattering filter from Eq. 3 still produces a (non-constant) 2-coloring of the graph. Further, this result extends to any finite linear cascade of such filters (i.e., $m$ consecutive applications of the GCN filter or of the scattering filter, respectively).
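The cyclic-graph case of Lemma 1 can be checked numerically with the short sketch below (our illustration, not the supplement's examples; we use the first wavelet $\boldsymbol{\Psi}_0 = \mathbf{I}_n - \mathbf{P}$ from the filter bank in Eq. 3, $\theta = 1$, and arbitrary values $a = 3$, $b = 1$).

```python
import numpy as np

# Cyclic graph on 2n nodes with a 2-periodic signal (a on even nodes, b on odd nodes).
n, a, b = 4, 3.0, 1.0
N = 2 * n
W = np.zeros((N, N))
for i in range(N):
    W[i, (i + 1) % N] = W[(i + 1) % N, i] = 1.0
x = np.array([a if i % 2 == 0 else b for i in range(N)])

# GCN filter (Eq. 1 with theta = 1): (I + D^{-1/2} W D^{-1/2}) x.
d_inv_sqrt = W.sum(axis=1) ** -0.5
gcn_out = (np.eye(N) + d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]) @ x
print("GCN filter:   ", np.round(gcn_out, 4))   # constant: every entry equals a + b

# Scattering wavelet Psi_0 = I - P, with lazy random walk P = 1/2 (I + W D^{-1}).
P = 0.5 * (np.eye(N) + W / W.sum(axis=0)[None, :])
sct_out = (np.eye(N) - P) @ x
print("Wavelet Psi_0:", np.round(sct_out, 4))   # still alternating: +/-(a - b)/2
```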
Beyond the information encoded in node features, graph wavelets encode geometric information even when it is not carried by input signals. Such a property has already been established, e.g., in the context of community detection, where white noise signals can be used in conjunction with graph wavelets to cluster nodes and reveal faithful community structures [27]. To demonstrate a similar property in the context of GCN and scattering channels, we give an example of a simple graph structure with two cyclic substructures of different sizes (or cycle lengths) that are connected by one bottleneck edge. In this case, it can be verified that even with a constant input signal, some geometric information is encoded by its convolution with graph filters, as illustrated in Fig. 3 (we refer the reader to the supplement for exact calculation of filter responses). However, as demonstrated in this case, while the information provided by the GCN filter responses from Eq. 1 is not constant, it does not distinguish between the two cyclic structures (and a similar pattern can be verified for related low-pass filters). Formally, each node in one cycle is shown to have at least one node in the other cycle with the same filter response. In contrast, the information extracted by the wavelet filter responses (used in geometric scattering) distinguishes between the cycles and would allow for their separation. We note that this property generalizes to other cycle lengths as discussed in the supplement, but leave more extensive study of geometric information encoding in graph wavelets to future work.
Figure 3:
Filter responses for (a) the GCN filter (Eq. 1) and (b) a scattering filter applied to a constant signal over a graph with two cyclic substructures connected by a single-edge bottleneck. Color coding differs slightly between plots, but is consistent within each plot, indicating nodes with numerically indistinguishable response values.
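The construction behind Fig. 3 can be reproduced qualitatively with the sketch below (our illustration; the cycle lengths, the wavelet scale, and $\theta = 1$ are arbitrary choices, not necessarily those used for the figure).

```python
import numpy as np

def two_cycles_with_bridge(n1=6, n2=8):
    """Two cycles of lengths n1 and n2, joined by a single bottleneck edge."""
    N = n1 + n2
    W = np.zeros((N, N))
    for i in range(n1):
        W[i, (i + 1) % n1] = W[(i + 1) % n1, i] = 1.0
    for i in range(n2):
        W[n1 + i, n1 + (i + 1) % n2] = W[n1 + (i + 1) % n2, n1 + i] = 1.0
    W[0, n1] = W[n1, 0] = 1.0          # bridge edge between the two cycles
    return W

n1 = 6
W = two_cycles_with_bridge(n1=n1, n2=8)
x = np.ones(W.shape[0])                # constant input signal

# GCN filter response (Eq. 1 with theta = 1).
d_inv_sqrt = W.sum(axis=1) ** -0.5
gcn = (np.eye(len(x)) + d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]) @ x

# Wavelet response Psi_2 = P^2 - P^4 with the lazy random walk P.
P = 0.5 * (np.eye(len(x)) + W / W.sum(axis=0)[None, :])
sct = (np.linalg.matrix_power(P, 2) - np.linalg.matrix_power(P, 4)) @ x

# Compare the sets of responses on each cycle: every GCN value in one cycle also
# appears in the other, whereas the wavelet responses are expected to separate
# the two cycles (cf. Fig. 3 and the supplement).
print("GCN, cycle 1:    ", np.round(np.sort(gcn[:n1]), 4))
print("GCN, cycle 2:    ", np.round(np.sort(gcn[n1:]), 4))
print("Wavelet, cycle 1:", np.round(np.sort(sct[:n1]), 4))
print("Wavelet, cycle 2:", np.round(np.sort(sct[n1:]), 4))
```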
8. Empirical Results
To evaluate our Scattering GCN approach, we compare it to several established methods for semi-supervised node classification, including the original GCN [4], which is known to be subject to the oversmoothing problem, as discussed in [8] and in Sec. 1 and 3 here. Further, we compare our approach with two recent methods that address the oversmoothing problem. The approach in [8] directly addresses oversmoothing in GCNs by using partially absorbing random walks [28] to mitigate rapid mixing of node features in highly connected graph regions. The graph attention network (GAT) [10] indirectly addresses oversmoothing by training adaptive node-wise weighting of the smoothing operation via an attention mechanism. Furthermore, we also include two alternatives to GCN networks based on Chebyshev polynomial filters [20] and belief propagation of label information [29] computed via Gaussian random fields. Finally, we include two baseline approaches to verify the contribution of our hybrid approach: the classifier from [13], which is solely based on handcrafted graph-scattering features, and an SVM classifier acting directly on node features without considering graph edges, which does not incorporate any geometric information.
The methods from [4, 8, 10, 20, 29] were all executed using the original implementations accompanying their publications. These are tuned and evaluated using the standard splits provided for the benchmark datasets for fair comparison. We ensure that the reported classification accuracies agree with previously published results when available. The tuning of our method (including hyperparameters and composition of GCN and scattering channels) on each dataset was done via grid search (over a fixed set of choices for all datasets) using the same cross validation setup used to tune competing methods. For further details, we refer the reader to the supplement, which contains an ablation study evaluating the importance of each component in our proposed architecture.
Our comparisons are based on four popular graph datasets with varying sizes and connectivity structures summarized in Tab. 1 (see, e.g., [30] for Citeseer, Cora, and Pubmed, and [17] for DBLP). We order the datasets by increasing connectivity structure, reflected by their node degrees and edges-to-nodes ratios. As discussed in [8], increased connectivity leads to faster mixing of node features in GCN, exacerbating the oversmoothing problem (as nodes quickly become indistinguishable) and degrading classification performance. Therefore, we expect the impact of scattering channels and the relative improvement achieved by Scattering GCN to correspond to the increasing connectivity order of datasets in Tab. 1, which is maintained for our reported results in Tab. 2 and Fig. 4.
Table 1:
Dataset characteristics: number of nodes, edges, and features; mean ± std. of node degrees; ratio of #edges to #nodes.
Dataset | Nodes | Edges | Features | Degrees | Edges/Nodes
---|---|---|---|---|---
Citeseer | 3,327 | 4,732 | 3,703 | 3.77±3.38 | 1.42
Cora | 2,708 | 5,429 | 1,433 | 4.90±5.22 | 2.00
Pubmed | 19,717 | 44,338 | 500 | 5.50±7.43 | 2.25
DBLP | 17,716 | 52,867 | 1,639 | 6.97±9.35 | 2.98
Table 2:
Classification accuracy (top two marked in bold; best one underlined) of Scattering GCN on four benchmark datasets compared to four other GNNs [10, 8, 4, 20], a non-GNN approach [29] based on belief propagation, a pure graph scattering baseline [13], and a nongeometric baseline only using node features with linear SVM.
Model | Citeseer | Cora | Pubmed | DBLP
---|---|---|---|---
Scattering GCN (ours) | 71.7 | 84.2 | 79.4 | 81.5
GAT [10] | 72.5 | 83.0 | 79.0 | 66.1
Partially absorbing [8] | 71.2 | 81.7 | 79.2 | 56.9
GCN [4] | 70.3 | 81.5 | 79.0 | 59.3
Chebyshev [20] | 69.8 | 78.1 | 74.4 | 57.3
Label Propagation [29] | 58.2 | 77.3 | 71.0 | 53.0
Graph scattering [13] | 67.5 | 81.9 | 69.8 | 69.4
Node features (SVM) | 61.1 | 58.0 | 49.9 | 48.2
Figure 4:
Impact of training set size (top) and training time (bottom) on classification accuracy and validation error, respectively; training size is measured relative to the original training size of each dataset; training time and validation error are plotted on logarithmic scales; runtime was measured for all methods on the same hardware, using the original implementations accompanying their publications.
We first consider test classification accuracy reported in Tab. 2, which shows that our approach outperforms other methods on three out of the four considered datasets. On the remaining one (namely Citeseer) we are only outperformed by GAT. However, we note that this dataset has the weakest connectivity structure (see Tab. 1) and the most informative node features (e.g., achieving 61.1% accuracy via linear SVM without considering any graph information). In contrast, on DBLP, which has the richest connectivity structure and least informative features (only 48.2% SVM accuracy), we significantly outperform GAT (over 15% improvement), which itself significantly outperforms all other methods (by 6.8% or more) except for the graph scattering baseline from [13].
Next, we consider the impact of training size on classification performance, as we are interested in semi-supervised settings where only a small portion of nodes in the graph are labeled. Fig. 4 (top) presents the classification accuracy (on the validation set) for the training size reduced to 20%, 40%, 60%, 80% and 100% of the original training size available for each dataset. These results indicate that Scattering GCN generally exhibits greater stability to sparse training conditions compared to other methods. Importantly, we note that on Citeseer, while GAT outperforms our method for the original training size, its performance degrades rapidly when training size is reduced below 60% of the original one, at which point Scattering GCN outperforms all other methods. We also note that on Pubmed, even a small decrease in training size (e.g., 80% of original) creates a significant performance gap between Scattering GCN and GAT, which we believe is due to node features being less independently informative in this case (see baseline in Tab. 2) compared to Citeseer and Cora.
Finally, in Fig. 4 (bottom), we consider the evolution of the (validation) classification error during the training process. Overall, our results indicate that the training of Scattering GCN reaches low validation errors significantly faster than Partially Absorbing and GAT³, which are the two other leading methods (in terms of final test accuracy in Tab. 2). On Pubmed, which is the largest dataset considered here (by number of nodes), our error decays at a similar rate to that of GCN, showing a notable gap over all other methods. On DBLP, which has a similar number of nodes but significantly more edges, Scattering GCN takes longer to converge (compared to GCN), but as discussed before, it also demonstrates a significant (double-digit) performance lead compared to all other methods.
9. Conclusion
Our study of semi-supervised node-level classification tasks for graphs presents a new approach to address some of the main concerns and limitations of GCN models. We discuss and consider richer notions of regularity on graphs to expand the GCN approach, which solely relies on enforcing smoothness over graph neighborhoods. This is achieved by incorporating multiple frequency bands of graph signals, which are typically not leveraged in traditional GCN models. Our construction is inspired by geometric scattering, which has mainly been used for whole-graph classification so far. Our results demonstrate several benefits of incorporating the elements presented here (i.e., scattering channels and residual convolutions) into GCN architectures. Furthermore, we expect the incorporation of these elements together in more intricate architectures to provide new capabilities of pattern recognition and local information extraction in graphs. For example, attention mechanisms could be used to adaptively tune scattering configurations at the resolution of each node, rather than the global graph level used here. We leave the exploration of such research avenues for future work.
Broader Impact
Node classification in graphs is an important task that is gaining increasing interest in multiple fields looking into network analysis applications. For example, it is of interest in social studies, where a natural application is the study of social networks and other interaction graphs. Other popular application fields include biochemistry and epidemiology. However, this work is computational in nature and addresses the foundations of graph processing and geometric deep learning. As such, by itself, it is not expected to raise ethical concerns nor to have adverse effects on society.
Acknowledgments and Disclosure of Funding
The authors would like to thank Dongmian Zou for fruitful discussions. This work was partially funded by IVADO Professor startup & operational funds, IVADO Fundamental Research Project grant PRF-2019-3583139727, and NIH grant R01GM135929. The content provided here is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
Footnotes
² In a slight deviation from previous work, here $\mathbf{U}_p$ does not include the outermost nonlinearity in the cascade.
³ The horizontal shift shown for GAT in Fig. 4 (bottom), indicating increased training runtime (based on the original implementation accompanying [10]), could be explained by its optimization process requiring more weights than other methods and intensive gradient computations driven not only by graph nodes, but also by graph edges considered in the multi-head attention mechanism.
Contributor Information
Yimeng Min, Mila – Quebec AI Institute, Montreal, QC, Canada.

Frederik Wenkel, Dept. of Math. and Stat., Université de Montréal; Mila – Quebec AI Institute, Montreal, QC, Canada.

Guy Wolf, Dept. of Math. and Stat., Université de Montréal; Mila – Quebec AI Institute, Montreal, QC, Canada.
References
- [1]. Bronstein Michael M., Bruna Joan, LeCun Yann, Szlam Arthur, and Vandergheynst Pierre. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
- [2]. Gilmer Justin, Schoenholz Samuel S., Riley Patrick F., Vinyals Oriol, and Dahl George E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of PMLR, pages 1263–1272, 2017.
- [3]. Hamilton Will, Ying Zhitao, and Leskovec Jure. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pages 1024–1034, 2017.
- [4]. Kipf Thomas N. and Welling Max. Semi-supervised classification with graph convolutional networks. In the 5th International Conference on Learning Representations (ICLR), 2017.
- [5]. Fout Alex, Byrd Jonathon, Shariat Basir, and Ben-Hur Asa. Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pages 6530–6539, 2017.
- [6]. De Cao Nicola and Kipf Thomas. MolGAN: An implicit generative model for small molecular graphs. In ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
- [7]. Knyazev Boris, Lin Xiao, Amer Mohamed R., and Taylor Graham W. Spectral multigraph networks for discovering and fusing relationships in molecules. In NeurIPS Workshop on Machine Learning for Molecules and Materials, 2018.
- [8]. Li Qimai, Han Zhichao, and Wu Xiao-Ming. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
- [9]. Hoang NT and Maehara Takanori. Revisiting graph neural networks: All we have is low-pass filters. arXiv:1905.09550, 2019.
- [10]. Veličković Petar, Cucurull Guillem, Casanova Arantxa, Romero Adriana, Lio Pietro, and Bengio Yoshua. Graph attention networks. In the 6th International Conference on Learning Representations (ICLR), 2018.
- [11]. Gama Fernando, Ribeiro Alejandro, and Bruna Joan. Diffusion scattering transforms on graphs. In the 7th International Conference on Learning Representations (ICLR), 2019.
- [12]. Gao Feng, Wolf Guy, and Hirn Matthew. Geometric scattering for graph data analysis. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of PMLR, pages 2122–2131, 2019.
- [13]. Zou Dongmian and Lerman Gilad. Graph convolutional neural networks via scattering. Applied and Computational Harmonic Analysis, 49(3):1046–1074, 2020.
- [14]. Mallat Stéphane. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
- [15]. Hammond David K., Vandergheynst Pierre, and Gribonval Rémi. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
- [16]. Coifman Ronald R. and Maggioni Mauro. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.
- [17]. Pan Shirui, Wu Jia, Zhu Xingquan, Zhang Chengqi, and Wang Yang. Tri-party deep network representation. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 1895–1901, 2016.
- [18]. Chung Fan R. K. Spectral Graph Theory. American Mathematical Society, 1997.
- [19]. Shuman David I., Ricaud Benjamin, and Vandergheynst Pierre. Vertex-frequency analysis on graphs. Applied and Computational Harmonic Analysis, 40(2):260–291, 2016.
- [20]. Defferrard Michaël, Bresson Xavier, and Vandergheynst Pierre. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), volume 29, pages 3844–3852, 2016.
- [21]. Susnjara Ana, Perraudin Nathanael, Kressner Daniel, and Vandergheynst Pierre. Accelerated filtering on graphs using Lanczos method. arXiv:1509.04537, 2015.
- [22]. Liao Renjie, Zhao Zhizhen, Urtasun Raquel, and Zemel Richard. LanczosNet: Multi-scale deep graph convolutional networks. In the 7th International Conference on Learning Representations (ICLR), 2019.
- [23]. Oyallon Edouard. Interferometric graph transform: a deep unsupervised graph representation. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of PMLR, 2020.
- [24]. Shuman David I., Narang Sunil K., Frossard Pascal, Ortega Antonio, and Vandergheynst Pierre. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
- [25]. Perlmutter Michael, Gao Feng, Wolf Guy, and Hirn Matthew. Understanding graph neural networks with asymmetric geometric scattering transforms. arXiv:1911.06253, 2019.
- [26]. He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [27]. Roddenberry T. Mitchell, Schaub Michael T., Wai Hoi-To, and Segarra Santiago. Exact blind community detection from signals on multiple graphs. arXiv:2001.10944, 2020.
- [28]. Wu Xiao-Ming, Li Zhenguo, So Anthony Man-Cho, Wright John, and Chang Shih-Fu. Learning with partially absorbing random walks. In Advances in Neural Information Processing Systems (NeurIPS), volume 25, pages 3077–3085, 2012.
- [29]. Zhu Xiaojin, Ghahramani Zoubin, and Lafferty John D. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 912–919, 2003.
- [30]. Yang Zhilin, Cohen William, and Salakhutdinov Ruslan. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning (ICML), volume 48 of PMLR, pages 40–48, 2016.