
GEOMETRIC SCATTERING ATTENTION NETWORKS

Yimeng Min 1,3,*, Frederik Wenkel 2,3,*, Guy Wolf 2,3

Abstract

Geometric scattering has recently gained recognition in graph representation learning, and recent work has shown that integrating scattering features in graph convolution networks (GCNs) can alleviate the typical oversmoothing of features in node representation learning. However, scattering often relies on handcrafted design, requiring careful selection of frequency bands via a cascade of wavelet transforms, as well as an effective weight sharing scheme to combine low- and band-pass information. Here, we introduce a new attention-based architecture to produce adaptive task-driven node representations by implicitly learning node-wise weights for combining multiple scattering and GCN channels in the network. We show the resulting geometric scattering attention network (GSAN) outperforms previous networks in semi-supervised node classification, while also enabling a spectral study of extracted information by examining node-wise attention weights.

Index Terms: Graph neural networks, geometric scattering, attention, node classification, geometric deep learning

1. INTRODUCTION

Convolutional neural networks (CNNs) have shown great success on a range of tasks, including image classification, machine translation and speech recognition. By optimizing local filters in neural network-based architectures, models are able to learn expressive representations and thus perform well on regular Euclidean data. Based on CNNs, graph neural networks (GNNs) [1–4] show promising results on non-Euclidean data for tasks such as molecule modelling or node classification. The generalization from regular grids to irregular domains is usually implemented using the spectral graph theory framework [5, 6]. While several approaches exist to implement such filters, most popular GNNs (cf. [2–4]) tend to implement message passing operations that aggregate neighbourhood information using a one-step neighbourhood-localized filter, which corresponds to the lowest frequency of the graph Laplacian.

Previous studies suggest that such local-smoothing convolutions, which can be interpreted as low-pass filtering of the feature vectors [7, 8], force a smooth embedding of neighbouring nodes, leading to information loss during message passing and severely degrading performance [9, 10]. To assign different weights to the neighbouring nodes, graph attention networks (GATs) use attention layers to learn adaptive weights across the edges, resulting in a leap in model capacity [3]. Such attention layers additionally increase the model interpretability. However, the resulting networks still rely on averaging neighboring node features for similarity computations, rather than leveraging more complex patterns.

In order to capture higher-order regularity on graphs, geometric scattering networks were recently introduced [11–14]. These generalize the Euclidean scattering transform [15–17] to the graph domain and leverage graph wavelets to extract effective and efficient graph representations. In [18], a hybrid scattering graph convolutional network (Sc-GCN) is proposed in order to tackle oversmoothing in traditional GCNs [2]. Geometric scattering and GCN-based filters are used together to apply both band-pass and low-pass filters to the graph signal. We note that some non-hybrid approaches have been proposed to learn band-pass filters via their spectral coefficients, but their advantages over smoothing (or low-pass) based architectures are inconclusive on node level tasks (see, e.g., studies in [19]). Furthermore, as shown in [18], the hybrid Sc-GCN approach significantly outperforms such approaches (in particular [6]) on several benchmarks. However, even though Sc-GCN achieves good performance on a range of node level classification tasks, it requires the selection of a task-appropriate configuration of the network and its scattering wavelet composition to carefully balance low-pass and band-pass information.

Here, we introduce a geometric scattering attention network (GSAN) that combines the hybrid Sc-GCN approach with a node-wise attention mechanism to automatically adapt its filter (or channel) composition, thus simplifying its architecture tuning. We evaluate our proposed approach on a variety of semi-supervised node classification benchmarks, demonstrating its performance improvement over previous GNNs. Analyzing the node-wise distributions of attention weights further enables a deeper understanding of the network mechanics by relating the node-level task-dependent information (label) with the corresponding feature selection.

2. PRELIMINARIES

We consider a weighted graph $G = (V, E, w)$ with nodes $V := \{v_1, \dots, v_n\}$ and (undirected) edges $E \subset \{\{v_i, v_j\} : v_i, v_j \in V,\ i \neq j\}$. The function $w : E \to (0, \infty)$ assigns positive weights to the graph edges, which we aggregate in the adjacency matrix $W \in \mathbb{R}^{n \times n}$ via

$$W[v_i, v_j] := \begin{cases} w(v_i, v_j) & \text{if } \{v_i, v_j\} \in E, \\ 0 & \text{otherwise.} \end{cases}$$

Each node $v_i \in V$ possesses a feature vector $x_i \in \mathbb{R}^{d_0}$. These are aggregated in the feature matrix $X \in \mathbb{R}^{n \times d_0}$. We further define the degree matrix $D \in \mathbb{R}^{n \times n}$ by $D := D(W) := \mathrm{diag}(d_1, \dots, d_n)$ with $d_i := \deg(v_i) := \sum_{j=1}^{n} W[v_i, v_j]$ the degree of node $v_i$, where $\mathrm{diag}(\cdot)$ denotes a diagonal matrix parameterized by its diagonal elements. The GNN methods discussed in the following yield layer-wise node representations, compactly written as $H^{(\ell)} \in \mathbb{R}^{n \times d_\ell}$ for the $\ell$-th layer, with $H^{(0)} := X$.

2.1. Graph Convolutional Networks

A very popular method introduced in [2] connects the local node-based information in X with the intrinsic data-geometry encoded by W. This is realized by filtering the node features with the layer-wise update rule

$$H^{(\ell)} = \sigma\big(A\, H^{(\ell-1)} \Theta^{(\ell)}\big). \tag{1}$$

The matrix multiplication with

$$A := (D + I_n)^{-1/2} (W + I_n) (D + I_n)^{-1/2}$$

constitutes the filtering operation, while the multiplication with $\Theta^{(\ell)}$ can be seen as a fully connected layer applied to the node features. Lastly, an elementwise nonlinearity $\sigma(\cdot)$ is applied.

This method is subject to the so-called oversmoothing problem [9], which causes the node features to be progressively smoothed out as more GCN layers are stacked. In signal processing terminology, the update rule can be interpreted as a low-pass filtering operation [18], so that the model cannot access a significant portion of the information contained in the frequency domain.
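To make the update rule concrete, the following minimal sketch (our own dense-numpy illustration, not the authors' implementation; helper names are ours) builds the filter matrix $A$ and applies one GCN layer as in Eq. 1.

```python
# Minimal sketch of the GCN update rule (Eq. 1) in dense numpy; illustrative only.
import numpy as np

def gcn_filter(W):
    """A = (D + I)^(-1/2) (W + I) (D + I)^(-1/2) for a symmetric adjacency W."""
    n = W.shape[0]
    W_loop = W + np.eye(n)                  # add self-loops
    d = W_loop.sum(axis=1)                  # row sums, i.e. deg(v_i) + 1
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ W_loop @ D_inv_sqrt

def gcn_layer(H, W, Theta):
    """One GCN layer: H_out = ReLU(A H Theta)."""
    return np.maximum(gcn_filter(W) @ H @ Theta, 0.0)
```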

2.2. Geometric Scattering

Recently, geometric scattering was introduced to incorporate band-pass filters in GNNs [11, 18], inspired by the utilization of scattering features in the analysis of images [16, 20] and audio signals [17, 21]. Cascades of wavelets can often recover high-frequency information and geometric scattering exhibits an analogous property on graph domains.

Geometric scattering is based on the lazy random walk matrix $P := \frac{1}{2}(I_n + W D^{-1})$, which is used to construct diffusion wavelet matrices $\Psi_k \in \mathbb{R}^{n \times n}$ [22] of order $k \geq 0$,

$$\Psi_0 := I_n - P, \qquad \Psi_k := P^{2^{k-1}} - P^{2^{k}}, \quad k \geq 1. \tag{2}$$

For node features H, the scattering features are calculated as

$$U_p H := \Psi_{k_m} \big|\Psi_{k_{m-1}} \cdots \big|\Psi_{k_2} \big|\Psi_{k_1} H\big|\big| \cdots \big|,$$

with $p := (k_1, \dots, k_m) \in \mathbb{N}_0^m$ parameterizing the sequence of wavelets, which are separated by elementwise absolute value operations. The layer-wise update rule has the form

$$H^{(\ell)} := \sigma\big(U_p H^{(\ell-1)} \Theta^{(\ell)}\big). \tag{3}$$

In [18], a hybrid architecture (referred to as Sc-GCN here) is proposed in order to combine the benefits of both GCN and scattering filters. To this end, network channels $\{H_i^{(\ell)}\}_{i=1}^{m}$, each coming from either GCN (Eq. 1) or scattering (Eq. 3) filters, are concatenated horizontally, constituting the hybrid layer

$$H^{(\ell)} := \big[H_1^{(\ell)} \,\big\Vert\, \cdots \,\big\Vert\, H_m^{(\ell)}\big].$$
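As an illustration of Eqs. 2 and 3, the sketch below (again a dense-numpy toy with helper names of our own choosing) constructs the lazy random walk matrix $P$, the diffusion wavelets $\Psi_k$, and a scattering feature $U_p H$ for a given wavelet path $p$.

```python
# Sketch of the lazy random walk, diffusion wavelets (Eq. 2) and a scattering
# feature U_p H; a dense-numpy illustration, not the authors' code.
import numpy as np

def lazy_walk(W):
    """P = (I + W D^{-1}) / 2, with D the diagonal degree matrix of W."""
    n = W.shape[0]
    d = W.sum(axis=1)                     # node degrees (W assumed symmetric)
    return 0.5 * (np.eye(n) + W / d)      # W / d divides column j by deg(v_j)

def wavelets(W, max_order):
    """Return [Psi_0, ..., Psi_K] with Psi_k = P^(2^(k-1)) - P^(2^k)."""
    P = lazy_walk(W)
    Psi = [np.eye(W.shape[0]) - P]        # Psi_0 = I - P
    P_pow = P                             # P^(2^0)
    for _ in range(1, max_order + 1):
        P_next = P_pow @ P_pow            # square to obtain P^(2^k)
        Psi.append(P_pow - P_next)
        P_pow = P_next
    return Psi

def scattering(H, Psi, path):
    """U_p H for p = (k_1, ..., k_m), with |.| between successive wavelets."""
    out = Psi[path[0]] @ H
    for k in path[1:]:
        out = Psi[k] @ np.abs(out)
    return out
```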

2.3. Graph Residual Convolution

This architecture component from [18] constitutes an adjustable low-pass filter, parameterized by the matrix

$$A_{\mathrm{res}}(\alpha) = \frac{1}{\alpha + 1}\big(I_n + \alpha\, W D^{-1}\big),$$

which is usually applied to $H^{(\ell)}$, followed by a fully connected layer (without nonlinearity). It filters high-frequency noise, which can arise from the scattering features, out of the hybrid layer output.
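A corresponding sketch of the residual filter, under the same dense-numpy conventions as above (illustrative only), is:

```python
# Sketch of the graph residual convolution filter from Sec. 2.3.
import numpy as np

def residual_filter(W, alpha):
    """A_res(alpha) = (I + alpha * W D^{-1}) / (alpha + 1)."""
    n = W.shape[0]
    d = W.sum(axis=1)                     # degrees (W assumed symmetric)
    return (np.eye(n) + alpha * (W / d)) / (alpha + 1.0)
```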

2.4. Graph Attention Networks

Another popular approach for node classification tasks was introduced in [3], where at any node $v_i$, an attention mechanism attends over the aggregation of node features from the node neighborhood $N_i$. The aggregation coefficients are learned via

$$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{T}[\Theta h_i \,\Vert\, \Theta h_j]\big)\big)}{\sum_{v_k \in N_i} \exp\big(\mathrm{LeakyReLU}\big(a^{T}[\Theta h_i \,\Vert\, \Theta h_k]\big)\big)},$$

where $h_i \in \mathbb{R}^{d}$ is the feature vector of node $v_i$, $\Theta \in \mathbb{R}^{d \times d}$ is the weight matrix and $a \in \mathbb{R}^{2d}$ is the attention vector. The output feature is $h_i' = \sigma\big(\sum_{j \in N_i} \alpha_{ij} \Theta h_j\big)$. For more expressivity, multi-head attention is used to generate concatenated features,

$$h_i' = \big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{k} \Theta^{k} h_j\Big),$$

where K is the number of attention heads.
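The attention coefficients above can be illustrated with the following sketch, a naive dense-numpy loop over nodes meant only to mirror the formulas (the 0.2 LeakyReLU slope is an assumption, and the function names are ours):

```python
# Naive dense-numpy sketch of GAT-style attention coefficients (Sec. 2.4).
import numpy as np

def gat_coefficients(H, W, Theta, a, slope=0.2):
    """alpha[i, j]: softmax over the neighborhood N_i of LeakyReLU edge scores."""
    Z = H @ Theta.T                       # transformed features Theta h_i
    n = Z.shape[0]
    alpha = np.zeros((n, n))
    for i in range(n):
        nbrs = np.flatnonzero(W[i] > 0)
        if nbrs.size == 0:
            continue
        scores = np.array([a @ np.concatenate([Z[i], Z[j]]) for j in nbrs])
        scores = np.where(scores > 0, scores, slope * scores)   # LeakyReLU
        weights = np.exp(scores - scores.max())                 # stable softmax
        alpha[i, nbrs] = weights / weights.sum()
    return alpha
```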

3. SCATTERING ATTENTION LAYER

Inspired by the recent work in Sec. 2, we introduce an attention framework to combine multiple channels corresponding to GCN and scattering filters while adaptively assigning different weights to them based on filtered node features. While our full network uses multi-head attention, we focus here on the processing performed independently by each attention head, deferring the multi-head configuration details to the general discussion of network architecture in the next section.

For every attention head, we first linearly transform the feature matrix $H^{(\ell-1)}$ with a matrix $\Theta^{(\ell)}$, setting the transformed feature matrix to be $\bar{H}^{(\ell)} = H^{(\ell-1)} \Theta^{(\ell)}$. Then, based on the scattering GCN approach (Sec. 2.2 and [18]), a set of $C_{gcn}$ GCN channels and $C_{sct}$ scattering channels are calculated,

$$\bar{H}_{gcn,1}^{(\ell)} = A\,\bar{H}^{(\ell)},\ \dots,\ \bar{H}_{gcn,C_{gcn}}^{(\ell)} = A^{C_{gcn}}\bar{H}^{(\ell)} \quad \text{(GCN channels)},$$
$$\bar{H}_{sct,1}^{(\ell)} = \big|U_{p_1}\bar{H}^{(\ell)}\big|^{q},\ \dots,\ \bar{H}_{sct,C_{sct}}^{(\ell)} = \big|U_{p_{C_{sct}}}\bar{H}^{(\ell)}\big|^{q} \quad \text{(scattering channels)}.$$

The channels $\bar{H}_{gcn,i}^{(\ell)}$ perform low-pass operations with different spatial support, aggregating information from $1, \dots, C_{gcn}$-step neighborhoods, respectively, while $\bar{H}_{sct,k}^{(\ell)}$, defined according to Eq. 3, enables band-pass filtering of graph signals.

Next, we compute attention coefficients that will be used in a shared attention layer to combine the filtered channels. In order to compute these node-wise attention coefficients for each channel, we first compute

$$e_{gcn,i}^{(\ell)} = \mathrm{LeakyReLU}\Big(\big[\bar{H}^{(\ell)} \,\big\Vert\, \bar{H}_{gcn,i}^{(\ell)}\big]\, a\Big),$$

with analogous $e_{sct,k}^{(\ell)}$ and $a \in \mathbb{R}^{2d}$ being a shared attention vector across all channels. We interpret $e_{gcn,j}^{(\ell)}, e_{sct,k}^{(\ell)} \in \mathbb{R}^{n}$ as score vectors indicating the importance of each channel.

Finally, the attention scores are normalized across all channels using the softmax function, yielding

$$\alpha_{gcn,i}^{(\ell)} = \frac{\exp\big(e_{gcn,i}^{(\ell)}\big)}{\sum_{j=1}^{C_{gcn}} \exp\big(e_{gcn,j}^{(\ell)}\big) + \sum_{k=1}^{C_{sct}} \exp\big(e_{sct,k}^{(\ell)}\big)},$$

with analogous $\alpha_{sct,k}^{(\ell)}$. Note that the exponential function is applied elementwise here. To obtain comparable weights when aggregating the $C_{gcn} + C_{sct} =: C$ channels, we set

$$H^{(\ell)} = C^{-1}\, \sigma\Big(\sum_{j=1}^{C_{gcn}} \alpha_{gcn,j}^{(\ell)} \bar{H}_{gcn,j}^{(\ell)} + \sum_{k=1}^{C_{sct}} \alpha_{sct,k}^{(\ell)} \bar{H}_{sct,k}^{(\ell)}\Big),$$

where σ(.) = ReLU(.) is used as nonlinearity here.
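Putting this section together, the sketch below traces one attention head end to end. It reuses the gcn_filter, wavelets and scattering helpers from the earlier sketches, and assumes given (e.g., learned) parameters $\Theta^{(\ell)}$ and $a$ as well as first-order wavelet paths (matching the configuration in Sec. 4); it is an illustrative approximation, not the authors' code.

```python
# End-to-end sketch of one scattering attention head (Sec. 3); reuses
# gcn_filter, wavelets and scattering defined in the sketches above.
import numpy as np

def channel_scores(H_bar, channel, a, slope=0.2):
    """e = LeakyReLU([H_bar || channel] a): one score per node for a channel."""
    e = np.concatenate([H_bar, channel], axis=1) @ a
    return np.where(e > 0, e, slope * e)

def scattering_attention_head(H, W, Theta, a, paths=((1,), (2,), (3,)), q=1):
    H_bar = H @ Theta                                  # shared linear transform
    A = gcn_filter(W)
    Psi = wavelets(W, max_order=3)

    channels, A_pow = [], np.eye(W.shape[0])
    for _ in range(3):                                 # C_gcn = 3 low-pass channels
        A_pow = A_pow @ A
        channels.append(A_pow @ H_bar)
    for p in paths:                                    # C_sct = 3 band-pass channels
        channels.append(np.abs(scattering(H_bar, Psi, p)) ** q)

    scores = np.stack([channel_scores(H_bar, Hc, a) for Hc in channels])
    alphas = np.exp(scores - scores.max(axis=0))       # node-wise softmax
    alphas = alphas / alphas.sum(axis=0)               # over all C channels

    out = sum(alphas[c][:, None] * channels[c] for c in range(len(channels)))
    return np.maximum(out / len(channels), 0.0)        # ReLU, scaled by 1/C
```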

4. SCATTERING ATTENTION NETWORK

The full network architecture in all examples shown here uses one scattering attention layer (as described in Sec. 3), applied to the input node features, followed by a residual convolution layer (see Sec. 2.3), which then produces the output via a fully connected layer. We note that deeper configurations are possible, especially when processing big graphs, but for simplicity, we focus on a single fixed architecture here. This is similar to the design choice utilized in GAT [3]. Figure 1 illustrates this network structure and further technical details on each of its components are provided below.

Fig. 1. Illustration of the proposed network architecture.

Attention layer configuration.

In this work, for simplicity, we set $C_{gcn} = C_{sct} = 3$; thus, the attention layer combines three low-pass channels and three band-pass channels. The aggregation process of the attention layer is shown in Fig. 2, where $U_{1,2,3}$ represent three first-order scattering transformations with $U_1 x := \Psi_1 x$, $U_2 x := \Psi_2 x$ and $U_3 x := \Psi_3 x$.

Fig. 2. Illustration of the proposed scattering attention layer. Attention weights are computed from a concatenation of the transformed layer input $\bar{H}$ together with filtered signals that are first computed from it and then used to produce the layer output $H$ via an attention-weighted linear combination.

Multihead attention.

Similar to other applications of attention mechanisms [3], we use multi-head attention here for stabilizing the training, thus rewriting the output of the $\ell$-th layer (by a slight abuse of notation) as

$$H^{(\ell)} \leftarrow \big\Vert_{\gamma=1}^{\Gamma}\, H^{(\ell)}\big[\Theta^{(\ell)} \to \Theta_\gamma^{(\ell)};\ \alpha^{(\ell)} \to \alpha_\gamma^{(\ell)}\big], \tag{5}$$

combining Γ attention heads, where Γ is tuned as a hyperparameter of the network.
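A minimal sketch of this multi-head combination, reusing the single-head function sketched in Sec. 3 and assuming per-head parameter lists (our own schematic, not the released implementation), could look as follows:

```python
# Sketch of the multi-head combination in Eq. 5: Gamma heads, each with its
# own Theta and attention vector a, run independently and concatenated.
import numpy as np

def multi_head_layer(H, W, Thetas, attn_vecs):
    heads = [scattering_attention_head(H, W, Th, a)    # one output per head
             for Th, a in zip(Thetas, attn_vecs)]
    return np.concatenate(heads, axis=1)               # horizontal concatenation
```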

Residual convolution.

To eliminate high-frequency noise, the graph residual convolution (Sec. 2.3) is applied to the output of the concatenated multi-head scattering attention layer (Eq. 5), with α tuned as a hyperparameter of the network.

5. RESULTS

To evaluate our geometric scattering attention network (GSAN), we apply it to semi-supervised node classification and compare its results on several benchmarks to two popular graph neural networks (namely GCN [2] and GAT [3]), as well as the original Sc-GCN, which does not utilize attention mechanisms and is instead tuned via extensive hyperparameter grid search. These methods are applied to eight benchmark datasets of varied sizes and homophily (i.e., average class similarity across edges), as shown in Tab. 1. Texas and Chameleon are low-homophily datasets where nodes correspond to webpages and edges to links between them, with classes corresponding to webpage topic or monthly traffic (discretized into five levels), respectively [23]. Wiki-CS is a recently proposed benchmark, where the nodes represent computer science articles and the edges represent the hyperlinks [24]. The rest of the datasets are citation networks from different sources (i.e., Cora, Citeseer, Pubmed, DBLP), where nodes correspond to papers and edges to citations [25, 26]. CoraFull is the larger version of the Cora dataset [27].

Table 1.

Dataset characteristics & comparison of node classification test accuracy. Datasets are ordered by increasing homophily.

Dataset Classes Nodes Edges Homophily GCN GAT Sc-GCN GSAN (ours)
Texas 5 183 295 0.11 59.5 58.4 60.3 60.5
Chameleon 5 2,277 31,421 0.23 28.2 42.9 51.2 61.2
CoraFull 70 19,793 63,421 0.57 62.2 51.9 62.5 64.3
Wiki-CS 10 11,701 216,123 0.65 77.2 77.7 78.1 78.6
Citeseer 6 3,327 4,676 0.74 70.3 72.5 71.7 71.3
Pubmed 3 19,717 44,327 0.80 79.0 79.0 79.4 79.8
Cora 7 2,708 5,276 0.81 81.5 83.0 84.2 84.0
DBLP 4 17,716 52,867 0.83 59.3 66.1 81.5 82.6

All datasets are split into train, validation and test sets. The validation set is used for hyperparameter selection via grid search, including the number of heads Γ, the residual parameter α and the channel widths (i.e., number of neurons).

The results in Tab. 1 indicate that we improve upon previous methods, including the hybrid Sc-GCN that requires more intricate hyperparameter tuning to balance low-pass and band-pass channels [18]. Since the attention weights α are computed separately for every node (see Sec. 3), they can also help understand the utilization of different channels in different regions of the graph. To demonstrate such analysis, we consider here the ratio between node-wise attention assigned to band-pass and low-pass channels. Over nodes and heads, we sum up the total attention $\sum_{i=1}^{C_{sct}} \mathbf{1}_n^T \alpha_{sct,i}$ in the three scattering channels $U_{1,2,3}$, and $\sum_{i=1}^{C_{gcn}} \mathbf{1}_n^T \alpha_{gcn,i}$ in the three GCN channels $A^{1,2,3}$. Finally, we compute the band-pass vs. low-pass ratio $\sum_{i=1}^{C_{sct}} \mathbf{1}_n^T \alpha_{sct,i} \,\big/\, \sum_{i=1}^{C_{gcn}} \mathbf{1}_n^T \alpha_{gcn,i}$. Figure 3 demonstrates the additional insight provided by the distribution of these attention scores on four of the benchmark datasets. Wider spread, indicating highly varied channel utilization, is exhibited by DBLP and Chameleon where GSAN achieves significant improvement over GCN and GAT. Further, the improvement of GSAN over Sc-GCN on Chameleon highlights the importance of the node-wise feature selection in low-homophily settings. While Citeseer and Wiki-CS exhibit smaller spreads, the latter attributes more attention to band-pass channels, which we interpret as related to lower homophily.
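For illustration, this ratio could be computed from per-channel attention weights roughly as follows (assuming an array of node-wise weights such as the one produced inside the attention-head sketch of Sec. 3; this is our own schematic, not the authors' analysis code):

```python
# Schematic computation of the band-pass vs. low-pass attention ratio (Fig. 3),
# given node-wise channel weights alphas of shape (C_gcn + C_sct, n).
import numpy as np

def bandpass_lowpass_ratio(alphas, C_gcn=3):
    low_pass = alphas[:C_gcn].sum()     # total attention on GCN channels
    band_pass = alphas[C_gcn:].sum()    # total attention on scattering channels
    return band_pass / low_pass
```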

Fig. 3. Distribution of the attention ratio between band-pass (scattering) and low-pass (GCN) channels across nodes and heads in DBLP, Chameleon, Citeseer, and Wiki-CS.

6. CONCLUSIONS

The presented geometric scattering attention network (GSAN) introduces a new approach that leverages node-wise attention to incorporate both geometric scattering [11–14] and GCN [2] channels to form a hybrid model, further advancing the recently proposed Sc-GCN [18]. Beyond its efficacy in semi-supervised node classification, the distribution of its learned attention scores provides a promising tool to study the spectral composition of information extracted from node features. We expect this to enable future work to distill tractable notions of regularity on graphs to better understand and leverage their intrinsic structure in geometric deep learning, and to incorporate such attention mechanisms in spectral GNNs to both learn filter banks and perform node-wise selection of specific filters used in each local region of the graph.

Acknowledgments

This work was partially funded by IVADO Professor startup & operational funds, IVADO Fundamental Research Proj. grant PRF-2019–3583139727, and NIH grant R01GM135929. The content provided here is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

7. REFERENCES

  • [1]. Bronstein Michael M., Bruna Joan, LeCun Yann, Szlam Arthur, and Vandergheynst Pierre, "Geometric deep learning: Going beyond Euclidean data," IEEE Sig. Proc. Mag., vol. 34, no. 4, pp. 18–42, 2017.
  • [2]. Kipf Thomas N. and Welling Max, "Semi-supervised classification with graph convolutional networks," in the 5th ICLR, 2017.
  • [3]. Veličković Petar, Cucurull Guillem, Casanova Arantxa, Romero Adriana, Liò Pietro, and Bengio Yoshua, "Graph attention networks," in the 6th ICLR, 2018.
  • [4]. Hamilton Will, Ying Zhitao, and Leskovec Jure, "Inductive representation learning on large graphs," in Advances in NeurIPS, 2017, vol. 30, pp. 1024–1034.
  • [5]. Bruna Joan, Zaremba Wojciech, Szlam Arthur, and LeCun Yann, "Spectral networks and locally connected networks on graphs," arXiv:1312.6203, 2013.
  • [6]. Defferrard Michaël, Bresson Xavier, and Vandergheynst Pierre, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in NeurIPS, 2016, vol. 29.
  • [7]. Hoang NT and Maehara Takanori, "Revisiting graph neural networks: All we have is low-pass filters," arXiv:1905.09550, 2019.
  • [8]. Dwivedi Vijay Prakash, Joshi Chaitanya K., Laurent Thomas, Bengio Yoshua, and Bresson Xavier, "Benchmarking graph neural networks," arXiv:2003.00982, 2020.
  • [9]. Li Qimai, Han Zhichao, and Wu Xiao-Ming, "Deeper insights into graph convolutional networks for semi-supervised learning," in Proc. of the 32nd AAAI Conf. on AI, 2018.
  • [10]. Oono Kenta and Suzuki Taiji, "Graph neural networks exponentially lose expressive power for node classification," in the 8th ICLR, 2020.
  • [11]. Gao Feng, Wolf Guy, and Hirn Matthew, "Geometric scattering for graph data analysis," in Proceedings of the 36th ICML, 2019, pp. 2122–2131.
  • [12]. Gama Fernando, Ribeiro Alejandro, and Bruna Joan, "Diffusion scattering transforms on graphs," in the 7th ICLR, 2019.
  • [13]. Gama Fernando, Ribeiro Alejandro, and Bruna Joan, "Stability of graph scattering transforms," in Advances in NeurIPS, 2019, vol. 32, pp. 8038–8048.
  • [14]. Zou Dongmian and Lerman Gilad, "Graph convolutional neural networks via scattering," Applied and Computational Harmonic Analysis, vol. 49, no. 3, pp. 1046–1074, 2020.
  • [15]. Mallat Stéphane, "Group invariant scattering," Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
  • [16]. Bruna Joan and Mallat Stéphane, "Invariant scattering convolution networks," IEEE Trans. on Patt. Anal. and Mach. Intel., vol. 35, no. 8, pp. 1872–1886, August 2013.
  • [17]. Andén Joakim and Mallat Stéphane, "Deep scattering spectrum," IEEE Trans. on Sig. Proc., vol. 62, no. 16, pp. 4114–4128, August 2014.
  • [18]. Min Yimeng, Wenkel Frederik, and Wolf Guy, "Scattering GCN: Overcoming oversmoothness in graph convolutional networks," in Advances in NeurIPS, 2020, vol. 33.
  • [19]. Bianchi Filippo Maria, Grattarola Daniele, Livi Lorenzo, and Alippi Cesare, "Graph neural networks with convolutional ARMA filters," IEEE Trans. on Patt. Anal. and Mach. Intel., 2021.
  • [20]. Sifre Laurent and Mallat Stéphane, "Rotation, scaling and deformation invariant scattering for texture discrimination," in the 2013 CVPR, June 2013.
  • [21]. Lostanlen Vincent and Mallat Stéphane, "Wavelet scattering on the pitch spiral," in Proc. of the 18th International Conference on Digital Audio Effects, 2015, pp. 429–432.
  • [22]. Coifman RR and Maggioni M, "Diffusion wavelets," Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 53–94, 2006.
  • [23]. Pei Hongbin, Wei Bingzhe, Chang Kevin Chen-Chuan, Lei Yu, and Yang Bo, "Geom-GCN: Geometric graph convolutional networks," in the 8th ICLR, 2020.
  • [24]. Mernyei Péter and Cangea Cătălina, "Wiki-CS: A Wikipedia-based benchmark for graph neural networks," arXiv:2007.02901, 2020.
  • [25]. Yang Zhilin, Cohen William, and Salakhutdinov Ruslan, "Revisiting semi-supervised learning with graph embeddings," in Proceedings of the 33rd ICML, 2016, vol. 48 of PMLR, pp. 40–48.
  • [26]. Pan Shirui, Wu Jia, Zhu Xingquan, Zhang Chengqi, and Wang Yang, "Tri-party deep network representation," in Proc. of the 25th IJCAI, 2016, pp. 1895–1901.
  • [27]. Bojchevski Aleksandar and Günnemann Stephan, "Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking," in the 6th ICLR, 2018.
