Proceedings of the National Academy of Sciences of the United States of America
. 2024 Feb 12;121(8):e2309504121. doi: 10.1073/pnas.2309504121

Homophily modulates double descent generalization in graph convolution networks

Cheng Shi a, Liming Pan b,c,1, Hong Hu d, Ivan Dokmanić a,e,1
PMCID: PMC10895367  PMID: 38346190

Significance

Graph neural networks (GNNs) have been applied with great success across science and engineering, but we do not understand why they work so well. Motivated by experimental evidence of a rich phase diagram of generalization behaviors, we analyzed simple GNNs on a community graph model and derived precise expressions for generalization error as a function of noise in the graph, noise in the features, proportion of labeled data, and the nature of interactions in the graph. Computer experiments show that the analysis also qualitatively explains large “production-scale” networks and can thus be used to improve performance and guide hyperparameter tuning. This is significant both for the downstream science and for the theory of deep learning on graphs.

Keywords: graph neural network, statistical mechanics, homophily, double descent, stochastic block model

Abstract

Graph neural networks (GNNs) excel in modeling relational data such as biological, social, and transportation networks, but the underpinnings of their success are not well understood. Traditional complexity measures from statistical learning theory fail to account for observed phenomena like the double descent or the impact of relational semantics on generalization error. Motivated by experimental observations of “transductive” double descent in key networks and datasets, we use analytical tools from statistical physics and random matrix theory to precisely characterize generalization in simple graph convolution networks on the contextual stochastic block model. Our results illuminate the nuances of learning on homophilic versus heterophilic data and predict double descent whose existence in GNNs has been questioned by recent work. We show how risk is shaped by the interplay between the graph noise, feature noise, and the number of training labels. Our findings apply beyond stylized models, capturing qualitative trends in real-world GNNs and datasets. As a case in point, we use our analytic insights to improve performance of state-of-the-art graph convolution networks on heterophilic datasets.


Graph neural networks (GNNs) recently achieved impressive results on problems as diverse as weather forecasting (1), predicting forces in granular materials (2), or understanding biological molecules (3–5). They have become the de facto machine learning model for datasets with relational information such as interactions in protein graphs or friendships in a social network (6–9). These remarkable successes triggered a wave of research on better, more expressive GNN architectures for diverse tasks, yet there is little theoretical work that studies why and how these networks achieve strong performance.

In this paper, we study generalization in graph neural networks for transductive (semi-supervised) node classification: given a graph G = (V, E), node features x : V → R^F, and labels y : V_train → {−1, 1} for a “training” subset of nodes V_train ⊆ V, we want to learn a rule which assigns labels to nodes in V_test = V ∖ V_train. This setting exhibits a richer generalization phenomenology than the usual supervised learning: in addition to the quality and dimensionality of the features, the generalization error is affected by the quality of relational information (are there missing or spurious edges?), the proportion of observed labels |V_train|/|V|, and the specifics of the interaction between the graph and the features. Additional complexity arises because links in different graphs encode qualitatively distinct semantics. Interactions between proteins are heterophilic; friendships in social networks are homophilic (10). They result in graphs with different structural statistics, which in turn modulate interactions between the graphs and the features (11, 12). Whether and how these factors influence learning and generalization is currently not understood. Outstanding questions include the role of overparameterization and the differences in performance on graphs with different levels of homophily or heterophily. Despite much work showing that in overparameterized models the traditional bias–variance tradeoff is replaced by the so-called double descent, there have been no reports nor analyses of double descent in transductive graph learning. Recent work speculates that this is due to implicit regularization (13).

Toward addressing this gap, we derive a precise characterization of generalization in simple graph convolution networks (GCNs) in semi-supervised* node classification on random community graphs. We motivate this setting by first presenting a sequence of experimental observations that point to universal behaviors in a variety of GNNs on a variety of domains.

In particular, we argue that in the transductive setting a natural way to “diagnose” double descent is by varying the number of labels available for training (Section 1). We then design experiments that show that double descent is in fact ubiquitous in GNNs: There is often a counterintuitive regime where more training data hurts generalization (14). Understanding this regime has important implications for the (often costly) label collection and questions of observability of complex systems (15). While earlier work reports similar behavior in standard supervised learning, our transductive version demonstrates it directly (14, 16). On the other hand, we indeed find that for many combinations of relational datasets and GNNs, double descent is mitigated by implicit or explicit regularization. Interestingly, the risk curves are affected not only by the properties of the models and data (14), but also by the level of homophily or heterophily in the graphs.

Motivated by these findings we then present our main theoretical result: a precise analysis of generalization on the contextual stochastic block model (CSBM) with a simple GCN. We combine tools from statistical physics and random matrix theory and derive generalization curves either in closed form or as solutions to tractable low-dimensional optimization problems. To carry out our theoretical analysis, we formulate a universality conjecture which states that in the limit of large graphs, the risks in GCNs with polynomial filters do not change if we replace random binary adjacency matrices with random Gaussian matrices. We empirically verify the validity of this conjecture in a variety of settings; we think it may serve as a starting point for future analyses of deep GNNs.

These theoretical results allow us to effectively explore a range of questions. For example, in Section 3 we show that double descent also appears when we fix the (relative) number of observed labels and vary the relative model complexity (Fig. 5). This setting is close but not identical to the usual supervised double descent (17). We also explain why self-loops improve the performance of GNNs on homophilic (18) but not heterophilic (11, 12) graphs, as empirically established in a number of papers, and show that negative self-loops benefit learning on heterophilic graphs (19, 20). We then go back to experiment and show that building negative self-loop filters into state-of-the-art GCNs can further improve their performance on heterophilic graphs. This can be seen as a theoretical GCN counterpart of recent observations in the message passing literature (19, 20) and an explicit connection with heterophily for architectures such as GraphSAGE, which can implement analogous logic (9).

Fig. 5.

Test risk as a function of relative model complexity α = γ^{-1}: different levels of homophily lead to distinct types of double descent in the CSBM. Plots from Left to Right (with increasing λ) show curves for graphs of decreasing randomness. Varying model complexity in GNNs yields nonmonotonic curves similar to those in earlier studies of double descent in supervised (inductive) learning. Note that the overall shape of the curve is strongly modulated by the degree of homophily in the graph.

Existing studies of generalization in graph neural networks rely on complexity measures like the VC dimension or Rademacher complexity, but these result in vacuous bounds which do not explain the observed phenomena (21–23). Further, they only indirectly address the interaction between the graph and the features. This interaction, however, is of key importance: an Erdős–Rényi graph is not likely to be of much use in learning with a graph neural network. In reality, both the graph and the features contain information about the labels; learning should exploit the complementarity of these two views.

Instead of applying the “big hammers” of statistical learning theory, we adopt a statistical mechanics approach and study performance of simple graph convolution networks on the CSBM (24). We derive precise expressions for the learning curves which exhibit a rich phenomenology.

The two ways to think about generalization, statistical learning theory and statistical mechanics, have been contrasted already in the late 1980s and the early 1990s. Statistical mechanics of learning, developed at that time by Gardner, Opper, Sejnowski, Sompolinsky, Tishby, Vallet, Watkin, and many others—an excellent account is given in the review paper by Watkin et al. (25)—must make more assumptions about the data and the space of admissible functions, but it gives results that are more precise and more readily applied to the practice of machine learning.

These dichotomies have been revisited recently in the context of deep learning and highly overparameterized models by Martin and Mahoney (26), in reaction to Zhang et al.’s thought-provoking “Understanding deep learning requires rethinking generalization” (27), which shows, among other things, that modern deep neural networks easily fit completely random labels. Martin and Mahoney explain that such seemingly surprising new behaviors can be effectively understood within the statistical mechanics paradigm by identifying the right order parameters and related phase diagrams. We explore these connections further in Section 4 (Discussion).

Outline.

We begin by describing the motivational experimental findings in Section 1. We identify the key trends to explain, such as the dependence of double descent generalization on the level of noise in features and graphs. In Section 2, we introduce our analytical model: a simple GCN on the contextual stochastic block model. Section 3 then explores the implications of some of the analytical findings about self-loops and heterophily on the design of state-of-the-art GCNs. We follow this by a discussion of our results in the context of related work in Section 4. In Section 5, we explain the analogies between GCNs and spin glasses which allow us to apply analysis methods from statistical physics and random matrix theory. We follow with a few concluding comments in Section 6.

1. Motivation: Empirical Results

Given an N-vertex graph G = (V, E) with an adjacency matrix A ∈ {0, 1}^{N×N} and features X ∈ R^{N×F}, a node classification GNN is a function (A, X) ↦ h(w; A, X) insensitive to vertex ordering: for any node permutation matrix Π, h(w; ΠAΠ^⊤, ΠX) = Π h(w; A, X). We are interested in the behavior of the train and test risk,

R_N(S) = (1/|S|) ∑_{i∈S} ℓ(y_i, h_i(w; A, X)), [1]

with S ∈ {V_train, V_test} and ℓ(·, ·) a loss metric such as the mean-squared error (MSE) or the cross-entropy. The optimal network parameters ŵ are obtained by minimizing the regularized loss

L_N(w) = (1/|V_train|) ∑_{i∈V_train} ℓ(y_i, h_i(w; A, X)) + r_N(w), [2]

where r_N(w) is a regularizer.
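The transductive risk and loss definitions in Eqs. 1 and 2 amount to masking the loss to a node subset. A minimal numpy sketch (function names and the callable-regularizer interface are ours, not the paper's):

```python
import numpy as np

def masked_risk(y, h, idx):
    """Empirical risk R_N(S) of Eq. 1 with the MSE loss, over the subset S = idx."""
    return float(np.mean((y[idx] - h[idx]) ** 2))

def regularized_loss(y, h, train_idx, w, r_N):
    """Training objective L_N(w) of Eq. 2: masked MSE plus a regularizer r_N(w)."""
    return masked_risk(y, h, train_idx) + r_N(w)
```

The same `masked_risk` evaluated on V_train and V_test gives the train and test risks of Eq. 1.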

A. Is Double Descent Absent in GNNs?

We start by investigating the lack of reports of double descent in transductive learning on graphs. Double descent is a scaling of test risk with model complexity which is rather different from the textbook bias–variance tradeoff (16, 28). Up to the interpolation point, where the model has sufficient capacity to fit the training data without error, things behave as usual, with the test risk first decreasing together with the bias and then increasing with the variance due to overfitting. But increasing complexity beyond the interpolation point—into an overparameterized region characteristic for modern deep learning—may make the test risk decrease again.

This generalization behavior has been identified already in the 90s by applying analytical tools from statistical mechanics to problems of machine learning; see for example figure 10 in the paper by Watkin et al. (25) or figure 1 in Opper et al. (29) which show the generalization ability of the so-called pseudoinverse algorithm to train a Boolean linear classifier (30). It is implicit in work on phase diagrams of generalization akin to those for magnetism or the Sherrington–Kirkpatrick model (31, 32).

While these works are indeed the first to observe double descent, its significance for modern machine learning has been recognized by a line of research starting with (33). Double descent has been observed in complex deep neural networks (14) and theoretically analyzed for a number of machine learning models (17, 25, 30, 34, 35). There are, however, scarcely any reports of double descent in graph neural networks. Oono and Suzuki (13) speculate that this may be due to implicit regularization in relation to the so-called oversmoothing (36).

B. Generalization in Supervised vs. Transductive Learning.

When illustrating double descent, the test error is usually plotted against model complexity. For this to make sense, the amount of training data must be fixed, so the complexity on the abscissa is really relative complexity; denoting the size of the dataset (number of nodes) by N and the number of parameters by F, we let this relative complexity be α := F/N. An alternative is to plot the risk against γ = α^{-1}: starting from a small amount of data (small γ), we first go through a regime in which increasing the amount of training data leads to worse performance. In our context this can be interpreted as varying the size of the graph while keeping the number of features fixed.

In transductive node classification, we always observe the entire graph A and the features X associated with all vertices, but only a subset of M labels. It is then more natural to vary τ := M/N than α^{-1}, with M being the number of observed labels. Although the resulting curves are slightly different, they both exhibit double descent; in the terminology of Martin and Mahoney, both τ and α^{-1} may be called load-like parameters (26); see also ref. 37. In particular, they both have the interpolation peak at τ = α, i.e., M = F, when the system matrix becomes square and poorly conditioned. The key aspect of double descent is that the generalization error decreases on both sides of the interpolation peak.

Using τ instead of α^{-1} is convenient for several reasons: In real datasets, the number of input features is fixed; we cannot vary it. Further, there is no unique way to increase the number of parameters in a GNN, and different GNNs are parameterized differently, which complicates comparisons. Varying depth may lead to confounding effects such as oversmoothing, which acts as implicit regularization. Varying τ is a straightforward and clean way to compare different architectures in analogous settings. We can, however, easily vary α = γ^{-1} in our analytic model described in Section 2; we show the related results in Fig. 5.

C. Experimental Observation of Double Descent in GNNs.

Armed with this understanding, we design an experiment as follows: We study the homophilic citation graph Cora (38) and the heterophilic graphs of Wikipedia pages Chameleon (39) and university web pages Texas (11). We apply different graph convolution networks with different losses, with and without dropout regularization.

Results are shown in Fig. 1. Importantly, we plot both the test error (red) and the test accuracy (black) in node classification against a range of training label ratios τ. In the first column, we use a one-layer GCN similar to the one we analyze theoretically in Section 2, but with added degree normalization, self-loops, and multiple classes; in the second column, we use a two-layer GCN; in the third column we add dropout; in the fourth, we use the cross-entropy loss instead of the MSE. This last model is used in the pytorch-geometric node classification tutorial.

Fig. 1.

Double descent generalization for different GNNs, different losses, with and without explicit regularization, on datasets with varying levels of noise. We plot both the test error (red) and the test accuracy (black) against the training label ratio τ on the abscissa, on a logarithmic scale. First column: one linear layer trained with the MSE loss; second column: a two-layer GCN with ReLU activations and MSE loss; third column: a two-layer GCN with ReLU activations, dropout, and MSE loss; fourth column: a two-layer GCN with ReLU activations, dropout, and cross-entropy loss. Each experimental data point is averaged over 10 random train–test splits; the shaded area represents the SD. The Right ordinate axis shows classification accuracy; we suppress the Left-axis ticks due to different numerical ranges. We observe that double descent is ubiquitous across datasets and architectures when varying the ratio of training labels: there often exists a regime where more labels impair generalization.

First, with a one-layer network, one can clearly observe transductive double descent on Cora in both the test risk and accuracy. The situation is markedly different on the heterophilic Texas, which contains only 183 nodes but 1,703 features per node, yielding a relative model complexity α = F/N much higher than for the other datasets. Here the test accuracy decreases near-monotonically, consistent with our theoretical analysis in Section 2 (cf. Fig. 4D). In this setting, strong regularization improves performance.

Fig. 4.

Four typical generalization curves in the CSBM. The solid lines represent theoretical results for the test risk (black) and accuracy (red) computed via Eq. 17. We also plot the mean and variance of the test output h_i(ŵ) for i ∈ V_test. This illustrates how the mean–variance tradeoff leads to different double descent curves. Note that we only display results for nodes with label y_i = 1; the result for the y_i = −1 class simply has the opposite mean and identical variance. (A) Monotonic ACC (increasing) and R_test (decreasing) when the regularization r is large; (B) a typical double descent with small regularization r; (C) slight double descent with relative model complexity α close to 1; (D) (near-monotonically) decreasing ACC and increasing R_test with large relative model complexity α = 1/γ. The parameters are chosen as (A) μ=1, λ=2, γ=5, r=2; (B) μ=1, λ=2, γ=5, r=0.1; (C) μ=1, λ=2, γ=1.2, r=0.05; (D) μ=5, λ=1, γ=0.1, r=0.005. The solid circles and vertical bars represent the mean and SD of risk and accuracy from the experiments. Each experimental data point is averaged over 10 independent trials; the SD is indicated by vertical bars. We use N = 5,000 and d = 30 for A, B, and C, and N = 500 and d = 20 for (D). In all cases, we use the symmetric binary adjacency matrix ensemble A^bs.

With a two-layer network the double descent still “survives” in the test error on Cora, but the accuracy is almost monotonically increasing except on Texas. These results corroborate the intuition that dropout and nonlinearity alleviate GNN overfitting on node classification, especially for large training label ratios.

We then explore the role of noise in the graph and in the features by manually adding noise to Cora. We randomly remove 30% of the links and add the same number of random links, and randomize 30% of the entries in X; results are shown in the fourth row of Fig. 1. The double descent in test error appears even with substantial regularization. Comparing the first and the fourth row affirms that double descent is more prominent with noisy data; this is again consistent with our analysis (Section 3). In the last row, we apply the networks to the synthetic CSBM. Observing the same qualitative behavior also in this case lends credence to the choice of CSBM for our precise analysis in Section 2.

In Fig. 2, we further focus on the strongly heterophilic Chameleon which does not clearly show double descent in Fig. 1. We randomly perturb different percentages of edges and in addition to GCNs also use the considerably more powerful FSGNN (40), which achieves current state-of-the-art results on Chameleon. Again, we see that double descent (a nonmonotonic risk curve) emerges at higher noise (weaker heterophily). It is noteworthy that more expressive architectures do seem to mitigate double descent; conversely, a one-layer GCN exhibits double descent even without additional noise. We analytically characterize this phenomenon in Section 2 and illustrate it in Fig. 3. Beyond GCNs, we show that double descent occurs in more sophisticated GNNs like graph attention networks (41), GraphSAGE (9) and Chebyshev graph networks (7); see SI Appendix, 5 for details.

Fig. 2.

Test error for different training label ratios for different GCNs on the Chameleon (heterophilic) dataset. (A) FSGNN (40); (B) two-layer GCN with ReLU activations and cross-entropy loss; (C) one-layer GCN with cross-entropy loss; (D) one-layer GCN with MSE loss. We interpolate between the original dataset shown in blue (0% noise) and an Erdős–Rényi random graph shown in red (100% noise) by adding noise in increments of 20%. Noise is introduced by first randomly removing a given proportion of edges and then adding the same number of new random edges. The node features are kept the same. Each data point is averaged over ten trials, and the abscissa is on a logarithmic scale. We see that graph noise accentuates double descent, which is consistent with our theoretical results (Fig. 3B). Similarly, better GNNs attenuate the effect where additional labels hurt generalization.

Fig. 3.

Theoretical results computed by the replica method (solid lines) versus experimental results (solid circles) on the CSBM, with P(A) = A, for varying training label ratios τ. (A) Training and test risks with λ = μ = 1, γ = 5, and r = 0. (For τ < 0.2, we use the pseudoinverse in Eq. 11 in the numerics and r = 10^{-5} for the theoretical curves.) We further study the impact of varying λ in (B) and r in (C). We set r=0.02, γ=2, μ=1 in (B) and λ=3, μ=1, γ=2 in (C). In all experiments we set N = 5,000 and d = 30. We work with the symmetric binary adjacency matrix ensemble A^bs. Each experimental data point is averaged over 10 independent trials; the SD is shown by vertical bars. The theoretical curves agree perfectly with the experiments and also qualitatively capture the phenomena we observed on real data in Section 1.

In summary, transductive double descent occurs in a variety of graph neural networks applied to real-world data, with noise and implicit or explicit regularization being the key determinants of the shape of generalization curves. Understanding the behavior of generalization error as a function of the number of training labels is of great practical value given the difficulty of obtaining labels in many domains. For some datasets like Texas, using too many labels seems detrimental for some architectures.

2. A Precise Analysis of Node Classification on CSBM with a Simple Graph Convolution Network

Motivated by the above discussions, we turn to a theoretical study of the performance of GCNs on random community graphs where we can understand the influence of all the involved parameters. We have seen in Section 1 that the generalization behavior in this setting qualitatively matches generalization on real data.

Graph convolution networks are composed of graph convolution filters and nonlinear activations. Removing the activations results in a so-called simple GCN (42) or a spectral GNN (43, 44). For a graph G = (V, E) with adjacency matrix A and node features X, the model reads

h(w; A, X) = P(A) X w, where P(A) = ∑_{k=0}^{K} c_k A^k, [3]

where w ∈ R^F are the trainable parameters and K is the filter support size in terms of hops on the graph. We treat the neighborhood weights c_k at different hops as hyperparameters. We let A^0 := I_N so that the model Eq. 3 reduces to ordinary linear regression when K = 0.
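The filter in Eq. 3 can be applied without ever forming matrix powers, by iterating A^k X = A (A^{k-1} X). A minimal sketch (function names are ours):

```python
import numpy as np

def poly_filter(A, X, c):
    """Compute P(A) X for P(A) = sum_k c[k] A^k (Eq. 3), with A^0 = I_N."""
    Z = X.copy()            # Z holds A^k X, starting at k = 0
    out = c[0] * Z
    for ck in c[1:]:
        Z = A @ Z           # advance one hop: Z becomes A^k X
        out = out + ck * Z
    return out

def simple_gcn(w, A, X, c):
    """Simple (activation-free) GCN of Eq. 3: h(w; A, X) = P(A) X w."""
    return poly_filter(A, X, c) @ w
```

With c = (1,) the model indeed collapses to ordinary linear regression h = Xw, as noted above.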

In standard feed-forward networks, removing the activations results in a linear end-to-end mapping. Surprisingly, GCNs without activations such as SGC (42) or with activations only in the output such as FSGNN (40) and GPRGNN (12) achieve state-of-the-art performance in many settings.§

We will derive test risk expressions for the above graph convolution network in two shallow cases: P(A)=A and P(A)=A+cI. We will also state a universality conjecture for general polynomial filters. Starting with this conjecture, we can in principle extend the results to all polynomial filters using routine but tedious computation. We provide an example for the training error of a two-hop network in SI Appendix, 3. As we will show, this analytic behavior closely resembles the motivational empirical findings from Section 1.

A. Training and Generalization.

We are interested in the large graph limit N → ∞ where the training label ratio |V_train|/N → τ. We fit the model parameters w by ridge regression, ŵ := argmin_w L_{A,X}(w), where

L_{A,X}(w) = (1/|V_train|) ∑_{i∈V_train} (y_i − h_i(w; A, X))² + (r/N) ‖w‖₂². [4]

We are interested in the training and test risk in the limit of large graphs,

R_train = lim_{N→∞} E[R_N(V_train)], R_test = lim_{N→∞} E[R_N(V_test)], [5]

as well as in the expected accuracy,

ACC = lim_{N→∞} E[(1/|V_test|) ∑_{i∈V_test} 1{y_i = sign(h_i(ŵ))}]. [6]

We will sometimes write R_train(𝒜), R_test(𝒜), ACC(𝒜) to emphasize that the matrix A in Eq. 3 follows a distribution 𝒜, A ∼ 𝒜. The expectations are over the random graph adjacency matrix A, the random features X, and the uniformly random test–train partition V = V_train ∪ V_test. Our analysis in fact shows that these quantities all concentrate around their means for large N (and M and F): in the language of statistical physics, they are self-averaging. This proportional asymptotics regime, where F, M, and N all grow large at constant ratios, is more challenging to analyze than regimes where the dataset size or model complexity is constant, but it reproduces phenomena we see with production-scale machine learning models on real data; see also refs. 26 and 34.

B. Contextual Stochastic Block Model.

We apply the GCN to the CSBM. CSBM adds node features to the stochastic block model (SBM)—a random community graph model (24) where the probability of a link between nodes depends on their communities. The lower triangular part of the adjacency matrix A^bs has distribution

P[A^bs_{ij} = 1] = c_in/N if i > j and y_i = y_j; c_out/N if i > j and y_i ≠ y_j. [7]

A convenient parameterization is

c_in = d + √d λ, c_out = d − √d λ,

where d is the average node degree and the sign of λ determines whether the graph is homophilic or heterophilic; |λ| can be regarded as the graph signal-to-noise ratio (SNR).
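Sampling from Eq. 7 with this parameterization is straightforward; a sketch (function name and interface are ours):

```python
import numpy as np

def sample_sbm(d, lam, y, rng):
    """Symmetric SBM adjacency A^bs per Eq. 7, with c_in = d + sqrt(d)*lam and
    c_out = d - sqrt(d)*lam; lam > 0 gives homophily, lam < 0 heterophily."""
    N = len(y)
    c_in, c_out = d + np.sqrt(d) * lam, d - np.sqrt(d) * lam
    P = np.where(np.equal.outer(y, y), c_in, c_out) / N   # edge probabilities
    upper = np.triu(rng.random((N, N)) < P, k=1)          # strict upper triangle
    return (upper + upper.T).astype(float)                # symmetrize; no self-loops
```

For a homophilic draw (λ > 0) the within-community edges dominate, and the mean degree concentrates around d for large N.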

We will also study a directed SBM (45, 46) with adjacency matrix A^bn distributed as

P[A^bn_{ij} = 1] = c_in/N if y_i = y_j; c_out/N if y_i ≠ y_j. [8]

Many real graphs have directed links, including chemical connections between neurons, the electric grid, followee–follower relations in social media, and Bayesian networks. In our case the directed SBM facilitates analysis with self-loops while exhibiting the same qualitative behavior and phenomenology as the undirected one.

The features of CSBM follow the spiked covariance model,

x_i = √(μ/N) y_i u + ξ_i, [9]

where u ∼ N(0, I_F/F) is the F-dimensional hidden feature and ξ_i ∼ N(0, I_F/F) are i.i.d. Gaussian noise vectors; the parameter μ is the feature SNR. We work in the proportional scaling regime where N/F → γ, with γ being the inverse relative model complexity, and ascribe feature vectors to the rows of the data matrix X,

X = [x_1, …, x_N]^⊤ = √(μ/N) y u^⊤ + Ξ. [10]

We assume throughout that the two communities are balanced; without loss of generality we let y_i = 1 for i = 1, 2, …, N/2 and y_i = −1 for i > N/2.
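The spiked-covariance features of Eqs. 9 and 10 can be drawn in a few lines; a sketch assuming our reading of the √(μ/N) scaling (function name is ours):

```python
import numpy as np

def sample_csbm_features(F, mu, y, rng):
    """Spiked-covariance features of Eqs. 9-10:
    x_i = sqrt(mu/N) * y_i * u + xi_i, with u, xi_i ~ N(0, I_F / F)."""
    N = len(y)
    u = rng.standard_normal(F) / np.sqrt(F)          # hidden feature direction
    Xi = rng.standard_normal((N, F)) / np.sqrt(F)    # i.i.d. Gaussian noise rows
    return np.sqrt(mu / N) * np.outer(y, u) + Xi
```

The class-difference statistic y^⊤X aligns with the spike u when μ > 0 and is pure noise when μ = 0, which is what makes μ a feature SNR.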

We will show that CSBM is a comparatively tractable statistical model to characterize generalization in GNNs. Intuitively, when N, the risk should concentrate around a number that depends on five parameters:

λ: degree of homophily (graph SNR), μ: feature SNR, α: relative model complexity (= γ^{-1}), τ: label ratio, r: ridge regularization parameter.

We emphasize that we study the challenging weak-signal regime where λ, μ, and γ do not scale with N (but F does). This stands in contrast to recent machine learning work on CSBM (47, 48), which studies the low-noise regime where μ or λ² scales with N, or even the noiseless regime where the classes become linearly separable after applying a graph filter or a GCN. We argue that the weak-signal regime is closer to real graph learning problems, which are neither too easy (as in linearly separable) nor too hard (as with a vanishing signal). The fact that we identify phenomena which occur in state-of-the-art networks and real datasets supports this claim.

We outline our analysis in Section 5 and provide the details in SI Appendix. But first, in the following section, we show that the derived expressions precisely characterize generalization of shallow GCNs on CSBM and also give a correct qualitative description of the behavior of “big” graph neural networks on complex datasets, pointing to interesting phenomena and interpretations.

3. Phenomenology of Generalization in GCNs

We focus on the behavior of the test risk under various levels of graph homophily, emphasizing two main aspects: i) different levels of homophily lead to different types of double descent; ii) self-loops, standard in GCNs, create an imbalance between heterophilic and homophilic datasets; negative self-loops improve the handling of heterophilic datasets.

A. Double Descent in Shallow GCNs on CSBM.

As we show in Section 5 and SI Appendix, 1, the test risk for unregularized regression (r = 0) with a shallow GCN can be obtained in closed form as

R_test = γτ(γ + μ) / [(γτ − 1)(γ + λ²(μ + 1) + μ)]

when γτ > 1. It is evident that the denominator vanishes as γτ approaches 1. When this happens, the system matrix I_train P(A)X, where I_train selects the rows for which we have labels (see Section 5, Eq. 12), is square and near-singular for large N, which leads to the explosion of R_test (Fig. 3A). When the relative model complexity is high, i.e., γ = N/F < 1, the product τγ is always less than 1. In such cases, no interpolation peak appears, which is consistent with our experimental results for the Texas dataset where γ = 0.11; cf. Fig. 4D and the third row of Fig. 1.

At the other extreme, for strongly regularized training (large r) the double descent disappears (Fig. 3C); it has been shown that this happens at optimal regularization (35, 49). The absolute risk values in Fig. 3 B and C show the same behavior.

Fig. 3B shows that when the graph is very noisy (λ is small) the test error starts to increase as soon as the training label ratio τ increases from 0. When λ is large, meaning that the graph is discriminative, the test error first decreases and then increases. Similar behavior can be observed when varying the feature SNR μ instead of λ. Double descent also appears in test accuracy (Fig. 4).

While these curves all illustrate double descent in the sense that they all have an interpolation peak on both sides of which the error decreases, they are qualitatively different. The emergence of these different shapes can be explained by looking at the distribution of the predicted label h_i(ŵ). As we show in SI Appendix, 1, h_i(ŵ) is normally distributed with mean and variance given by the solutions of a saddle point equation outlined in Section 5. The test accuracy can thus be expressed through the error function (cf. Eq. 17).
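Concretely, if h_i(ŵ) ∼ N(m, v) for a test node with y_i = +1, the probability of a correct sign is the Gaussian CDF Φ(m/√v); this standard identity (our notation, a stand-in for Eq. 17) is where the error-function expression comes from:

```python
import math

def class_accuracy(m, v):
    """P(sign(h) = +1) for h ~ N(m, v): Phi(m / sqrt(v)), via the error function."""
    return 0.5 * (1.0 + math.erf(m / math.sqrt(2.0 * v)))
```

As the variance v diverges while the mean stays bounded, this probability tends to 1/2, the random-guess limit.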

As we increase the number of labels, the mean E[h_i(ŵ)] approaches y_i monotonically. However, the variance Var[h_i(ŵ)] behaves differently for different model complexities α = 1/γ and regularizations r, resulting in distinct double descent curves.

For example, when r → 0 and τ → 1/γ, the variance of h_i(w*) for i ∈ V_test diverges and the accuracy approaches 50%, a random guess. On the other hand, when r is large, the variance is small and double descent is mild or absent, as shown in Fig. 4A. Fig. 4B shows a typical double descent curve with two regimes where additional labels hurt generalization. In Fig. 4C we also see a mild double descent when the relative model complexity is close to 1; this is consistent with experimental observations on Cora in Fig. 1. In certain extremal cases, for example when γ is very small, the test accuracy decreases continuously after a very small ascent around τ = 0 (Fig. 4D); this is consistent with our experimental observations for the Texas dataset.
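The link between the Gaussian statistics of h_i(w*) and the accuracy can be checked directly: if a prediction is N(m, p) and the true label is +1, the probability of a correct sign is (1/2)(1 + erf(m/√(2p))), which Monte Carlo sampling reproduces. A small self-contained check (the values of m and p below are arbitrary placeholders, not solutions of the saddle point equations):

```python
import math
import numpy as np

rng = np.random.default_rng(1)

m, p = 0.8, 0.5          # placeholder mean and variance of h_i(w*)
y = 1.0                  # true label of node i

# Accuracy predicted by the error function (the structure of Eq. 17).
acc_theory = 0.5 * (1.0 + math.erf(m / math.sqrt(2.0 * p)))

# Monte Carlo estimate: fraction of samples whose sign matches y.
h = rng.normal(m, math.sqrt(p), size=1_000_000)
acc_mc = np.mean(np.sign(h) == y)

print(f"theory {acc_theory:.4f}  monte-carlo {acc_mc:.4f}")
```

Divergence of the variance p drives the argument of erf to zero, i.e., the accuracy to 50%, exactly as at the interpolation peak.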

B. Double Descent as a Function of the Relative Model Complexity.

As mentioned earlier, the theoretical model makes it easy to study double descent as we vary the model complexity α = 1/γ rather than τ; this is closer to the traditional reports of double descent in supervised learning. The resulting plots follow a similar logic: as shown in Fig. 5, adding randomness to the graph (low |λ|) makes the double descent more prominent. Conversely, for a highly homophilic graph (large λ), the test risk decreases monotonically as the relative model complexity α grows.

C. Heterophily, Homophily, and Positive and Negative Self-Loops.

GCNs often perform worse on heterophilic than on homophilic graphs. An active line of research tries to understand and mitigate this phenomenon with special architectures and training strategies (12, 50, 51). We now show that it can be understood through the lens of self-loops.

Strong GCNs ubiquitously employ self-loops of the form P(A)=A+IN on homophilic graphs (8, 12, 41, 42, 52). Self-loops, however, deteriorate performance on heterophilic networks. CSBM is well suited to study this phenomenon since λ allows us to transition between homophilic and heterophilic graphs.

We allow the self-loop strength c to vary continuously so that the effective adjacency matrix becomes A+cIN. Importantly, we also allow c to be negative (SI Appendix, 2). In Fig. 6 we plot the test risk as a function of c for both positive and negative c. We find that a negative self-loop (c<0) results in much better performance on heterophilic data (λ<0). We sketch a signal-processing interpretation of this phenomenon in SI Appendix, 4.
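The intuition can be reproduced with a toy computation on a spiked Gaussian surrogate of a heterophilic adjacency matrix (λ < 0): the graph-propagated signal then has the opposite sign from the raw feature signal, so a positive self-loop partially cancels it while a negative one reinforces it. A sketch under these assumptions (single scalar feature, parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
N, lam = 2000, -2.0                       # heterophilic graph: lambda < 0

y = np.sign(rng.standard_normal(N))       # +/-1 community labels
x = y + rng.standard_normal(N)            # one noisy feature per node

# Spiked Gaussian surrogate for the (scaled) adjacency matrix.
A = (lam / N) * np.outer(y, y) + rng.standard_normal((N, N)) / np.sqrt(N)

def label_corr(c: float) -> float:
    """|correlation| between the propagated feature (A + c I) x and the labels y."""
    h = A @ x + c * x
    return abs(np.corrcoef(h, y)[0, 1])

print(f"c=+1: {label_corr(+1.0):.3f}  c=-1: {label_corr(-1.0):.3f}")
```

With λ < 0, the c = −1 propagation is markedly better aligned with the labels than c = +1, mirroring the behavior in Fig. 6.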

Fig. 6.

Fig. 6.

Train and test risks on CSBM for different intensities of self-loops. (A) Train and test risk for τ = 0.8 and λ = −1 (heterophilic). (B) Test risks for γ = 0.8, τ = 0.8, μ = 0 under different λ. (C) Training risk for different μ when τ = λ = 1. Each data point is averaged over 10 independent trials with N = 5,000, r = 0, and d = 30. We use the nonsymmetric binary adjacency matrix ensemble A_bn. The solid lines are the theoretical results predicted by the replica method. In (B) we see that optimal generalization requires adapting the self-loop intensity c to the degree of homophily.

D. Negative Self-Loops in State-of-the-Art GCNs.

It is remarkable that this finding generalizes to complex state-of-the-art graph neural networks and datasets. We experiment with two common heterophilic benchmarks, Chameleon and Squirrel, first with a two-layer ReLU GCN. The default GCN (for example in pytorch-geometric) contains self-loops of the form A + I; we immediately observe in Fig. 7 that removing them improves performance on both datasets. We then make the intensity of the self-loop adjustable as a hyperparameter and find that a negative self-loop with c between −1.0 and −0.5 results in the highest accuracy on both datasets. It is notable that the best performance of the two-layer ReLU GCN with c = −0.5 (76.29%) is already close to the state-of-the-art results of the Feature Selection Graph Neural Network (FSGNN) (40) (78.27%). FSGNN uses a graph filter bank B = {A^k, (A + I)^k} with careful normalization. Taking a cue from the above findings, we show that simply adding negative self-loop filters (A − 0.5·I)^k to FSGNN results in better performance (78.96%) than the previous state of the art; see also Table 1.
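The filter-bank modification can be sketched in a few lines: given an adjacency matrix, build powers of A, of A + I, and, following the idea above, of A − 0.5·I, and feed their products with X to the downstream model. This is an illustrative construction only; FSGNN's actual normalization and feature-selection machinery are more involved (40).

```python
import numpy as np

def filter_bank(A: np.ndarray, X: np.ndarray, K: int = 2) -> list[np.ndarray]:
    """Propagated features P^k X for P in {A, A + I, A - 0.5 I}, k = 1..K."""
    N = A.shape[0]
    filters = [A, A + np.eye(N), A - 0.5 * np.eye(N)]  # negative self-loop added
    feats = []
    for P in filters:
        PX = X
        for _ in range(K):
            PX = P @ PX      # one more propagation step with this filter
            feats.append(PX)
    return feats

# Tiny usage example with random placeholder data.
rng = np.random.default_rng(3)
A = (rng.random((10, 10)) < 0.3).astype(float)
X = rng.standard_normal((10, 4))
feats = filter_bank(A, X)
print(len(feats), feats[0].shape)
```

Each element of `feats` is one candidate representation; a downstream linear layer (or FSGNN's soft feature selection) then weights them.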

Fig. 7.

Fig. 7.

Test accuracy (black) and test error (red) in node classification with GCNs on real heterophilic graphs with different self-loop intensities. We implement a two-layer ReLU GCN with 128 hidden neurons and an additional self-loop with strength c. Each setting is averaged over different training–test splits taken from ref. 11 (60% training, 20% validation, 20% test). The relatively large SD (vertical bars) is mainly due to the randomness of the splits; the randomness from model initialization and training is comparatively small. The optimal test accuracy for these two datasets is obtained for self-loop intensity −1 < c < −0.5.

Table 1.

Comparison of test accuracy when the negative self-loop is absent (first and third columns) or present (second and fourth columns)

            GCN (c = 0)    GCN (c < 0)    FSGNN          FSGNN (c < 0)
Chameleon   75.81 ± 1.69   76.29 ± 1.22   78.27 ± 1.28   78.96 ± 1.05
Squirrel    67.19 ± 1.48   68.62 ± 2.13   74.10 ± 1.89   74.34 ± 1.21

The datasets and splits are the same as Fig. 7.

4. Discussion

Before delving into the details of the analytical methods in Section 5 and conceptual connections between GNNs and spin glasses, we discuss the various interpretations of our results in the context of related work.

A. Related Work on Theory of GNNs.

Most theoretical work on GNNs addresses their expressivity (53, 54). A key result is that the common message-passing formulation is limited by the power of the Weisfeiler–Lehman graph isomorphism test (55). This is of great relevance for computational chemistry, where one must discriminate between different molecular structures (56), but it does not explain how the interaction between the graph structure and the node features leads to generalization. Indeed, simple architectures like GCNs are far from universal approximators, yet they often achieve excellent performance on real problems with real data.

Existing studies of generalization in GNNs leverage complexity measures such as the Vapnik–Chervonenkis dimension (57–59) or the Rademacher complexity (21). While the resulting bounds sometimes predict coarse qualitative behavior, a precise characterization of relevant phenomena remains elusive. Even more refined techniques like PAC-Bayes perform only marginally better (22). It is striking that only in rare cases do these bounds explicitly incorporate the interaction between the graph and the features (23). Our results show that understanding this interaction is crucial to understanding learning on graphs.

Indeed, recall that a standard practice in the design of GNNs is to build (generalized) filters from the adjacency matrix or the graph Laplacian and then use these filters to process data. But if the underlying graph is an Erdős–Rényi random graph, the induced filters will be of little use in learning. The key is thus to understand how much useful information the graph provides about the labels (and vice-versa), and in what way that information is complementary to that contained in the features.

B. A Statistical Mechanics Approach: Precise Analysis of Simple Models.

An alternative to the typically vacuous# complexity-based risk bounds for graph neural networks (2123) is to adopt a statistical mechanics perspective on learning; this is what we do here. Indeed, one key aspect of learning algorithms that is not easily captured by complexity measures of statistical learning theory is the emergence of qualitatively distinct phases of learning as one varies certain key “order parameters”. Such phase diagrams emerge naturally when one views machine learning models in terms of statistical mechanics of learning (26, 37).

Martin and Mahoney (26) demonstrate this elegantly by formulating what they call a very simple deep learning model, and showing that it displays distinct learning phases reminiscent of many realistic, complex models, despite abstracting away all but the essential “load-like” and “temperature-like” parameters. They argue that such parameters can be identified in machine learning models across the board.

The statistical mechanics paradigm requires one to commit to a specific model and do different calculations for different models (25), but it results in sharp characterizations of relevant phases of learning.

Important results within this paradigm, both rigorous and heuristic, were derived over the last decade for regularized least-squares (60–62), random-feature regression (17, 34, 49, 63), and noisy Gaussian mixture and spiked covariance models (64–66), using a variety of analytical techniques from statistical physics, high-dimensional probability, and random matrix theory. Not all of these works explicitly declare adherence to the statistical mechanics tradition. It nonetheless seems appropriate to categorize them thus since they provide precise analyses of learning in specific models in terms of a few order parameters.

Even though these papers study comparatively simple models, many key results only appeared in the last couple of years, motivated by the proliferation of over-parameterized models and advances in analytical techniques. One should make sure to work in the correct scaling of the various parameters (34); while this may complicate the analysis, it leads to results which match the behavior of realistic machine learning systems. We extend these recent results by allowing the information to propagate on a graph; this gives rise to interesting phenomena of direct relevance to practitioners. In order to obtain precise results we similarly study simple graph networks, but we also show that the salient predictions closely match the behavior of state-of-the-art networks on real datasets. We precisely trace the connection between generalization, the interaction type (homophilic or heterophilic), and the parameters of the GCN architecture and the dataset for a specific graph model. Experiments show that the learned lessons apply to a broad class of GNNs and can be used constructively to improve the performance of state-of-the-art graph neural networks on heterophilic data.

Finally, let us mention that phenomenological characterizations of phase diagrams of risk are not the only way to apply tools from statistical mechanics and more broadly physics to machine learning and neural networks. These tools may help address a rather different set of “design” questions, as reviewed by Bahri et al. (67).

C. Negative Self-Loops in Other Graph Learning Models.

Recent theoretical work (19, 20) shows that optimal message passing in heterophilic datasets requires aggregating neighbor messages with a sign opposite from that of node-specific updates. Similarly, in earlier GCN architectures such as GraphSAGE (9), node and neighbor features are extracted using different trainable functions, which immediately allows aggregating neighbors with an opposite sign in heterophilic settings. We show that self-loops with sign and strength depending on the degree of heterophily improve performance both in theory and in real state-of-the-art GCNs. The notion of self-loops in the context of GCNs usually indicates an explicit connection between a node and itself, A ← A + I.

D. GCNs with a Few Labels Outperform Optimal Unsupervised Detection.

One interpretation of our results is that they quantify the value of labels in community detection, traditionally approached with unsupervised methods. These approaches are subject to fundamental information-theoretic detection limits which have drawn considerable attention over the last decade (64, 68, 69). The most challenging and most realistic high-dimensional setting is when the signal strength is comparable to that of the noise for both the graph and the features (24, 68, 70). The results of Deshpande et al. indicate that when μ²/γ + λ² < 1, no unsupervised estimator can detect the latent structure y from A and X (24). Our analysis shows that even a small fraction of revealed labels allows a simple GCN to break this unsupervised barrier.

In Fig. 8, we compare the accuracy of a one-layer GCN with unsupervised belief propagation (BP) (24). We first run BP with μ=λ=γ=1 and record the achieved accuracy. We then plot the smallest training label ratio τ for which the GCN achieves the same accuracy. We repeat this procedure for different feature SNRs μ and graph SNRs λ. The black solid line indicates the information-theoretic threshold for detecting the latent structure from A and X.

Fig. 8.

Fig. 8.

Training label ratio when a one-layer GCN matches the performance of unsupervised belief propagation at μ=λ=γ=1. The black solid line denotes the information-theoretic detection threshold in the unsupervised setting where no label information is available (i.e., when we use only A, X). If given a small number of labels, a simple, generally sub-optimal estimator matches the performance of the optimal unsupervised estimator.

Earlier analyses of belief propagation in the SBM without features uncover a detectability phase transition (71). Our analysis shows that no such transition happens with GCNs. Indeed, our primary interest is in understanding GCNs, which are a general tool for a variety of problems, but unlike belief propagation, GCNs need not be near-optimal for community detection. For the optimal inference strategy, the phase transition may not be destroyed by revealing labels.

5. Generalization in GCNs via Statistical Physics

The optimization problem Eq. 4 has a unique minimizer as long as r>0. Since it is a linear least-squares problem in w, we can write down a closed-form solution,

w* = (r I_F + (P(A)X)^T I_train P(A)X)^(−1) (P(A)X)^T I_train y, [11]

where

(I_train)_ij = 1 if i = j ∈ V_train, and 0 otherwise. [12]

Analyzing generalization is, in principle, as simple as substituting the closed-form expression Eq. 11 into Eq. 5 and Eq. 6 and calculating the requisite averages. The procedure is, however, complicated by the interaction between the graph A and the features X, and by the fact that A is a random binary adjacency matrix. Further, for a symmetric A, I_train P(A) is correlated with I_test P(A) even in a shallow GCN (and certainly in a deep one).
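The closed-form minimizer in Eq. 11 is straightforward to evaluate numerically. A minimal sketch with random placeholder data and P(A) = A (the matrices here are generic stand-ins, not the CSBM ensemble):

```python
import numpy as np

rng = np.random.default_rng(4)
N, F, r = 100, 40, 0.1

A = rng.standard_normal((N, N)) / np.sqrt(N)   # placeholder P(A)
X = rng.standard_normal((N, F)) / np.sqrt(F)   # placeholder features
y = np.sign(rng.standard_normal(N))            # +/-1 labels

train = rng.random(N) < 0.6                    # membership mask of V_train
I_train = np.diag(train.astype(float))         # selection matrix of Eq. 12

M = A @ X                                      # the system matrix P(A) X
w_star = np.linalg.solve(r * np.eye(F) + M.T @ I_train @ M,
                         M.T @ I_train @ y)    # Eq. 11

# Sanity check: w_star satisfies the ridge normal equations.
residual = (r * np.eye(F) + M.T @ I_train @ M) @ w_star - M.T @ I_train @ y
print(np.max(np.abs(residual)))
```

Predictions are then h = M @ w_star, with train and test risks read off the rows selected by the mask and its complement.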

A. The Statistical Physics Program.

We interpret the (scaled) loss function as an energy, or a Hamiltonian, H(w; A, X) = τN · L_{A,X}(w). Corresponding to this Hamiltonian is the Gibbs measure over the weights w,

dP_β(w; A, X) = exp(−β H(w; A, X)) dw / Z_β(A, X), where Z_β(A, X) = ∫ dw exp(−β H(w; A, X)),

Here β is the inverse temperature and Z_β is the partition function. At infinite temperature (β → 0), the Gibbs measure is diffuse; as the temperature approaches zero (β → ∞), it converges to an atomic measure concentrated on the unique solution of Eq. 4, w* = lim_{β→∞} ∫ w dP_β(w; A, X). In this latter case the partition function is similarly dominated by the minimum of the Hamiltonian. The expected loss can thus be computed from the free energy density f_β,

E_{A,X}[L_{A,X}(w*)] = (1/τ) lim_{β→∞} f_β, where f_β := −lim_{N→∞} (1/(Nβ)) E_{A,X} ln Z_β(A, X).
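The zero-temperature concentration of the Gibbs measure is easy to visualize in one dimension. The quartic Hamiltonian below is a stand-in chosen purely for illustration; its Gibbs mean, computed by numerical quadrature, approaches the global minimizer as β grows:

```python
import numpy as np

def H(w):
    """Toy tilted double-well Hamiltonian with a unique global minimum near w = -1."""
    return (w**2 - 1.0)**2 + 0.3 * w

w = np.linspace(-3.0, 3.0, 20001)
w_min = w[np.argmin(H(w))]                # grid argmin of the Hamiltonian

def gibbs_mean(beta: float) -> float:
    """<w> under dP_beta(w) proportional to exp(-beta * H(w)) dw, by quadrature."""
    logp = -beta * H(w)
    p = np.exp(logp - logp.max())         # numerically stabilized density
    return float((w * p).sum() / p.sum())

for beta in (1.0, 10.0, 100.0):
    print(f"beta={beta:>6}: <w> = {gibbs_mean(beta):+.3f} (argmin {w_min:+.3f})")
```

At small β both wells contribute to the mean; at large β the measure collapses onto the global minimizer, which is the mechanism by which the free energy encodes the minimum of the loss.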

Since the quenched average E ln Z_β is usually intractable, we apply the replica method (72), which allows us to take the expectation inside the logarithm and compute the annealed average,

E_{A,X} ln Z_β(A, X) = lim_{n→0} (1/n) ln E_{A,X} [Z_β^n(A, X)].

The gist of the replica method is to compute E_{A,X} Z_β^n for integer n and then “pretend” that n is real and take the limit n → 0. The computation for integer n is facilitated by the fact that Z_β^n normalizes the joint distribution of n independent copies (replicas) of w, {w^a}_{a=1}^n. We obtain

E_{A,X} Z_β^n(A, X) = E_{A,X}[Z_β(A, X) × ⋯ × Z_β(A, X)]
= ∫ ∏_{a=1}^{n} dw^a E_{A,X} exp(−β Σ_{a=1}^{n} ‖I_train A X w^a − I_train y‖₂²) × exp(−βτr Σ_{a=1}^{n} ‖w^a‖₂²). [13]

Instead of working with the product AX, the replica method allows us to express the free energy density as a stationary point of a function in which the dependence on A and X is separated (see SI Appendix, 1 for details),

f_β = −(1/β) extr lim_{n→0} lim_{N→∞} (1/nN) [E_A c_P(A) + E_X e(X) + D(m, p, q, m̂, p̂, q̂)]
= −(1/β) extr_{m,p,q,m̂,p̂,q̂} [C(m, p, q) + E(m̂, p̂, q̂) + D(m, p, q, m̂, p̂, q̂)], [14]

where we defined C := (1/nN) E_A c_P(A) and E := (1/nN) E_X e(X), which in the limit N → ∞, n → 0 depend only on the so-called order parameters m, p, q and m̂, p̂, q̂. The separation thus allows us to study the influence of the distribution of A in isolation; we provide the details in SI Appendix, 1. The risks (called observables in physics) can be obtained from f_β.

B. Gaussian Adjacency Equivalence.

A challenge in computing the quantities in Eq. 13 and Eq. 14 is averaging over the binary adjacency matrix A. We argue that f in Eq. 14 does not change if we instead average over a Gaussian ensemble with a correctly chosen mean and covariance. For a one-layer GCN (P(A) = A), we show that replacing E_{A_bs} c_P(A_bs) by E_{A_gn} c_P(A_gn) does not change f in Eq. 14, with A_gn a spiked nonsymmetric Gaussian random matrix,

A_gn = (λ/N) y y^T + Ξ_gn, [15]

with Ξ_gn having i.i.d. centered normal entries with variance 1/N. This substitution is inspired by universality results for the disorder of spin glasses (73–75) and the universality of mutual information in CSBM (24). Deshpande et al. (24) showed that the binary adjacency matrix in the stochastic block model can be replaced by

A_gs = (λ/N) y y^T + Ξ_gs, [16]

where Ξ_gs ∈ R^(N×N) is a sample from the standard Gaussian orthogonal ensemble, without affecting the mutual information between y (which they modeled as random) and (A, X) when N → ∞ and d → ∞.

Our claim refers to certain averages involving A; we record it as a conjecture since our derivations are based on the nonrigorous replica method. We first define four probability distributions:

  • A_bs: the distribution of adjacency matrices in the undirected CSBM (cf. Eq. 7), scaled by 1/√d, i.e., (1/√d) A_bs → A_bs;

  • A_bn: the distribution of adjacency matrices in the directed CSBM (cf. Eq. 8), scaled by 1/√d;

  • Ags: the distribution of spiked Gaussian orthogonal ensemble (cf. Eq. 16);

  • Agn: the distribution of spiked Gaussian random matrices (cf. Eq. 15).

With these definitions in hand we can state

Conjecture 1.

Assume that d scales with N so that 1/d → 0 and d/N → 0 when N → ∞. Let P(A) be a polynomial in A used to define the GCN function in Eq. 3. It then holds that

R_train(A_b★) = R_train(A_g★), R_test(A_b★) = R_test(A_g★), ACC(A_b★) = ACC(A_g★),

with ★ ∈ {s, n}. When P(A) = A, the above quantities for the symmetric and nonsymmetric distributions also coincide.

In the case P(A) = A we justify Conjecture 1 by the replica method (SI Appendix, 1). In the general case we provide abundant numerical evidence in Fig. 9. We first consider the case P(A) = A. Fig. 9 A and D show estimates of R_train and R_test averaged over 100 independent runs. The SD over independent runs is indicated by the shading. We see that the means converge and the variance shrinks as N grows.

Fig. 9.

Fig. 9.

Numerical validation of Conjecture 1. In (A and D) we show training and test risks with different numbers of nodes for P(A) = A. The parameters are set to γ = λ = μ = 2, r = 0.01, τ = 0.8, and d = N/2. In (B and E) we show the absolute difference of the risks between binary and Gaussian adjacency as a function of N, using the same data as in (A and D). The solid lines correspond to a linear fit on the logarithmic scale, which shows that the error scales as |Δ| ∝ N^(−1/2). In (C and F) we show the training and test risks when P(A) = A² under different average node degrees d. The other parameters are set to λ = μ = 1, γ = 2, N = 2,000, τ = 0.8, and r = 0.01. In these settings, the conjecture empirically holds up to scrutiny.

We also show the absolute differences between the averages of R_train and R_test in Fig. 9 B and E. We find that the differences are well fitted by a linear relationship on the logarithmic scale, suggesting that they decay to zero as a power law, |Δ| ∝ N^(−1/2), as N → ∞. We next consider P(A) = A². In Fig. 9 C and F we see that for intermediate values of d, the R_train and R_test corresponding to A_bs and A_bn are both close to those corresponding to A_gs and A_gn. This is consistent with the results shown in Figs. 3 and 6, where the theoretical results computed by the replica method and A_gn perfectly match the numerical results with A_bn (for P(A) = A + cI_N) and A_bs (for P(A) = A), further validating the conjecture.

C. Solution to the Saddle Point Equation.

We can now solve the saddle point Eq. 14 by averaging over A_gn. In the general case the solution is easy to obtain numerically. For a one-layer GCN with P(A) = A we can compute a closed-form solution. Denoting the critical point in Eq. 14 by (m*, p*, q*), we obtain

R_train = [(λm* − 1)² + p*] / (2q* + 1)², R_test = (λm* − 1)² + p*, ACC = (1/2)[1 + erf(λm* / √(2p*))], [17]

where erf is the usual error function. While the general expressions are complicated (SI Appendix, 1), in the ridgeless limit r → 0 we can compute simple closed-form expressions for the train and test risks,

R_train = (γ + μ)(γτ − 1) / [γτ(γ + λ²(μ + 1) + μ)], R_test = γτ(γ + μ) / [(γτ − 1)(γ + λ²(μ + 1) + μ)], [18]

assuming that τγ>1.
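The ridgeless risks are simple enough to tabulate directly. The snippet below encodes the closed-form expressions of Eq. 18 and reproduces two features discussed in the text: the training risk vanishes at the interpolation point γτ = 1, while the test risk blows up as γτ → 1 from above (the parameter values are arbitrary):

```python
def risks(gamma: float, tau: float, lam: float, mu: float):
    """Ridgeless train/test risks of a one-layer GCN on CSBM (Eq. 18), gamma*tau > 1."""
    denom = gamma + lam**2 * (mu + 1.0) + mu
    r_train = (gamma + mu) * (gamma * tau - 1.0) / (gamma * tau * denom)
    r_test = gamma * tau * (gamma + mu) / ((gamma * tau - 1.0) * denom)
    return r_train, r_test

# Approaching the interpolation point gamma * tau = 1 from above:
for tau in (0.9, 0.6, 0.51):
    r_tr, r_te = risks(2.0, tau, 1.0, 1.0)
    print(f"tau={tau}: R_train={r_tr:.3f}  R_test={r_te:.2f}")
```

Increasing λ or μ enlarges the common denominator, lowering both risks, which is the analytic counterpart of Fig. 3 B and C.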

D. A Rigorous Solution.

We note that for a one-layer GCN the risks can be computed rigorously using random matrix theory, provided that Conjecture 1 holds and we begin with a Gaussian “adjacency matrix” instead of the true binary SBM adjacency matrix. We outline this approach in SI Appendix, 3; in particular, for r = 0, the result coincides with that in Eq. 18.

6. Conclusion

We analyzed generalization in graph neural networks by making an analogy with a system of interacting particles: particles correspond to the data points and the interactions are specified by the adjacency relation and the learnable weights. The latter can be interpreted as defining the “interaction physics” of the problem. The best weights correspond to the most plausible interaction physics, coupled in turn with the network formation mechanism.

The setting that we analyzed is perhaps the simplest combination of a graph convolution network and data distribution which exhibits interesting, realistic behavior. To theoretically capture a broader spectrum of complexity in graph learning, we will need new ideas in random matrix theory and its neural network counterparts (76). While very deep GCNs are known to suffer from oversmoothing, there exists an interesting intermediate-depth regime beyond a single layer (77). Our techniques should apply simply by replacing A with a polynomial P(A) before solving the saddle point equation, but we will need a generalization of existing random matrix theory results for HCIZ integrals. Finally, it is likely that these generalized results could be made fully rigorous if the “universality” in Conjecture 1 could be established formally.

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

We thank the anonymous reviewers for suggestions on how to improve presentation. Cheng Shi and Liming Pan would like to thank Zhenyu Liao (HUST) and Ming Li (ZJNU) for valuable discussions about random matrix theory. Cheng Shi and Ivan Dokmanić were supported by the European Research Council Starting Grant 852821—SWING. Liming Pan would like to acknowledge support from National Natural Science Foundation of China under Grant No. 62006122 and 42230406.

Author contributions

C.S., L.P., H.H., and I.D. designed research and contributed new analytic tools; C.S. and I.D. performed research, analyzed data and wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

*More precisely, transductive.

It may be interesting to note that papers by the physicists from the 90s put the amount of data on the abscissa (25, 29).

https://pytorch-geometric.readthedocs.io/en/latest/notes/colabs.html.

§GCNs without activations are sometimes called “linear” in analogy with feed-forward networks, but that terminology is misleading. In graph learning, both A and X are bona fide parts of the input and a function which depends on their multiplication is a nonlinear function. What is more, in many applications A is constructed deterministically from a dataset X, for example as a neighborhood graph, resulting in even stronger nonlinearity.

One way to characterize link semantics in graphs is by notions of homophily and heterophily. In a friendship graph links signify similarity: If Alice and Bob both know Jane it is reasonable to expect that Alice and Bob also know each other. In a protein interaction graph, if proteins A and B interact, a small mutation A’ of A will likely still interact with B but not with A. Thus “interaction” links signify partition. Most graphs are somewhere in between the homophilic and heterophilic extremes.

#We quote the authors of the PAC-Bayesian analysis of generalization in GNNs (22): “[...] we are far from being able to explain the practical behaviors of GNNs.”

Contributor Information

Liming Pan, Email: panlm99@gmail.com.

Ivan Dokmanić, Email: ivan.dokmanic@unibas.ch.

Data, Materials, and Software Availability

Previously published data were used for this work (11, 38, 39).

Supporting Information

References

  • 1.R. Lam et al., GraphCast: Learning skillful medium-range global weather forecasting. arXiv [Preprint] (2022). http://arxiv.org/abs/2212.12794 (Accessed 4 August 2023). [DOI] [PubMed]
  • 2.Mandal R., Casert C., Sollich P., Robust prediction of force chains in jammed solids using graph neural networks. Nat. Commun. 13, 4424 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Ingraham J., Garg V., Barzilay R., Jaakkola T., “Generative models for graph-based protein design” in Advances in Neural Information Processing Systems, H. Wallach et al., Eds. (Curran Associates, Inc., New York, NY, 2019), vol. 32.
  • 4.Gligorijević V., et al. , Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Jumper J., et al. , Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.J. E. Bruna, W. Zaremba, A. Szlam, Y. LeCun, “Spectral networks and deep locally connected networks on graphs” in International Conference on Learning Representations (OpenReview.net, 2014).
  • 7.Defferrard M., Bresson X., Vandergheynst P., “Convolutional neural networks on graphs with fast localized spectral filtering” in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett, Eds. (Curran Associates, Inc., New York, NY, 2016), vol. 29.
  • 8.T. N. Kipf, M. Welling, “Semi-supervised classification with graph convolutional networks” in International Conference on Learning Representations (OpenReview.net, 2017).
  • 9.Hamilton W., Ying Z., Leskovec J., “Inductive representation learning on large graphs” in Advances in Neural Information Processing Systems, I. Guyon et al., Eds. (Curran Associates, Inc., New York, NY, 2017), vol. 30.
  • 10.Zhu J., et al. , Beyond homophily in graph neural networks: Current limitations and effective designs. Adv. Neural Inf. Process. Syst. 33, 7793–7804 (2020). [Google Scholar]
  • 11.H. Pei, B. Wei, K. C. C. Chang, Y. Lei, B. Yang, “Geom-GCN: Geometric graph convolutional networks” in International Conference on Learning Representations (OpenReview.net, 2020).
  • 12.E. Chien, J. Peng, P. Li, O. Milenkovic, “Adaptive universal generalized PageRank graph neural network” in International Conference on Learning Representations (OpenReview.net, 2021).
  • 13.K. Oono, T. Suzuki, “Graph neural networks exponentially lose expressive power for node classification” in International Conference on Learning Representations (OpenReview.net, 2020).
  • 14.Nakkiran P., et al. , Deep double descent: Where bigger models and more data hurt. J. Stat. Mech.: Theory Exp. 2021, 124003 (2021). [Google Scholar]
  • 15.Liu Y. Y., Slotine J. J., Barabási A. L., Observability of complex systems. Proc. Natl. Acad. Sci. U.S.A. 110, 2460–2465 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Chen L., Min Y., Belkin M., Karbasi A., Multiple descent: Design your own generalization curve. Adv. Neural Inf. Process. Syst. 34, 8898–8912 (2021). [Google Scholar]
  • 17.Belkin M., Hsu D., Xu J., Two models of double descent for weak features. SIAM J. Math. Data Sci. 2, 1167–1180 (2020). [Google Scholar]
  • 18.McPherson M., Smith-Lovin L., Cook J. M., Birds of a feather: Homophily in social networks. Annu. Rev. Sociol. 27, 415–444 (2001). [Google Scholar]
  • 19.R. Wei, H. Yin, J. Jia, A. R. Benson, P. Li, Understanding non-linearity in graph neural networks from the Bayesian-inference perspective. arXiv [Preprint] (2022). http://arxiv.org/abs/2207.11311.
  • 20.A. Baranwal, A. Jagannath, K. Fountoulakis, Optimality of message-passing architectures for sparse graphs. arXiv [Preprint] (2023). http://arxiv.org/abs/2305.10391 (Accessed 17 May 2023).
  • 21.V. Garg, S. Jegelka, T. Jaakkola, “Generalization and representational limits of graph neural networks” in International Conference on Machine Learning, H. Daumé III, A. Singh, Eds. (PMLR, 2020), pp. 3419–3430.
  • 22.R. Liao, R. Urtasun, R. Zemel, “A PAC-Bayesian approach to generalization bounds for graph neural networks” in International Conference on Learning Representations (OpenReview.net, 2021).
  • 23.Esser P., Vankadara L. C., Ghoshdastidar D., Learning theory can (sometimes) explain generalisation in graph neural networks. Adv. Neural Inf. Process. Syst. 34, 27043–27056 (2021). [Google Scholar]
  • 24.Deshpande Y., Sen S., Montanari A., Mossel E., “Contextual stochastic block models” in Advances in Neural Information Processing Systems, S. Bengio et al., Eds. (Curran Associates, Inc., New York, NY, 2018), vol. 31.
  • 25.Watkin T. L., Rau A., Biehl M., The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499 (1993). [Google Scholar]
  • 26.C. H. Martin, M. W. Mahoney, Rethinking generalization requires revisiting old ideas: Statistical mechanics approaches and complex learning behavior. arXiv [Preprint] (2017). http://arxiv.org/abs/1710.09553 (Accessed 17 February 2019).
  • 27.C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, “Understanding deep learning requires rethinking generalization” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings (OpenReview.net, 2017).
  • 28.Hastie T., Tibshirani R., Friedman J. H., Friedman J. H., The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009), vol. 2. [Google Scholar]
  • 29.Opper M., Kinzel W., Kleinz J., Nehl R., On the ability of the optimal perceptron to generalise. J. Phys. A: Math. General 23, L581 (1990). [Google Scholar]
  • 30.Engel A., Van den Broeck C., Statistical Mechanics of Learning (Cambridge University Press, 2001). [Google Scholar]
  • 31.Seung H. S., Sompolinsky H., Tishby N., Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056 (1992). [DOI] [PubMed] [Google Scholar]
  • 32.Opper M., Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonvenkis dimension. Phys. Rev. Lett. 72, 2113 (1994). [DOI] [PubMed] [Google Scholar]
  • 33.Belkin M., Hsu D., Ma S., Mandal S., Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. U.S.A. 116, 15849–15854 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Liao Z., Couillet R., Mahoney M. W., A random matrix analysis of random Fourier features: Beyond the gaussian kernel, a precise phase transition, and the corresponding double descent. Adv. Neural Inf. Process. Syst. 33, 13939–13950 (2020). [Google Scholar]
  • 35.Canatar A., Bordelon B., Pehlevan C., Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nat. Commun. 12, 2914 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Q. Li, Z. Han, X. M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning” in Proceedings of the AAAI Conference on Artificial Intelligence (2018), vol. 32.
  • 37.Yang Y., et al., Taxonomizing local versus global structure in neural network loss landscapes. Adv. Neural Inf. Process. Syst. 34, 18722–18733 (2021).
  • 38.Sen P., et al., Collective classification in network data. AI Magaz. 29, 93 (2008).
  • 39.Rozemberczki B., Allen C., Sarkar R., Multi-scale attributed node embedding. J. Compl. Netw. 9, cnab014 (2021).
  • 40.S. K. Maurya, X. Liu, T. Murata, Improving graph neural networks with simple architecture design. arXiv [Preprint] (2021). http://arxiv.org/abs/2105.07634 (Accessed 17 May 2021).
  • 41.P. Veličković et al., “Graph attention networks” in International Conference on Learning Representations (OpenReview.net, 2018).
  • 42.F. Wu et al., “Simplifying graph convolutional networks” in International Conference on Machine Learning, K. Chaudhuri, R. Salakhutdinov, Eds. (PMLR, 2019), pp. 6861–6871.
  • 43.X. Wang, M. Zhang, “How powerful are spectral graph neural networks” in International Conference on Machine Learning, K. Chaudhuri et al., Eds. (PMLR, 2022), pp. 23341–23362.
  • 44.He M., et al., BernNet: Learning arbitrary graph spectral filters via Bernstein approximation. Adv. Neural Inf. Process. Syst. 34, 14239–14251 (2021).
  • 45.Wang Y. J., Wong G. Y., Stochastic blockmodels for directed graphs. J. Am. Stat. Assoc. 82, 8–19 (1987).
  • 46.Malliaros F. D., Vazirgiannis M., Clustering and community detection in directed networks: A survey. Phys. Rep. 533, 95–142 (2013).
  • 47.W. Lu, “Learning guarantees for graph convolutional networks on the stochastic block model” in International Conference on Learning Representations (OpenReview.net, 2021).
  • 48.A. Baranwal, K. Fountoulakis, A. Jagannath, “Graph convolution for semi-supervised classification: Improved linear separability and out-of-distribution generalization” in International Conference on Machine Learning, M. Meila, T. Zhang, Eds. (PMLR, 2021), pp. 684–693.
  • 49.Mei S., Montanari A., The generalization error of random features regression: Precise asymptotics and the double descent curve. Commun. Pure Appl. Math. 75, 667–766 (2022).
  • 50.X. Li et al., “Finding global homophily in graph neural networks when meeting heterophily” in International Conference on Machine Learning (PMLR, 2022), pp. 13242–13256.
  • 51.Luan S., et al., Revisiting heterophily for graph neural networks. Adv. Neural Inf. Process. Syst. 35, 1362–1375 (2022).
  • 52.J. Gasteiger, A. Bojchevski, S. Günnemann, Predict then propagate: Graph neural networks meet personalized pagerank. arXiv [Preprint] (2018). http://arxiv.org/abs/1810.05997.
  • 53.R. Sato, A survey on the expressive power of graph neural networks. arXiv [Preprint] (2020). http://arxiv.org/abs/2003.04078 (Accessed 16 October 2020).
  • 54.F. Geerts, J. L. Reutter, “Expressiveness and approximation properties of graph neural networks” in International Conference on Learning Representations (OpenReview.net, 2022).
  • 55.K. Xu, W. Hu, J. Leskovec, S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations (OpenReview.net, 2019).
  • 56.J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, “Neural message passing for quantum chemistry” in International Conference on Machine Learning, D. Precup, Y. W. Teh, Eds. (PMLR, 2017), pp. 1263–1272.
  • 57.Vapnik V., Chervonenkis A. Y., On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971).
  • 58.Vapnik V., The Nature of Statistical Learning Theory (Springer Science & Business Media, 1999).
  • 59.Scarselli F., Tsoi A. C., Hagenbuchner M., The Vapnik-Chervonenkis dimension of graph and recursive neural networks. Neural Netw. 108, 248–259 (2018).
  • 60.S. Oymak, C. Thrampoulidis, B. Hassibi, “The squared-error of generalized lasso: A precise analysis” in 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton) (IEEE, 2013), pp. 1002–1009.
  • 61.Thrampoulidis C., Abbasi E., Hassibi B., Precise error analysis of regularized M-estimators in high dimensions. IEEE Trans. Inf. Theory 64, 5592–5628 (2018).
  • 62.Boyd S., et al., Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122 (2011).
  • 63.Hu H., Lu Y. M., Universality laws for high-dimensional learning with random features. IEEE Trans. Inf. Theory 69, 1932–1964 (2022).
  • 64.A. El Alaoui, M. I. Jordan, “Detection limits in the high-dimensional spiked rectangular model” in Conference on Learning Theory, S. Bubeck, V. Perchet, P. Rigollet, Eds. (PMLR, 2018), pp. 410–438.
  • 65.Barbier J., Macris N., Rush C., All-or-nothing statistical and computational phase transitions in sparse spiked matrix estimation. Adv. Neural Inf. Process. Syst. 33, 14915–14926 (2020).
  • 66.F. Mignacco, F. Krzakala, Y. Lu, P. Urbani, L. Zdeborová, “The role of regularization in classification of high-dimensional noisy Gaussian mixture” in International Conference on Machine Learning, H. Daumé III, A. Singh, Eds. (PMLR, 2020), pp. 6874–6883.
  • 67.Bahri Y., et al., Statistical mechanics of deep learning. Annu. Rev. Condens. Matter Phys. 11, 501–528 (2020).
  • 68.Deshpande Y., Abbe E., Montanari A., Asymptotic mutual information for the balanced binary stochastic block model. Inf. Inference: A J. IMA 6, 125–170 (2017).
  • 69.Mossel E., Neeman J., Sly A., A proof of the block model threshold conjecture. Combinatorica 38, 665–708 (2018).
  • 70.O. Duranthon, L. Zdeborová, Optimal inference in contextual stochastic block models. arXiv [Preprint] (2023). http://arxiv.org/abs/2306.07948 (Accessed 6 July 2023).
  • 71.Zhang P., Moore C., Zdeborová L., Phase transitions in semisupervised clustering of sparse networks. Phys. Rev. E 90, 052802 (2014).
  • 72.Mézard M., Parisi G., Virasoro M. A., Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications (World Scientific Publishing Company, 1987), vol. 9.
  • 73.Talagrand M., Gaussian averages, Bernoulli averages, and Gibbs’ measures. Random Struct. Algorith. 21, 197–204 (2002).
  • 74.Carmona P., Hu Y., “Universality in Sherrington-Kirkpatrick’s spin glass model” in Annales de l’Institut Henri Poincaré (B) Probability and Statistics (Elsevier, 2006), vol. 42, pp. 215–222.
  • 75.Panchenko D., The Sherrington-Kirkpatrick Model (Springer Science & Business Media, 2013).
  • 76.J. Pennington, P. Worah, “Nonlinear random matrix theory for deep learning” in Advances in Neural Information Processing Systems, I. Guyon et al., Eds. (Curran Associates, Inc., New York, NY, 2017), vol. 30.
  • 77.N. Keriven, “Not too little, not too much: A theoretical analysis of graph (over)smoothing” in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, K. Cho, Eds. (Curran Associates, Inc., New York, NY, 2022).


Supplementary Materials

Appendix 01 (PDF)

Data Availability Statement

Previously published data were used for this work (11, 38, 39).

