2020 Jan 6;9(4):749–783. doi: 10.1093/imaiai/iaz031

Matchability of heterogeneous networks pairs

Vince Lyzinski 1, Daniel L Sussman 2
PMCID: PMC7737166  PMID: 33343893

Abstract

We consider the problem of graph matchability in non-identically distributed networks. In a general class of edge-independent networks, we demonstrate that graph matchability can be lost with high probability when matching the networks directly. We further demonstrate that under mild model assumptions, matchability is almost perfectly recovered by centering the networks using universal singular value thresholding before matching. These theoretical results are then demonstrated in both real and synthetic simulation settings. We also recover analogous core-matchability results in a very general core-junk network model, wherein some vertices do not correspond between the graph pair.

Keywords: graph matching, random graphs, singular value thresholding

1. Introduction and background

The graph matching problem (GMP) seeks to find an alignment between the vertex sets of two graphs that best preserves common structure across graphs. At its simplest, it can be formulated as follows: given two adjacency matrices $A$ and $B$ corresponding to $n$-vertex graphs, the GMP seeks to minimize $\|A - PBP^T\|_F$ over permutation matrices $P$; i.e., the GMP seeks a relabelling of the vertices of $B$ that minimizes the number of induced edge disagreements between $A$ and $PBP^T$. Variants and extensions of this problem have been extensively studied in the literature, with applications across areas as diverse as biology and neuroscience [10,18,30,58], computer vision [20,35,45], pattern recognition [33,48,62] and social network analysis [25,34,59], among others. For a survey of many of the recent applications and approaches to the GMP, see the sequence of survey papers [12,19,23]. While recent results [3] have whittled away at the complexity of the related graph isomorphism problem (determining whether a permutation matrix $P$ exists satisfying $A = PBP^T$), at its most general, where $A$ and $B$ are allowed to be weighted and directed, the GMP is known to be NP-hard. Indeed, in this case, the GMP is equivalent to the notoriously difficult quadratic assignment problem [7,8,36]. However, recent approaches that leverage efficient representation/learning methodologies (see, for example, [5,25,59]) have shown excellent empirical performance matching networks with up to millions of nodes.
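As a hedged illustration of the objective just described, the following minimal sketch evaluates the squared Frobenius objective for a candidate relabelling and finds a minimizer by exhaustive search on a toy graph. The helper names (`perm_matrix`, `gmp_objective`, `brute_force_match`) and the 5-vertex example are illustrative assumptions, not part of the paper, and exhaustive search is only feasible for very small graphs.

```python
import itertools
import numpy as np

def perm_matrix(sigma):
    """Permutation matrix P with P[i, sigma[i]] = 1, so (P B P^T)[i, j] = B[sigma[i], sigma[j]]."""
    n = len(sigma)
    P = np.zeros((n, n), dtype=int)
    P[np.arange(n), list(sigma)] = 1
    return P

def gmp_objective(A, B, sigma):
    """Squared Frobenius objective ||A - P B P^T||_F^2 for the relabelling sigma."""
    P = perm_matrix(sigma)
    return int(np.sum((A - P @ B @ P.T) ** 2))

def brute_force_match(A, B):
    """Exhaustive search over all n! relabellings (toy sizes only)."""
    n = A.shape[0]
    return min(itertools.permutations(range(n)),
               key=lambda s: gmp_objective(A, B, s))

# Toy example: B is a copy of A whose vertex labels were shuffled by pi.
A = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]:
    A[i, j] = A[j, i] = 1
pi = np.array([2, 0, 4, 1, 3])
B = np.zeros_like(A)
B[np.ix_(pi, pi)] = A          # B[pi[i], pi[j]] = A[i, j]

best = brute_force_match(A, B)
print("best relabelling:", best, "objective:", gmp_objective(A, B, best))
# For unweighted, undirected graphs the objective is twice the number of edge disagreements.
print("edge disagreements without relabelling:", gmp_objective(A, B, range(5)) // 2)
```

Because `B` is an exact relabelling of `A`, the search attains objective zero; if the graph has non-trivial automorphisms, the minimizer need not be `pi` itself.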

In addition to algorithmic advancements in graph matching, there has been a flurry of activity studying the closely related problem of graph matchability: given a latent alignment between the vertex sets of two graphs, can graph matching uncover this alignment in the presence of shuffled vertex labels? This problem arises in a variety of contexts, from network de-anonymization and privatization to multi-network hypothesis testing [37] to multimodality graph embedding methodologies [10]. Many existing results are concerned with recovering a latent alignment present across random graph models, where Inline graphic and Inline graphic have identical marginal distributions, and exciting advancements on the threshold of matchable vs. unmatchable graphs have been made across many random graph settings, including the homogeneous correlated Erdős–Rényi model (see, for example, [4,17,39,44]), the correlated stochastic blockmodel setting (see, for example, [37,43]), the Inline graphic-correlated heterogeneous Erdős–Rényi model (see, for example, [38,40]), and the correlated heterogeneous Erdős–Rényi model with varying edge correlations (see, for example, [40,49]). In the non-identically distributed model setting, the work in [13–15] provides theoretic phase transitions on matchability in the Inline graphicErdős–Rényi (p,q,Inline graphic) model (i.e., Inline graphicErdős–Rényi (n,p), Inline graphicErdős–Rényi (n,q) and the edge correlation across graphs is provided by the constant Inline graphic; see Definition 1.2).

The above results range from providing theoretic phase transitions on matchability [13,14,37] to providing nearly efficient methods for achieving matchability from an algorithmic perspective [4,15,17,21]. While they have served to establish a novel theoretical understanding of the matchability problem, in each case the transition from matchable to unmatchable graphs is defined in terms of decreasing across-graph correlation and within-graph sparsity. Importantly, it is not a function of fundamentally different probabilistic structures across the graphs to be matched. As we often witness in applications, the graph topologies can differ significantly even among vertices that correspond to the same entity across networks. Social networks offer a compelling example of this, where matching across different social network platforms requires the understanding that not all users will be behaving homogeneously across different network platforms [34]. Both theoretically (see Example 2.2) and practically (see, for example, [10]), this distributional heterogeneity can have a deleterious effect on graph matchability.

Herein, we propose one possible solution for ameliorating the effect of $A$ and $B$ not being identically distributed, namely a universal singular value thresholding (USVT) [9] centering preprocessing step. Working in a general correlated edge-independent random graph model (see Definition 1.2), we theoretically demonstrate that USVT centering asymptotically almost surely recovers the matchability for all but a vanishing fraction of the nodes (see Theorem 2.5). In addition, we recover an analogous result (see Theorem 2.9) in the setting in which only a fraction of the vertices possess a true correspondence across networks, generalizing and extending the results of [29,44,56].

This centering step is practically implementable on even very large networks and is demonstrated to have a significant positive impact on graph matchability in both real and synthetic data settings (see Section 3). While the results contained herein do not guarantee that any computationally efficient algorithm will be able to perfectly (or almost perfectly) align any given networks after USVT centering, they provide a theoretical foundation for subsequently studying algorithmic effectiveness. Indeed, they ensure that with high probability the optimal alignment according to the graph matching objective function is, essentially, the true latent vertex alignment, guaranteeing that subsequent optimization procedures are, at the least, seeking the right permutation.

1.1 Notation

The following notation will be used throughout the manuscript: for Inline graphic, we will let Inline graphic denote the hollow Inline graphic matrix with all Inline graphic’s on its off-diagonal, Inline graphic will denote the Inline graphic matrix of all Inline graphic’s and Inline graphic will denote the set Inline graphic. We will consider Inline graphic and Inline graphic interchangeably as adjacency matrices and as the corresponding graphs consisting of vertices and edges. For a set Inline graphic, we denote by Inline graphic the induced subgraph of Inline graphic on the vertices of Inline graphic.

For a matrix $M \in \mathbb{R}^{n \times n}$, the Frobenius norm of the matrix is defined as

$$\|M\|_F = \sqrt{\sum_{i,j} M_{i,j}^2},$$

and the operator norm of $M$ is denoted

$$\|M\| = \sigma_1(M),$$

where Inline graphic is the largest singular value of Inline graphic. We denote the Inline graphic of Inline graphic via

graphic file with name M46.gif

Below we will make use of the following trace form of the Frobenius norm: $\|M\|_F^2 = \operatorname{trace}(M^T M)$; see [28] for more on the Frobenius norm and its many uses. For matrices Inline graphic and Inline graphic, we define Inline graphic via

graphic file with name M51.gif

where Inline graphic is the Inline graphic matrix of all Inline graphic’s.

We will also make extensive use of modern asymptotic notation. To review, if Inline graphic and Inline graphic are non-negative functions of Inline graphic, then we write

graphic file with name M58.gif

1.2 Correlated heterogeneous Erdős–Rényi graphs

Formally, the GMP we will consider is defined as follows.

Definition 1.1

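Let $A, B \in \mathbb{R}^{n \times n}$ be the adjacency matrices of weighted, undirected graphs on $n$ vertices. The GMP is to find an element of

$$\operatorname*{argmin}_{P \in \Pi_n} \|A - PBP^T\|_F^2 = \operatorname*{argmax}_{P \in \Pi_n} \operatorname{trace}(APBP^T),$$ (1.1)

where $\Pi_n$ is the set of $n \times n$ permutation matrices.

Equation (1.1) follows here from

$$\|A - PBP^T\|_F^2 = \|A\|_F^2 + \|B\|_F^2 - 2\operatorname{trace}(APBP^T).$$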

We note here that, traditionally, the GMP formulated in Definition 1.1 is defined for unweighted graphs Inline graphic and Inline graphic. The extension we consider to weighted graphs is commonly used in the literature (see, for example, the work in [53]) and is useful for studying situations in which edges/vertices in the network have weight features attached to them. This added flexibility will be needed for subsequent theoretical developments and data applications.
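For readers who want the intermediate steps, the following short derivation expands a squared Frobenius objective of the form $\|A - PBP^T\|_F^2$ using the trace form of the Frobenius norm. It is a sketch under the assumption that the GMP objective takes this standard form, with $A$ and $B$ real adjacency matrices and $P$ a permutation matrix.

```latex
\begin{align*}
\|A - PBP^T\|_F^2
  &= \operatorname{trace}\big((A - PBP^T)^T (A - PBP^T)\big) \\
  &= \|A\|_F^2 + \|PBP^T\|_F^2 - 2\operatorname{trace}(A^T P B P^T) \\
  &= \|A\|_F^2 + \|B\|_F^2 - 2\operatorname{trace}(A^T P B P^T).
\end{align*}
```

The last step uses that relabelling vertices only rearranges the entries of $B$, so $\|PBP^T\|_F = \|B\|_F$. Since the first two terms do not depend on $P$, minimizing the left-hand side over permutations is equivalent to maximizing $\operatorname{trace}(A^T P B P^T)$, which equals $\operatorname{trace}(APBP^T)$ when $A$ is symmetric.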

In the presence of a latent vertex alignment, Inline graphic, between the vertices of Inline graphic and Inline graphic, we wish to understand the extent to which graph matching Inline graphic and Inline graphic will recover Inline graphic; i.e., if Inline graphic is the permutation matrix corresponding to Inline graphic, will Inline graphic? In order to study this problem from a probabilistic perspective, we introduce a bivariate random graph model with a natural vertex alignment across graphs: the bivariate, correlated, heterogeneous Erdős–Rényi random graph.

Definition 1.2

For Inline graphic symmetric matrices, we say Inline graphic are instantiations of the Inline graphic-correlated heterogeneous Erdős–Rényi random graph model with parameters Inline graphic (abbreviated as Inline graphic) if

  1. Inline graphicERInline graphic; i.e., Inline graphic is an independent edge random graph with no self-loops satisfying
    graphic file with name M84.gif
    for each Inline graphic;
  2. Inline graphicERInline graphic; i.e., Inline graphic is an independent edge random graph with no self-loops satisfying
    graphic file with name M89.gif
    for each Inline graphic;
  3. Edges across networks are collectively independent except that for each Inline graphic, the correlation between Inline graphic and Inline graphic is
    graphic file with name M94.gif
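The three conditions above can be sketched in code as follows. This is a minimal sampler under stated assumptions: the names `L1`, `L2` and `R` are hypothetical stand-ins for the two edge-probability matrices and the correlation matrix, and each edge pair is drawn from the unique bivariate Bernoulli law with the requested marginals and correlation, independently across vertex pairs.

```python
import numpy as np

def sample_correlated_er(L1, L2, R, rng):
    """Draw (A, B) with A_ij ~ Bern(L1_ij), B_ij ~ Bern(L2_ij), corr(A_ij, B_ij) = R_ij."""
    n = L1.shape[0]
    iu = np.triu_indices(n, k=1)          # one draw per unordered pair {i, j}, no self-loops
    p, q, rho = L1[iu], L2[iu], R[iu]
    # P(A_ij = 1, B_ij = 1) implied by the marginals and the correlation.
    p11 = p * q + rho * np.sqrt(p * (1 - p) * q * (1 - q))
    tol = 1e-12
    if np.any(p11 < np.maximum(0.0, p + q - 1) - tol) or np.any(p11 > np.minimum(p, q) + tol):
        raise ValueError("requested correlation is infeasible for these marginals")
    # One uniform per pair: [0, p11) -> (1,1); [p11, p) -> (1,0); [p, p+q-p11) -> (0,1).
    u = rng.random(p.size)
    a = u < p
    b = (u < p11) | ((u >= p) & (u < p + q - p11))
    A = np.zeros((n, n), dtype=int)
    B = np.zeros((n, n), dtype=int)
    A[iu], B[iu] = a, b
    return A + A.T, B + B.T               # symmetric, hollow adjacency matrices

rng = np.random.default_rng(0)
n = 300
L1 = np.full((n, n), 0.6)                 # hypothetical parameter values
L2 = np.full((n, n), 0.3)
R = np.full((n, n), 0.2)
A, B = sample_correlated_er(L1, L2, R, rng)
iu = np.triu_indices(n, k=1)
print(A[iu].mean(), B[iu].mean(), np.corrcoef(A[iu], B[iu])[0, 1])
```

The printed edge densities and empirical edge correlation should be close to the requested 0.6, 0.3 and 0.2; the sampler raises an error when the requested correlation cannot be realized for the given marginals.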

Before proceeding further, we will make a few remarks on the Inline graphic random graph model. In the homogeneous ERInline graphic model, network growth as Inline graphic is natural, and we can consider an asymptotic regime in which Inline graphic depends on Inline graphic. Here, we similarly consider Inline graphic and Inline graphic to be dependent on Inline graphic, but make no further assumptions on exactly how the dependence on Inline graphic manifests. This allows us to consider classical homogeneous Erdős–Rényi graphs, stochastic blockmodels [27], random dot product graphs (conditioned on the latent positions) [57], etc., as subfamilies of our random graph model.

In addition, by allowing Inline graphic and Inline graphic to differ, this model allows for a latent correspondence to exist in settings where the underlying topology and degree structure of the graphs to be matched differ significantly. This distributional heterogeneity is often observed in real data settings (see, for example, the connectomes being aligned in [10] and the social networks aligned in [34]), and we seek to understand the limitations of graph matching approaches when attempting to overcome this heterogeneity. Note also that when Inline graphic, there are restrictions on feasible correlations Inline graphic: indeed, if Inline graphicBernoulliInline graphic and Inline graphicBernoulliInline graphic are Inline graphic-correlated with Inline graphic, then the correlation must satisfy Inline graphic.
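To make the restriction on feasible correlations concrete, the sketch below computes the range of correlations attainable by a pair of Bernoulli variables with success probabilities `p` and `q`. It relies only on the standard Fréchet bounds on the joint probability of both variables equalling one; the function name `correlation_bounds` is an illustrative assumption.

```python
import math

def correlation_bounds(p, q):
    """Smallest and largest correlation between X ~ Bernoulli(p) and Y ~ Bernoulli(q).

    Uses max(0, p + q - 1) <= P(X = 1, Y = 1) <= min(p, q) together with
    corr(X, Y) = (P(X = 1, Y = 1) - p q) / sqrt(p (1 - p) q (1 - q)).
    """
    if not (0 < p < 1 and 0 < q < 1):
        raise ValueError("p and q must lie strictly between 0 and 1")
    scale = math.sqrt(p * (1 - p) * q * (1 - q))
    lower = (max(0.0, p + q - 1) - p * q) / scale
    upper = (min(p, q) - p * q) / scale
    return lower, upper

print(correlation_bounds(0.5, 0.5))   # identical marginals: the full range [-1, 1]
print(correlation_bounds(0.7, 0.3))   # differing marginals: the upper bound drops below 1
```

When the marginals differ, perfect positive correlation is impossible, which is why heterogeneous edge probabilities constrain the admissible correlation matrix.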

Lastly, this model naturally allows us to consider a partition of Inline graphic into core (Inline graphic) and junk (Inline graphic) vertices Inline graphic; core vertices are those that have a corresponding vertex (i.e., true match) across networks, while junk vertices do not. If we consider Inline graphic with Inline graphic of the form Inline graphic where Inline graphic, then it is reasonable to define Inline graphic and Inline graphic. For all Inline graphic it would then hold that Inline graphic for all Inline graphic, and Inline graphic and Inline graphic are independent random variables. A natural question to ask is when an optimal GM algorithm will correctly align the vertices in Inline graphic across networks. This problem was studied in the context of homogeneous ER networks with constant correlation in [29], and the results in Section 2.4 generalize and extend those in [29] to this more adaptable network model.

Remark 1.1

In what follows, Inline graphic and Inline graphic are not necessarily assumed to be hollow matrices, and we allow for non-zero entries on the diagonals of the Inline graphic. This is done to simplify eigen-decompositions in our proof methods. We do assume our graphs Inline graphic and Inline graphic are loop-free and have no self edges. As such, Inline graphic (resp., Inline graphic) are necessarily hollow and do not necessarily equal Inline graphic (resp., Inline graphic). We do have that for Inline graphic, Inline graphic (resp., Inline graphic), but this will not hold when Inline graphic if Inline graphic.

2. Graph matchability

In the Inline graphic setting, we seek to understand when a graph matching procedure could correctly align the vertices across networks; i.e., if Inline graphic where Inline graphic denotes the identity matrix. More generally, if Inline graphic is a dissimilarity (i.e., if it is a symmetric, non-negative function with Inline graphic for all Inline graphic; see, for example, [51]), when is it the case that

graphic file with name M151.gif

In this more general framework, we consider the following definition of graph matchability.

Definition 2.1

Let Inline graphic be a dissimilarity. We will say that Inline graphic are Inline graphic-matchable if

Definition 2.1

where Inline graphic denotes the identity matrix.

By considering an appropriate Inline graphic in Definition 2.1, we can fit the classical GMP in the formulation; indeed, the GMP of Definition 1.1 considers the dissimilarity defined via

graphic file with name M158.gif

In this paper, we will consider, more generally, dissimilarity functions of the form

graphic file with name M159.gif

for a suitably defined matrix-valued function

graphic file with name M160.gif

Special cases of interest in our present Inline graphic setting are

graphic file with name M162.gif (2.1)

 

graphic file with name M163.gif (2.2)

where Inline graphic (resp., Inline graphic) is a suitable estimate of Inline graphic (resp., Inline graphic) derived from Inline graphic (resp., Inline graphic). In addition to the notion of Inline graphic-matchability for a dissimilarity Inline graphic, we will also define the notion of oracle-matchability. We will say that Inline graphic and Inline graphic distributed Inline graphic are Inline graphic-matchable if

graphic file with name M176.gif

In the sequel, oracle-matchability will provide a useful theoretic bridge between Inline graphic-matchability and Inline graphic-matchability. Note that we will write Inline graphic-matchability and Inline graphic-matchability for the notions defined in Eqs. (2.1) and (2.2), respectively.

A natural question to ask is why we define the GMP in terms of Inline graphic and not in terms of a more general dissimilarity Inline graphic; indeed, alternate dissimilarities Inline graphic have been considered in the definition of the GMP in the graph matching literature (see, for example, [60,61]). Moreover, we consider the GMP objective function formulation in Definition 1.1 even though in numerous settings the optimal solution to this GMP may not be a given latent vertex alignment, and in this section, we will see instances of when Inline graphic are, with high probability, not Inline graphic-matchable. Our choice of Inline graphic in the GMP is motivated by two main factors. First, this is the classical definition of graph matching and ties our current work to a vast graph matching literature. Secondly, we seek to understand conditions for when the original Inline graphic-matchability fails, yet there is a suitable dissimilarity Inline graphic for which Inline graphic-matchability is achieved. As the formulation in Definition 1.1 is commonly used in practice, this could provide practical guidance for when vertex labels can be recovered via a different objective function viewpoint.

In recent work addressing the question of Inline graphic-matchability, results have been established for the Inline graphic, Inline graphic setting (see, for example, [4,17,39,44]), in the correlated stochastic blockmodel setting (see, for example, [37,43]), in the correlated heterogeneous Erdős–Rényi model (see, for example, [38,40]), and in the general Inline graphic and general Inline graphic setting (see, for example, [40,49]). In the non-identically distributed model setting, the work in [13–15] considers Inline graphic, Inline graphic and Inline graphic. In each setting, the results show that for sufficiently dense, sufficiently correlated graphs, Inline graphic-matchability is almost surely achieved. Converse results in [13,14,37,39] show that in the sufficiently sparse and/or weakly correlated setting, Inline graphic-matchability is a.s. lost (i.e., a.s. the solution to the GMP is not the latent alignment). The work in [13,14] deserves special mention, as the converse results therein are proven for general Inline graphic-matchability; i.e., for sufficiently sparse and/or weakly correlated networks, Inline graphic-matchability is a.s. lost for all dissimilarities Inline graphic.

In these examples, it is sparsity and/or weak dependence that is potentially thwarting the matching in each instance and not the heterogeneity of the model itself. As the next straightforward but illustrative example demonstrates, the degree and structural heterogeneity across networks allowed for in the Inline graphic model makes the question of Inline graphic-matchability a bit more nuanced.

Example 2.2

Consider the following correlated heterogeneous stochastic blockmodel example. Let Inline graphic be distinct, and define

Example 2.2

Let Inline graphic be the vertices in block 1 in Inline graphic and Inline graphic, and let Inline graphic be the vertices in block 2. Assuming Inline graphic and letting Inline graphic and Inline graphic, we consider Inline graphic of the form

Example 2.2

Unlike in the cases where the loss of Inline graphic-matchability is due to network sparsity and/or weak correlation, in this example the non-identically distributed nature of Inline graphic and Inline graphic can obfuscate the true alignment from a graph matching perspective. Indeed, for many choices of the parameters above, the optimal permutation for the GMP in Definition 1.1 will not be the latent correspondence, and permuting blocks 1 and 2 will, with high probability, yield a better GMP objective function.

To wit, let Inline graphic be any permutation such that Inline graphic (so that Inline graphic aligns block 1 in Inline graphic to block 2 in Inline graphic and vice versa) with corresponding permutation Inline graphic. The number of edges Inline graphic such that Inline graphic is bounded above by Inline graphic and

Example 2.2

Therefore,

Example 2.2

Combined, we see that the difference in the objective function for Inline graphic as compared to Inline graphic is

Example 2.2

Numerous choices of the parameters in this model (for example, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic) yield Inline graphic for a positive constant Inline graphic (in the example, Inline graphic). As Inline graphic is highly concentrated about its expectation (see Appendix A.1), with high probability Inline graphic, and

Example 2.2

i.e., Inline graphic and Inline graphic would not be Inline graphic-matchable.
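The mechanism in this example can be checked with a few lines of arithmetic. The sketch below compares the expected number of edge disagreements under the latent alignment with that under a block swap in a two-block model; the block probabilities `a`, `b`, `c`, the correlation `rho` and the block size `m` are hypothetical choices made for illustration, not the parameters of Example 2.2.

```python
import math

def disagree_prob(p, q, rho=0.0):
    """P(X != Y) for X ~ Bernoulli(p), Y ~ Bernoulli(q) with correlation rho."""
    p11 = p * q + rho * math.sqrt(p * (1 - p) * q * (1 - q))
    return p + q - 2 * p11

# Hypothetical two-block model with m vertices per block.
# Graph 1: block 1 has edge probability a, block 2 has b; graph 2 reverses the roles.
# Edges between the two blocks have probability c in both graphs.
m = 100
a, b, c = 0.8, 0.2, 0.5
rho = 0.1
within = m * (m - 1) // 2      # vertex pairs inside one block
cross = m * m                  # vertex pairs across the two blocks

# Latent (identity) alignment: every edge pair is correlated, but the marginals clash.
expected_identity = (within * disagree_prob(a, b, rho)
                     + within * disagree_prob(b, a, rho)
                     + cross * disagree_prob(c, c, rho))

# Block swap i -> i + m (mod 2m): the marginals agree, but the aligned edges are
# independent, except for the m cross pairs {i, i + m} that the swap maps to themselves.
expected_swap = (within * disagree_prob(a, a)
                 + within * disagree_prob(b, b)
                 + (cross - m) * disagree_prob(c, c)
                 + m * disagree_prob(c, c, rho))

print(expected_identity, expected_swap)
```

With these values the expected disagreement count is about 10915 under the latent alignment and about 8163 under the block swap, so the swap yields the better GMP objective in expectation even though it destroys the edge correlation.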

2.1 Centering to recover matching

In the previous example, we see that Inline graphic can effectively make Inline graphic and Inline graphic not Inline graphic-matchable. One way to recover the latent alignment in this heterogeneous setting is to transform the problem back into the homogeneous case: rather than matching Inline graphic and Inline graphic, we would match Inline graphic and Inline graphic, yielding once again Inline graphic. As the next theorem demonstrates, this is sufficient to a.s. recover Inline graphic-matchability under mild model conditions. Before stating the theorem, we first must define some additional notation. For Inline graphic and permutation Inline graphic, define the matrix Inline graphic via

graphic file with name M261.gif (2.3)

For each Inline graphic in Inline graphic, define

graphic file with name M264.gif (2.4)

Theorem 2.3

Let Inline graphic and consider Inline graphic and Inline graphic. For each Inline graphic, define

Theorem 2.3

If for all Inline graphic and all Inline graphic, we have

Theorem 2.3

then

Theorem 2.3

The proof of Theorem 2.3 relies on a now-standard application of McDiarmid’s inequality and is similar to the proofs of analogous matchability results in [37–39]; details of the proof can be found in Appendix A.1.
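The effect of oracle centering can be illustrated by simulation. The sketch below draws one correlated pair from a hypothetical two-block model whose block densities are reversed across the two graphs, then compares the matching objective at the latent alignment with that at a block swap, before and after subtracting the true edge-probability matrices. All parameter values and helper names are assumptions made for illustration.

```python
import numpy as np

def sample_pair(L1, L2, rho, rng):
    """Correlated Bernoulli edges with marginals L1, L2 and common correlation rho."""
    n = L1.shape[0]
    iu = np.triu_indices(n, k=1)
    p, q = L1[iu], L2[iu]
    p11 = p * q + rho * np.sqrt(p * (1 - p) * q * (1 - q))
    u = rng.random(p.size)
    A = np.zeros((n, n))
    B = np.zeros((n, n))
    A[iu] = u < p
    B[iu] = (u < p11) | ((u >= p) & (u < p + q - p11))
    return A + A.T, B + B.T

def objective(X, Y, sigma):
    """||X - P Y P^T||_F^2, where (P Y P^T)[i, j] = Y[sigma[i], sigma[j]]."""
    return float(np.sum((X - Y[np.ix_(sigma, sigma)]) ** 2))

rng = np.random.default_rng(1)
m = 100
n = 2 * m
block = np.repeat([0, 1], m)
K1 = np.array([[0.8, 0.5], [0.5, 0.2]])   # hypothetical block probabilities, graph 1
K2 = np.array([[0.2, 0.5], [0.5, 0.8]])   # block densities reversed in graph 2
L1 = K1[np.ix_(block, block)]
L2 = K2[np.ix_(block, block)]
np.fill_diagonal(L1, 0.0)                  # hollow, matching the loop-free graphs
np.fill_diagonal(L2, 0.0)
A, B = sample_pair(L1, L2, rho=0.1, rng=rng)

identity = np.arange(n)
swap = np.concatenate([np.arange(m, n), np.arange(m)])   # exchanges the two blocks

raw_id, raw_sw = objective(A, B, identity), objective(A, B, swap)
cen_id = objective(A - L1, B - L2, identity)
cen_sw = objective(A - L1, B - L2, swap)
print("uncentered: identity", raw_id, "swap", raw_sw)
print("centered:   identity", cen_id, "swap", cen_sw)
```

In this setup the block swap attains the smaller objective on the raw adjacency matrices, whereas after oracle centering the latent alignment attains the smaller objective, because only the edge correlation remains to distinguish the alignments.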

Remark 2.1

The growth condition on Inline graphic in Theorem 2.3, namely Inline graphic, attempts to capture the necessary degree to which the entry-wise covariance matrix Inline graphic needs to be asymmetric. If we define Inline graphic, then from Eqs. (A.4) and (A.3) we see that for Inline graphic,

Remark 2.1

and if Inline graphic then Inline graphic. Constraining Inline graphic globally and not entry-wise allows for more flexibility in applying the theorem to settings where some of the edges are very sparse or weakly correlated.

Consider the growth condition on Inline graphic, namely Inline graphic, in the Inline graphic, Inline graphic homogeneous ER setting (wlog, assume Inline graphic). In this setting, as Inline graphic, Inline graphic and Inline graphic are Inline graphic-matchable iff Inline graphic and Inline graphic are Inline graphic-matchable. In the sparse setting of [13, Theorem 1], Inline graphic-matchability is achieved with high probability when all of the following hold:

graphic file with name M296.gif (2.5)

 

graphic file with name M297.gif (2.6)

 

graphic file with name M298.gif (2.7)

 

graphic file with name M299.gif (2.8)

Note that these conditions cannot simultaneously hold when

graphic file with name M300.gif

Indeed, if Inline graphic, Eq. (2.6) implies Inline graphic, and hence Inline graphic. Therefore,

graphic file with name M304.gif

contradicting Eq. (2.7). In the Inline graphic setting, modulo the sparsity conditions, there is a Inline graphic-matchability phase transition at Inline graphic, and the corresponding rate achieved in Theorem 2.3, namely Inline graphic, is above this phase transition threshold. We view this as the price paid in the theorem for being able to handle both heterogeneous and homogeneous ER settings.

In this setting, the growth condition of Theorem 2.3, Inline graphic, can hold in the dense setting Inline graphic as well as when Inline graphic. In the dense setting of Inline graphic, Inline graphic-matchability transitions at Inline graphic [37,39], which our theorem recovers (asymptotically).

In [13], the authors establish a phase transition at Inline graphic, providing the corresponding converse result that ensures no Inline graphic-matchability if Inline graphic for all dissimilarities Inline graphic. We do not derive a corresponding converse result to Theorem 2.3 herein (namely, a condition on Inline graphic that ensures Inline graphic and Inline graphic are not Inline graphic-matchable for any suitable Inline graphic), as we are focused on how to practically recover Inline graphic-matchability in the non-identically distributed setting; see Theorem 2.5.

2.2 Approximate centering to almost recover matching

Unfortunately, centering Inline graphic and Inline graphic by the true edge probability matrices Inline graphic and Inline graphic is impractical, as these model parameters are unknown in practice. Our solution to this hurdle is to estimate the unknown Inline graphic and Inline graphic via USVT [9], and then approximately center the networks via these estimates.

Our method for estimating Inline graphic and Inline graphic is based on the USVT method of [9], and USVT applied in the present setting is outlined in Algorithm 1.

Algorithm 1 USVT for estimating Inline graphic

  Input: Adjacency matrix Inline graphic, threshold Inline graphic;

  1. Let Inline graphic be the singular value decomposition of Inline graphic, with singular values ordered via Inline graphic;

  2. Let Inline graphic be the set of singular values greater than the threshold Inline graphic;

  3. Define Inline graphic;

  Output: Inline graphic defined via Inline graphic for all Inline graphic, and for Inline graphic
    graphic file with name M346.gif
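The steps of Algorithm 1 can be sketched as follows. This is a minimal version under stated assumptions: clipping the entries of the low-rank reconstruction to [0, 1] follows the usual USVT recipe, setting the diagonal of the estimate to zero is an assumption, and the threshold `1.5 * sqrt(n)` and the two-block test model are hypothetical choices rather than the constants used in the paper.

```python
import numpy as np

def usvt(A, tau):
    """Singular value thresholding estimate of the edge-probability matrix of A."""
    U, s, Vt = np.linalg.svd(A)                  # singular values returned in descending order
    keep = s > tau                               # retain singular values above the threshold
    L_hat = (U[:, keep] * s[keep]) @ Vt[keep, :]
    L_hat = np.clip(L_hat, 0.0, 1.0)             # estimated probabilities must lie in [0, 1]
    np.fill_diagonal(L_hat, 0.0)                 # assumption: no self-loops, so zero diagonal
    return L_hat

# Hypothetical two-block stochastic blockmodel used only to exercise the function.
rng = np.random.default_rng(2)
n = 300
block = np.repeat([0, 1], n // 2)
K = np.array([[0.6, 0.2], [0.2, 0.4]])
L = K[np.ix_(block, block)]
np.fill_diagonal(L, 0.0)
iu = np.triu_indices(n, k=1)
A = np.zeros((n, n))
A[iu] = rng.random(iu[0].size) < L[iu]
A = A + A.T

L_hat = usvt(A, tau=1.5 * np.sqrt(n))
err_usvt = np.abs(L_hat - L)[iu].mean()
err_raw = np.abs(A - L)[iu].mean()
print("mean abs error, USVT estimate:", err_usvt)
print("mean abs error, raw adjacency:", err_raw)
```

The centered matrix used for matching would then be `A - usvt(A, tau)`. Since the adjacency matrix is symmetric, an eigendecomposition could replace the SVD; the SVD is kept here to mirror the algorithm as stated.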

If we estimate Inline graphic via Inline graphic and Inline graphic via Inline graphic using USVT, can we recover Inline graphic-matchability using the approximately centered matrices, Inline graphic and Inline graphic, for a suitable Inline graphic? Given the error introduced in estimating the edge probability matrices, the answer is unsurprisingly 'no', at least for the proof techniques we employ herein. However, if we slightly weaken Definition 2.1 to allow for a vanishing fraction of unmatched nodes, then we can recover an analogous result to Theorem 2.3. This motivates the following definition:

Definition 2.4

Let Inline graphic be a dissimilarity. Consider random graphs Inline graphic. We say that Inline graphic and Inline graphic are Inline graphic-matchable if

Definition 2.4

Unwrapping Definition 2.4, we see that Inline graphic-matchability is equivalent to any optimal permutation (under dissimilarity Inline graphic) correctly recovering the labels of at least Inline graphic vertices across Inline graphic and Inline graphic. As the following theorem indicates, under mild model assumptions, Inline graphic and Inline graphic are Inline graphic-matchable asymptotically almost surely, where the estimates Inline graphic and Inline graphic in Eq. (2.2) are the USVT estimates. The proof of Theorem 2.5 can be found in Appendix A.2.

Theorem 2.5

Let Inline graphic and further assume that for each Inline graphic,

  (i) There exists Inline graphic such that Inline graphic entry-wise. Note that for each Inline graphic, Inline graphic is fixed, though we allow Inline graphic to vary in Inline graphic.

  (ii) We have that Inline graphic.

  (iii) Inline graphic is approximately low rank in that there exists a Inline graphic such that Inline graphic where Inline graphic are the singular values of Inline graphic.

If for all Inline graphic and all Inline graphic, we have that there exists Inline graphic such that

Theorem 2.5

and for each Inline graphic, there exist constants Inline graphic such that if Inline graphic is the USVT estimate of Inline graphic with threshold level Inline graphic and if

Theorem 2.5

then Inline graphic and Inline graphic satisfy

Theorem 2.5

Let us take a moment to explore the assumptions in Theorem 2.5. Assumptions (i) and (ii) control the allowable sparsity of the networks, ensuring that the minimum expected degree grows asymptotically faster than Inline graphic. If the mean expected degree were Inline graphic, then the graphs would be a.s. disconnected [6], and our proof techniques would fail as Inline graphic and Inline graphic would no longer concentrate about Inline graphic and Inline graphic with high probability [32]. The rank assumption in (iii) is needed to control the accuracy of the USVT estimates of the unknown Inline graphic’s. Practically, smaller Inline graphic allows us to use suitable low-rank estimates Inline graphic of Inline graphic that are computationally easier to implement; this is indeed the case in many common random graph models such as the stochastic blockmodel [27] (where often Inline graphic), random dot product graphs [57] and latent position random graphs [26] (where Inline graphic is often taken to be Inline graphic [52]), among others.

If Inline graphic is bounded away from Inline graphic entry-wise, and each entry of Inline graphic is Inline graphic (which is indeed the case in the oft-adopted setting, where each Inline graphic for a matrix Inline graphic with entries of order Inline graphic), then Inline graphic as defined in Remark 2.1 satisfies Inline graphic. We then have Inline graphic. From Eq. (A.10) in the proof of Theorem 2.5, we see that Inline graphic-matchability is achieved here for

graphic file with name M422.gif

If, in addition, Inline graphic and each Inline graphic, then up to a logarithmic factor Inline graphic and Inline graphic are Inline graphic-matchable, and an oracle graph matching algorithm would properly align all but potentially a vanishing fraction of the nodes across the graphs.

Remark 2.2

In Theorem 2.5, we estimate Inline graphic via Inline graphic using USVT with threshold Inline graphic. In applications, suitable estimates of Inline graphic can often be obtained with rank Inline graphic of order Inline graphic or Inline graphic [52], especially in the setting of latent space graph models. For the purposes of our proof approach, suitably good means that Inline graphic (similarly for Inline graphic). We do not explore this model selection question further here (i.e., estimating a suitable rank rather than a threshold for our USVT estimates), as in applications often only a relatively small number of singular values are above the USVT threshold.

2.3 When to center?

We have seen above that in the setting where both Inline graphic and Inline graphic are not Inline graphic-matchable, centering Inline graphic and Inline graphic via Inline graphic and Inline graphic can recover Inline graphic-matchability by ameliorating the effect of the differing Inline graphic’s. Moreover, approximately centering by Inline graphic and Inline graphic theoretically recovers Inline graphic-matchability for all but a vanishing fraction of the vertices. A natural question is whether, in the case when Inline graphic, Theorem 2.5 implies that a.s. perfect Inline graphic-matchability is potentially lost when USVT centering is performed unnecessarily.

Consider the following simple example, where Inline graphic and Inline graphic, Inline graphic, and we vary Inline graphic and Inline graphic. In this example, there is no need to center before matching, and the variability introduced by estimating the Inline graphic’s could potentially cause

graphic file with name M457.gif

Fortunately, at least in this example we see this is not the case (see Fig. 1). As matching these graphs exactly (i.e., finding the argmin of the GMP) is computationally challenging, we use as a surrogate for Inline graphic-matchability and Inline graphic-matchability whether the true alignment is a local (rather than global) minimum of the GMP before and after centering. To test this, we match the graph pairs (USVT centered and uncentered) using the constrained gradient-ascent-based graph matching algorithm, FAQ [54], initialized at the true correspondence Inline graphic. While FAQ is not guaranteed to terminate at a local minimum, if it terminates at Inline graphic, then that is evidence in support of Inline graphic’s local optimality. Moreover, if FAQ does not terminate at Inline graphic, then Inline graphic is not a local minimum of the GMP.

Fig. 1.


We plot the mean (Inline graphic1 s.d.) of FAQInline graphicFAQInline graphic over Inline graphic and Inline graphic. The parameters considered are Inline graphic and Inline graphic.

Let FAQInline graphic (resp., FAQInline graphic) denote the number of vertices correctly matched by FAQ initialized at Inline graphic when Inline graphic and Inline graphic are matched directly (resp., when Inline graphic and Inline graphic are USVT centered before they are matched). In Fig. 1, we plot the mean (Inline graphic1 s.d.) of FAQInline graphicFAQInline graphic over Inline graphic and Inline graphic. From the figure, we see that there is no significant performance lost by centering the graphs first. Indeed, in highly structured/low-rank settings (e.g., homogeneous ER or SBM), we can obtain high-fidelity estimates of the individual entries of the Inline graphic vs. the global estimates used in the current proof. These local estimates (which can also be obtained by non-spectral methods, e.g., in the Inline graphic case we can use Inline graphic) should allow for significantly less error to be introduced in the estimation of the Inline graphic’s and will allow for little-to-no theoretic degradation due to centering. As we are more focused on the general Inline graphic case, we do not pursue this further here.

2.4 Core matchability

Often in applications, only a fraction of the vertices in Inline graphic possess a latent matched pair in Inline graphic. We will denote those vertices that have a latent match across graphs as the core vertices (denoted Inline graphic), and we will denote those vertices that do not have a latent match across graphs as the junk vertices (denoted Inline graphic). In this section, we seek to further understand the ability of an oracle graph matching procedure to correctly match the cores across graphs. This motivates the following definition of core-matchability.

Definition 2.6

Let Inline graphic, and consider a partition of the vertex sets into core and junk vertices,

Definition 2.6

Define Inline graphic to be the set of core matching permutations. For dissimilarity Inline graphic, we say that Inline graphic and Inline graphic are core  Inline graphic-matchable if

Definition 2.6

i.e., if any optimal permutation aligning Inline graphic and Inline graphic under Inline graphic perfectly matches the cores across networks.

If we consider Inline graphic with Inline graphic of the form Inline graphic where Inline graphic, then it is reasonable to define Inline graphic and Inline graphic. Indeed, under these model assumptions Inline graphic if either Inline graphic or Inline graphic is in Inline graphic.

Completely analogously to the setting considered in Example 2.2, it is immediate that Inline graphic and Inline graphic need not be asymptotically almost surely core Inline graphic-matchable even with non-vanishing core correlation in Inline graphic. Indeed, as in Example 2.2, Inline graphic and Inline graphic can be chosen to effectively obfuscate the true alignment among the core vertices. Mimicking the results of Theorem 2.3, centering Inline graphic and Inline graphic again a.s. recovers core Inline graphic-matchability of Inline graphic and Inline graphic under mild model assumptions. The proof of Theorem 2.7 is contained in Appendix A.3.

Theorem 2.7

Let Inline graphic and consider Inline graphic and Inline graphic. Suppose that Inline graphic is of the form Inline graphic where Inline graphic, and for each Inline graphic, let Inline graphic be defined as in Eq. (2.4). If for all Inline graphic we have that

Theorem 2.7 (2.9)

 

Theorem 2.7 (2.10)

and also if Inline graphic, then

Theorem 2.7

As before, if we define Inline graphic then for Inline graphic, Inline graphic. If, in addition,

graphic file with name M541.gif

then Eqs. (2.9) and (2.10) hold and core Inline graphic-matchability is recovered. In the event that Inline graphic is Inline graphic or Inline graphic then Theorem 2.7 implies that Inline graphic and Inline graphic are core Inline graphic-matchable in the presence of nearly linear junk, with arbitrary junk structure. This result extends and generalizes the results in [29] to the non-homogeneous ER setting.

As before, if the unknown Inline graphic and Inline graphic are estimated via USVT, then we recover partial core matchability. Before formalizing this, we first need the following extension of Definition 2.4 to the core-junk setting.

Definition 2.8

Let Inline graphic be a dissimilarity. Let Inline graphic, and consider a partition of the vertex sets into core and junk vertices,

Definition 2.8

Define

Definition 2.8

We say that Inline graphic and Inline graphic are core Inline graphic-matchable if

Definition 2.8

The following theorem provides the analogue of Theorem 2.5 in the core-junk setting. Note that the proof is completely analogous to that in Theorem 2.5, and so is omitted.

Theorem 2.9

Let Inline graphic and consider Inline graphic and Inline graphic. Suppose that Inline graphic is of the form Inline graphic where Inline graphic. Adopt the assumptions on Inline graphic and Inline graphic from Theorem 2.5, and assume for each Inline graphic,

Theorem 2.9 (2.11)

 

Theorem 2.9 (2.12)

and also assume Inline graphic. For each Inline graphic, there exist constants Inline graphic such that if Inline graphic is the USVT estimate of Inline graphic with threshold level Inline graphic and if

Theorem 2.9

then

Theorem 2.9

3. Simulations and experiments

In the following sections, we explore the impact on graph matchability of USVT centering in both simulated and real data settings. We note here that precisely determining the level of Inline graphic, Inline graphic and Inline graphic-matchability is infeasible for even modestly sized networks, as this would require exactly solving the NP-hard GMP. To circumvent this, we instead match our networks using the FAQ algorithm of [54] initialized at a variety of starting points including the true correspondence. As it is a Frank–Wolfe [24] based algorithm, if FAQ terminates at the true correspondence (or at a permutation which matches a high percentage of the vertices), then the true correspondence is an estimated local minimum of the GMP. Moreover, if FAQ is initialized at the true correspondence and does not terminate at the true correspondence, then the true correspondence is not a local minimum. Comparing objective function values across estimated local minima then allows us to approximately gauge the global optimality of the true correspondence. While this is not the same as directly finding the global minimum desired in the definition of matchability, it nonetheless provides a useful, principled heuristic for empirically studying both matchability and deviations therefrom.

3.1 Simulation

To explore the utility of USVT centering as a graph matching preprocessing step, we consider the following experiment. We let Inline graphic with

graphic file with name M582.gif

and Inline graphic of the form

graphic file with name M584.gif

and we use the FAQ algorithm of [54] to match (i) Inline graphic and Inline graphic directly (labelled ‘Uncentered’ in Figs 2 and 3); (ii) Inline graphic and Inline graphic (labelled ‘Centered’ in Figs 2 and 3); and (iii) Inline graphic and Inline graphic (labelled ‘Approx. Centered’ in Figs 2 and 3). In each figure, we initialize the FAQ algorithm at Inline graphic (i.e., at the true latent alignment) and at Inline graphic (i.e., at the alignment completely confusing blocks one and two across networks). We plot the mean fraction of vertices matched correctly (Inline graphics.d.) by FAQ at the starting point that achieves the lowest graph matching objective function score (averaged over Inline graphic Monte Carlo replicates). As mentioned above, if the fraction matched correctly is less than Inline graphic, then the true alignment is not a local minimum of the graph matching objective function and the graphs are not Inline graphic, Inline graphic, or Inline graphic-matchable (depending on what input FAQ is matching).

Fig. 2.


Fraction correctly matched by FAQInline graphics.d. (optimized over two different initializations: at Inline graphic and at Inline graphic) vs. Inline graphic when matching (i) Inline graphic and Inline graphic directly (labelled ‘Uncentered’); (ii) Inline graphic and Inline graphic (labelled ‘Centered’); and (iii) Inline graphic and Inline graphic (labelled ‘Approx. Centered’). Here, Inline graphic captures the level of correlation between Inline graphic and Inline graphic (higher Inline graphic means more correlation), Inline graphic is fixed and results are averaged over Inline graphic Monte Carlo replicates.

Fig. 3.


Fraction correctly matched by FAQInline graphics.d. (optimized over two different initializations: at Inline graphic and at Inline graphic) vs. Inline graphic when matching (i) Inline graphic and Inline graphic directly (labelled ‘Uncentered’); (ii) Inline graphic and Inline graphic (labelled ‘Centered’); and (iii) Inline graphic and Inline graphic (labelled ‘Approx. Centered’). Here, Inline graphic is fixed and results are averaged over Inline graphic Monte Carlo replicates.

In Fig. 2, we consider Inline graphic (i.e., the graphs are size Inline graphic), and in the USVT estimates Inline graphic we used Inline graphic as suggested in [9]. We plot Inline graphic vs. the mean fraction of vertices matched correctly (Inline graphics.d.). It is unsurprising, in light of Example 2.2, that Inline graphic and Inline graphic are not only directly not Inline graphic-matchable, but that the alignment found by FAQ matches none of the vertices correctly across graphs. Also of note in the figure is that as Inline graphic increases, the oracle centered graphs Inline graphic and Inline graphic appear to be nearly Inline graphic-matchable (in that the estimated local minimum found by FAQ is close to Inline graphic). The steep performance drop-off as Inline graphic decreases is a consequence of the fact that in low-correlation regimes (low for a given Inline graphic), Inline graphic-matchability is often not recovered even through centering. Performance in the approximately centered case tracks performance in the centered case, with the USVT centering recovering the gains of the oracle centering. This empirically suggests that the Inline graphic lower bound in Theorem 2.5 is not sharp, as USVT centering recovers full Inline graphic-matchability in the high Inline graphic regime. We surmise that if Inline graphic is truly low-rank, USVT centering and true centering will recover perfect Inline graphic- and Inline graphic-matchability respectively as Inline graphic increases.

In Fig. 3 we repeat the above simulation with Inline graphic fixed and Inline graphic. Using Inline graphic as the USVT threshold, we again plot the mean fraction of vertices matched correctly (Inline graphic s.d.) vs. Inline graphic. As before, without centering the optimal alignment found by FAQ matches very few of the vertices correctly across graphs. The performance increase in the oracle centered setting as Inline graphic increases is a consequence of the Inline graphic-matchability (in Theorem 2.3) being an asymptotic result; indeed, we should not expect correlated small networks to be almost surely Inline graphic-matchable even with the oracle centering. We note, however, that Inline graphic (i.e., graph order Inline graphic) here is sufficient for the asymptotically perfect Inline graphic-matchability to be recovered. Again we see that performance in the approximately centered case tracks performance in the centered case, with the USVT centering achieving almost all of the gains of the oracle centering.

3.2 Twitter data

In order to analyse the impact of USVT centering with real data, we considered two graphs derived from Twitter.1 The two graphs are based on the most active Twitter users from April and May 2014. The graphs are unweighted with an edge between users if a user mentioned another user during the given month. After keeping only the largest common connected component, the number of users was 431 in each graph.

As can be seen in Fig. 4, many vertices in this data set have very similar connectivity patterns. Indeed, the empirical Pearson correlation between the entries in the two adjacency matrices is Inline graphic. It is not surprising that the similar graph topology across networks leads to good performance when matching the graphs without centering. Repeating the experiment in Section 3.1—i.e., matching the adjacency matrices Inline graphic (‘April’) and Inline graphic (‘May’) using FAQ initialized at Inline graphic—yields Inline graphic vertices correctly aligned across networks. Although the true correspondence is not optimal (according to the GM objective function), the estimated local optimal correspondence does match Inline graphic of the vertices correctly across networks without the need for centering. It is worth noting that preprocessing the data via USVT centering before matching again yields Inline graphic vertices correctly matched by FAQ initialized at Inline graphic (with a suitable USVT threshold here being Inline graphic). This suggests that the centering procedure does not hurt performance when the graph topologies are similar across networks, and as we will demonstrate below, can significantly increase performance when the graph topologies differ across networks. We do note here that we do not zero out the diagonal of Inline graphic in the USVT step in this real data example, as here, hollow Inline graphic’s led to significantly worse performance than the non-hollow Inline graphic’s.

Fig. 4.


In the left two panels, we plot the adjacency matrices of aligned Twitter graphs from April and May. In the right panel, we plot the degrees of each vertex in April vs. the degrees of the same vertex in May. The vertices were both sorted according to ascending degree in the April graph.

While both the centered and uncentered graphs are highly Inline graphic- and Inline graphic-matchable, respectively, centering does have a very interesting algorithmic effect here; see Fig. 5. In the figure, we plot the number of vertices in the April–May Twitter graphs correctly matched by FAQ vs. the graph matching objective function value. In each panel (on the left matching Inline graphic and Inline graphic, and on the right Inline graphic and Inline graphic), we initialize FAQ at 100 different starting points: once at Inline graphic (labelled ‘P0 = I’ in the legend) and the rest at random permutation restarts (labelled ‘rand. start’ in the legend). The figure suggests that centering has the effect of creating a more stable objective function gap between the estimated optimal permutation and suboptimal alternatives. In a setting where multiple random restarts are possible (and needed) to recover an unknown latent alignment, this suggests that the optimal alignment is perhaps more easily recognized in the centered graph regime, and hence that an online stopping criterion is more easily implemented.

Fig. 5.


We plot the number of vertices in the April–May Twitter graphs correctly matched by FAQ vs. the graph matching objective function value. In each panel (on the left matching Inline graphic and Inline graphic, and on the right Inline graphic and Inline graphic), we initialize FAQ at 100 different starting points: once at Inline graphic (labelled ‘P0 = I’ in the legend) and the rest at random restarts (labelled ‘rand. start’ in the legend).

To explore the performance of the two approaches (centering and not centering) in the setting of different network topologies, we consider the following synthetic data experiment. We choose Inline graphic random users from the Twitter networks and add Inline graphic to their induced subgraphs in the May network, where Inline graphicERInline graphic (followed by again binarizing Inline graphic); i.e., if the set of randomly chosen vertices is Inline graphic, then Inline graphic and the subsequent Inline graphic is binarized before matching or centering. We consider Inline graphic. This experiment simulates the setting where a fraction of the network changes its behaviour from April to May; in this case by increasing their volume of mentions month to month. For each value of Inline graphic, we repeat this experiment Inline graphic times and plot the mean accuracy (Inline graphic1 s.d.) of graph matching using FAQ initialized at Inline graphic both with and without USVT centering; see Fig. 6. This experiment demonstrates the capacity for USVT to maintain Inline graphic-matchability in the face of additive deviations in the network structure. These deviations have the effect of altering the graph topology month-to-month, and with enough signal, they have a precipitously negative impact on the performance of matching sans centering. Centering ameliorates this effect, and emphasizes the common signal in the networks by removing the effect of this additive noise. It is interesting to note that for small values of Inline graphic, centering negatively impacts algorithmic performance. We view this as potentially an artifact of the noise in these settings not being sufficient to obfuscate the true matching without centering.
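The perturbation used in this experiment (adding an Erdős–Rényi graph to the subgraph induced by a random vertex subset, then binarizing) can be sketched as follows. The function name `add_er_noise` and the parameter names `k` (number of chosen vertices) and `p` (noise edge probability) are hypothetical. The sketch assumes a symmetric, hollow, 0/1 integer adjacency matrix.

```python
import numpy as np


def add_er_noise(adj, k, p, rng):
    """Add ER(k, p) noise to the subgraph induced by k random vertices.

    adj : (n, n) symmetric 0/1 integer adjacency matrix with zero diagonal
    k   : number of vertices whose induced subgraph is perturbed
    p   : edge probability of the added Erdos-Renyi noise graph
    rng : numpy.random.Generator
    Returns the perturbed (binarized) matrix and the chosen vertex indices.
    """
    n = adj.shape[0]
    idx = rng.choice(n, size=k, replace=False)
    # Sample the strict upper triangle, then symmetrize (no self-loops).
    upper = np.triu(rng.random((k, k)) < p, k=1).astype(adj.dtype)
    noise = upper + upper.T
    out = adj.copy()
    sub = np.ix_(idx, idx)
    # Adding noise and capping at 1 binarizes the perturbed block.
    out[sub] = np.minimum(out[sub] + noise, 1)
    return out, idx
```

Only entries inside the chosen block can change, and they can only switch from 0 to 1, which mirrors the "increased volume of mentions" interpretation in the text.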

Fig. 6.


In the left panel, we plot the average matching accuracy (Inline graphic s.d.) of graph matching using FAQ initialized at Inline graphic when first choosing Inline graphic random vertices, denoted Inline graphic, from graph Inline graphic and then substituting Inline graphic (where Inline graphic is again binarized after noise is added) before matching and centering; here Inline graphic. Accuracy is plotted vs. Inline graphic. In the right two panels, we plot the degrees of each vertex in April vs. the degrees of the same vertex in May (with the Inline graphic substitution).

In the core-junk setting, the heterogeneity among the junk vertices offers a further setting for demonstrating the utility of USVT-centered graph matching. To see this, we consider the following experiment: choose Inline graphic uniformly sampled core vertices from the Twitter network and Inline graphic uniformly sampled junk vertices, Inline graphic, for the April graph and Inline graphic uniformly sampled junk vertices, Inline graphic, for the May graph. As before, we match Inline graphic and Inline graphic using FAQ initialized at Inline graphic both with and without USVT centering; results are summarized in Fig. 7. As seen previously, the ability of USVT-centering to ameliorate the degree/distributional heterogeneity (here among the junk vertices) leads to superior core label recovery compared to the uncentered matching setting.

Fig. 7.


We plot the average core matching accuracy (Inline graphic s.d.) for Inline graphic and Inline graphic using FAQ initialized at Inline graphic (with Inline graphic) against Inline graphic. Results are averaged over Inline graphic Monte Carlo iterates.

3.3 Connectomes

For our next example, we consider the diffusion MRI data from [31]. The dataset consists of test–retest pairs (used to evaluate reproducibility of the magnetization prepared rapid acquisition gradient echo image protocol). Each scan is converted into a weighted connectome by considering 70 brain regions of interest (labelled according to the Desikan brain atlas [16]) as the vertices, with edge weights counting the number of neural fiber bundles connecting the regions. See [46,47] for more detail on how these graphs were constructed. As vertices correspond to canonical brain regions of interest, it is natural to consider the true correspondence across graphs as being given by the identity mapping.

To illustrate the role of USVT centering in this data set, we first consider as an example a pair of graphs generated as above from the data in [31]. The respective adjacency matrices for this graph pair are shown in Fig. 8. Matching these brains directly using FAQ initialized at Inline graphic yields an estimated local optimum with 65 vertices correctly aligned across graphs; indeed, by permuting vertices Inline graphic, we obtain a better objective function value than the GMP objective evaluated at Inline graphic. We seek to understand the ability of USVT centering, which is global in nature, to correct these local mismatches.

Fig. 8.


Adjacency matrices of two sample brains from the dataset in [31].

To study this further, we apply a variant of the USVT procedure in which we automatically select the number of singular values to threshold by combining the ideas of USVT with the profile likelihood work of [63]; to wit, we select the threshold dimension via an elbow analysis of the scree plot of the singular values. We chose this automated procedure rather than setting a singular value threshold because these graphs are weighted, and the common threshold of Inline graphic from [9,55] is presented for the unweighted setting. Centering the pair of graphs from Fig. 8 recovers the identity Inline graphic as an estimated local minimum of the GMP, and the global centering corrects the localized mismatch.
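One way to automate an elbow choice on the scree plot is a two-group Gaussian profile likelihood in the spirit of [63]. The sketch below is an assumption about how such a rule could look, not the exact variant used in the experiments. The function names `profile_likelihood_elbow` and `center_by_rank` are hypothetical.

```python
import numpy as np


def profile_likelihood_elbow(values):
    """Pick the split q that maximizes a two-group Gaussian likelihood.

    The values are sorted in decreasing order, the first q are modelled
    with one mean, the rest with another, and both groups share a pooled
    variance. Returns q, the number of "large" values to keep.
    """
    d = np.sort(np.asarray(values, dtype=float))[::-1]
    p = d.size
    if p < 3:
        raise ValueError("need at least three values")
    best_q, best_ll = 1, -np.inf
    for q in range(1, p):
        left, right = d[:q], d[q:]
        ss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        var = max(ss / (p - 2), 1e-12)  # pooled variance, guarded against 0
        ll = -0.5 * p * np.log(2.0 * np.pi * var) - 0.5 * ss / var
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q


def center_by_rank(weights, rank):
    """Subtract the best rank-`rank` approximation from a weighted graph."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return weights - low_rank
```

A typical use would compute the singular values of a weighted adjacency matrix, pass them to `profile_likelihood_elbow`, and feed the resulting rank to `center_by_rank`. No clipping to [0, 1] is applied here because the edge weights are counts rather than probabilities.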

Extending this to a 41-scan sample from [31] (hence, we consider 41 graphs each with 70 vertices), we run FAQ initialized at Inline graphic for the Inline graphic pairs of distinct graphs with both USVT centering and no centering. When matching graph Inline graphic and graph Inline graphic for Inline graphic, we let

graphic file with name M736.gif

In Fig. 9, we plot a heatmap of the Inline graphic differences Inline graphic, so that the Inline graphicth entry in the heatmap corresponds to the excess number of correct matches achieved by USVT centering. Red values indicate more correctly matched via centering and blue values indicate more correctly matched via no centering. The colour intensity indicates the value of Inline graphic achieved, with darker colours indicating more (in the red case) or fewer (in the blue case) vertices correctly matched after centering. The figure demonstrates that the phenomenon observed in the graphs in Fig. 8 was not an anomaly. Only two pairs see an improvement in matching accuracy when not centering, while Inline graphic pairs see an improvement in matching accuracy when USVT centering. Moreover, while many of the mismatches are local in nature, they are nonetheless ameliorated by the global USVT centering procedure.

Fig. 9.


For the Inline graphic weighted brain pairs, we plot a heatmap of the differences Inline graphic, so that the Inline graphicth entry in the heatmap corresponds to the excess number of correct matches achieved by USVT centering. Red values indicate more correctly matched via centering and blue values indicate more correctly matched via no centering. The colour intensity indicates the value of Inline graphic achieved, with darker colours indicating more (in the red case) or fewer (in the blue case) vertices correctly matched after centering.

If we consider running the same experiment on the unweighted brain graphs (using USVT centering with threshold Inline graphic), we see the delicate nature of the USVT threshold in data applications (see Fig. 10). We note here that, again, in this unweighted case, we do not zero out the diagonal of Inline graphic in the USVT step in this real data example. When Inline graphic, Inline graphic pairs achieved improved matching performance when USVT centering first, and Inline graphic pairs achieved improved matching performance when not centering. When Inline graphic, Inline graphic pairs achieved improved matching performance when USVT centering first, and Inline graphic pairs achieved improved matching performance when not centering. These results suggest two important take-aways: first, performance is intimately tied to proper thresholding; and second, in this example, USVT-centering is more effective in the weighted edge case. This suggests both that the magnitudes of the weights contribute significantly to the mismatch and that USVT centering is effective at ameliorating this edge weight heterogeneity; this is not entirely unexpected as the centering is precisely trying to eliminate the different edge probability/weight structures across the Inline graphic.

Fig. 10.


For the Inline graphic unweighted brain pairs, we plot a heatmap of the differences Inline graphic, so that the Inline graphicth entry in the heatmap corresponds to the excess number of correct matches achieved by USVT centering. Red values indicate more correctly matched via centering and blue values indicate more correctly matched via no centering. The colour intensity indicates the value of Inline graphic achieved, with darker colours indicating more (in the red case) or fewer (in the blue case) vertices correctly matched after centering. In the left panel, we center by USVT with Inline graphic; in the right panel by Inline graphic.

4. Discussion

Understanding the limits of Inline graphic-matchability is an essential step in robust multiple graph inference regimes. When graphs are not Inline graphic-matchable—i.e., the true node correspondence cannot be recovered in the face of noise—paired graph inference methodologies that utilize the across graph correspondence (see, for example, [2,50]) cannot gainfully be employed. Non-Inline graphic-matchability can limit analysis to methods which rely on graph statistics which are invariant to relabelling of the vertices, which can be useful, but lack the full power of their parametric (with the labelling as a parameter) counterparts. In this paper, we establish initial theoretical results on Inline graphic-matchability when the graphs to be matched differ in distribution, and when only a fraction of the graphs are matchable.

While our theoretical results and subsequent simulations and experiments provide a basis for a deeper understanding of the effect that distributional heterogeneity has on Inline graphic-matchability, there is still much to be done. For example, in our present USVT centering step, the across graph correlation provided by Inline graphic is not utilized. We suspect that the error in the USVT steps could be greatly reduced by leveraging Inline graphic or an estimate thereof. We are also exploring graph normalization strategies other than centering, such as kernel smoothing (using a small number of a priori known correspondences to choose the proper smoothing kernels), which may be more appropriate in the presence of multiplicative or other nonlinear noise structures. We suspect that the growth rate of Inline graphic in Theorem 2.5 is not sharp, as in simulation and real data settings the true correspondence is almost perfectly recovered via USVT centering before matching; however, sharpening the lower bound on Inline graphic with our present methods does not seem feasible, and new ideas and techniques will need to be employed.

We suspect that the bounds on Inline graphic vs. Inline graphic obtained in Theorem 2.9 are also suboptimal. As an informal argument we can consider known results for the quadratic assignment problem for i.i.d. entries [8]. In the uncorrelated dense homogeneous Erdős–Rényi setting, these results imply that the best solution to the GMP, Eq. (1.1), will reduce the objective function Inline graphic as compared to a random guess. On the other hand, in the dense homogeneous Erdős–Rényi setting with constant correlation, the best solution is Inline graphic better than a random guess. Under the heuristic that in order for the core vertices to be matched correctly, the signal from correlation must be greater than the possible improvements in the all noise setting, we can conjecture that a core matchability threshold at approximately Inline graphic may be possible. Using our present proof technique, we are unable to achieve this rate. This heuristic argument, though problematic, provides a potential guidepost for future work.

In addition, while the theoretical results presented herein are for the case that the graphs are simple undirected graphs with no edge-weights, the graph matching framework of Definition 1.1 is flexible, allowing us to accommodate many of the features (both those considered above, and additional eccentricities) inherent to real data settings. In the weighted, loopy setting, the matching can occur between the weighted adjacency matrices or normalized Laplacian matrices, Inline graphic (where Inline graphic is the diagonal matrix with Inline graphicth entry equal to Inline graphic) and the similarly defined Inline graphic. To match directed graphs, the graphs can either be made undirected (for example, by matching Inline graphic to Inline graphic) or the directed adjacency matrices can be directly plugged into Eq. (1.1). Developing similar results to Theorems 2.5 and 2.9 in models (akin to our CorrER model) that incorporate these graph features, as well as additional vertex and edge features, is a natural next step.

The information and computational limits for Inline graphic-matchability are still open problems for which we have pushed the boundaries, but significantly more work remains to be done. These problems are in analogy to the recently addressed problems of detection and recovery for the planted partition and planted clique problems for a single graph [1,22,42]. For these settings, exact fundamental limits have been established and polynomial time algorithms have been shown to achieve or nearly achieve these limits. Obtaining similar results for the GMP is a key step towards a robust statistical framework for multiple graph inference.

Acknowledgements

This material is based on research sponsored by the Air Force Research Laboratory and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8750-18-2-0035. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and DARPA or the US Government. We would also like to thank Prof. Carey E. Priebe, Prof. Minh Tang and Joshua Cape for their helpful discussions in the writing of this manuscript.

Funding

MIT Lincoln Labs and the Department of Defense (for Dr. Sussman); NIH (for Dr. Lyzinski, via grant BRAIN U01-NS108637); Air Force Research Laboratory and DARPA under agreement number FA8750-18-2-0035.

Appendix. A. Proofs

Herein, we collect proofs of the main theoretical results in this manuscript. Before stating our proofs, we first state some well-known facts about the bivariate Bernoulli distribution. Indeed, if Inline graphic model, then for each Inline graphic, Inline graphic and Inline graphic can be realized as a bivariate Bernoulli random variable. This will be a key insight in the proof of our main results, Theorems 2.3 and 2.5.

Spelling this out further, a pair of Bernoulli random variables Inline graphic has a BiBernoulli distribution with

graphic file with name M789.gif (A.1)

if Inline graphic for each Inline graphic. A key property of BiBernoulli random variables is that they can be generated by a triple of independent Bernoulli random variables. For Inline graphic as above, setting Inline graphic), Inline graphic, and Inline graphic, with Inline graphic and Inline graphic independent yields

graphic file with name M798.gif

In the Inline graphic model, we then have that

graphic file with name M800.gif

where

graphic file with name M801.gif (A.2)

are independent Bernoulli random variables.
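For readers who want to simulate such pairs, the sketch below draws correlated Bernoulli variables by sampling from the joint distribution directly. It relies only on the identity P(X = 1, Y = 1) = pq + ρ·sqrt(p(1−p)q(1−q)), which follows from the definition of correlation. It is not the three-variable construction of Eq. (A.2), whose exact form is not reproduced here. The function name and its argument names are hypothetical.

```python
import numpy as np


def sample_correlated_bernoulli(p, q, rho, size, rng):
    """Draw (X, Y) with X ~ Bern(p), Y ~ Bern(q) and corr(X, Y) = rho.

    Requires 0 < p < 1. Samples X first, then Y from its conditional
    distribution given X, which reproduces the joint distribution.
    """
    p11 = p * q + rho * np.sqrt(p * (1 - p) * q * (1 - q))
    # The joint probabilities must all be valid for the triple (p, q, rho).
    if not (max(0.0, p + q - 1.0) <= p11 <= min(p, q)):
        raise ValueError("(p, q, rho) is not a feasible combination")
    x = rng.random(size) < p
    u = rng.random(size)
    # P(Y=1 | X=1) = p11 / p and P(Y=1 | X=0) = (q - p11) / (1 - p).
    y = np.where(x, u < p11 / p, u < (q - p11) / (1 - p))
    return x.astype(int), y.astype(int)
```

Applying this entrywise to the upper triangle of a pair of edge-probability matrices, and then symmetrizing, yields a pair of correlated edge-independent graphs suitable for small simulations.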

A.1 Proof of Theorem 2.3

The key to the proof of Theorem 2.3 is the well-known McDiarmid’s inequality [41].

Proposition A.1 (McDiarmid’s inequality). —

Let Inline graphic be a sequence of independent random variables. Let Inline graphic be such that for all Inline graphic, and for all Inline graphic  

graphic file with name M806.gif

If Inline graphic, then for any Inline graphic,

graphic file with name M809.gif
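To make the bounded-differences bound concrete, the sketch below compares an empirical tail probability with the commonly quoted two-sided form of McDiarmid's inequality, 2 exp(-2t^2 / sum_i c_i^2). The displayed statement of Proposition A.1 did not survive in this copy, so this form, the function name `mcdiarmid_tail_bound`, and the test function (a sum of independent Bernoulli variables, for which every c_i equals 1) are assumptions made for illustration.

```python
import numpy as np

def mcdiarmid_tail_bound(c, t):
    """Two-sided bounded-differences bound: 2 * exp(-2 t^2 / sum_i c_i^2)."""
    c = np.asarray(c, dtype=float)
    return 2.0 * np.exp(-2.0 * t**2 / np.sum(c**2))

rng = np.random.default_rng(1)
m, t, trials = 200, 15.0, 20_000
probs = rng.uniform(0.1, 0.9, size=m)   # heterogeneous Bernoulli parameters
draws = rng.random((trials, m)) < probs
f = draws.sum(axis=1)                   # changing one coordinate moves f by at most 1
empirical = np.mean(np.abs(f - probs.sum()) >= t)
bound = mcdiarmid_tail_bound(np.ones(m), t)
```

The empirical tail is typically far below the bound, which is expected: the inequality uses only the bounded-difference constants and no variance information.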

Proof of Theorem 2.3. Let Inline graphic, and consider Inline graphic and Inline graphic.

If Inline graphic and Inline graphic are Inline graphic-correlated Bernoulli random variables with respective parameters Inline graphic and Inline graphic, then it follows that

graphic file with name M818.gif

Let Inline graphic be a permutation matrix that permutes exactly Inline graphic labels and let Inline graphic be the associated permutation for Inline graphic. Note that

graphic file with name M823.gif

satisfies

graphic file with name M824.gif

where

graphic file with name M825.gif

is the number of transpositions induced by Inline graphic. Note that Inline graphic, and so

graphic file with name M828.gif (A.3)

For each Inline graphic, define the matrix Inline graphic as in Eq. (2.3). We then have

graphic file with name M831.gif (A.4)

For ease of notation, we define

graphic file with name M832.gif

so that

graphic file with name M833.gif

As Inline graphic, and each Inline graphic pair is a function of three independent Bernoulli random variables (the Inline graphic from Eq. (A.2) that are independent by construction), we have that Inline graphic is a function of

graphic file with name M838.gif

and so is a function of Inline graphic independent Bernoulli random variables, where Inline graphic then satisfies

graphic file with name M841.gif

Changing the value of one of these Bernoulli random variables leaving all others fixed can change the value of at most one Inline graphic pair, and this pair appears in two terms in the sum Inline graphic. As each term in the sum of Inline graphic is bounded in Inline graphic, we have that, in the notation of Proposition A.1, each Inline graphic can be uniformly set to Inline graphic. Proposition A.1 then yields (setting Inline graphic)

graphic file with name M849.gif (A.5)

with Inline graphic being an appropriate positive constant that may change line to line. If Inline graphic then

graphic file with name M852.gif (A.6)

To finish the proof, we apply a union bound on all such Inline graphic. The number of such permutations Inline graphic that permute Inline graphic vertex labels is upper bounded by Inline graphic. Combining with Eq. (A.6), we have that under the assumptions of the theorem

graphic file with name M857.gif

as desired.□
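The union bound in the last step needs a count of the permutations that move a fixed number of labels. The specific bound used above was lost in extraction; a standard one, assumed here, is that the number of permutations of n labels moving exactly k of them is at most C(n, k) k!, which is at most n^k. The brute-force check below (with the hypothetical helper `count_moving_exactly`) confirms this on a small case.

```python
from itertools import permutations

def count_moving_exactly(n, k):
    """Number of permutations of range(n) with exactly k non-fixed points."""
    return sum(
        1 for perm in permutations(range(n))
        if sum(perm[i] != i for i in range(n)) == k
    )

n = 6
counts = {k: count_moving_exactly(n, k) for k in range(n + 1)}
# The assumed union-bound count n**k dominates the exact count for every k.
dominated = all(counts[k] <= n**k for k in range(n + 1))
```

Because the count grows at most like n^k while the per-permutation failure probability decays exponentially in a quantity that grows with k, the sum over k stays small under suitable assumptions.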

A.2 Proof of Theorem 2.5

Key to the proof of Theorem 2.5 are the following lemmas, adapted here from [55, Lemmas 1 and 2].

Lemma A.1

Let Inline graphic. Suppose Inline graphic for Inline graphic fixed. Let Inline graphic be the singular value decomposition of Inline graphic, and let

Lemma A.1

Then

Lemma A.1

where Inline graphic are the singular values of Inline graphic.
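The estimator in Lemma A.1 is a thresholded singular value decomposition. The sketch below shows the general idea on a simulated two-block network: singular components of the adjacency matrix whose singular value exceeds a threshold are kept and the rest are discarded. The threshold `tau`, the block probabilities and the function name `svt_estimate` are assumptions for illustration and are not taken from Algorithm 1.

```python
import numpy as np

def svt_estimate(a, tau):
    """Keep only the singular components of `a` with singular value above `tau`."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    keep = s > tau
    return (u[:, keep] * s[keep]) @ vt[keep, :]

rng = np.random.default_rng(2)
n = 300
# Hypothetical rank-2 edge-probability matrix from a two-block model.
labels = np.repeat([0, 1], n // 2)
block = np.array([[0.6, 0.2], [0.2, 0.5]])
p = block[labels][:, labels]
upper = np.triu(rng.random((n, n)) < p, k=1)
a = (upper + upper.T).astype(float)   # hollow, symmetric adjacency matrix
tau = 2.0 * np.sqrt(n * p.max())      # assumed threshold of order sqrt(n * max p)
p_hat = svt_estimate(a, tau)
err_raw = np.linalg.norm(a - p) / n      # scaled Frobenius error of A itself
err_svt = np.linalg.norm(p_hat - p) / n  # scaled Frobenius error after thresholding
```

In this low-rank setting the thresholded estimate is much closer to the edge-probability matrix than the raw adjacency matrix is, which is what makes centering by such an estimate plausible before matching.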

Lemma A.2

Let Inline graphicER(Inline graphic); i.e., Inline graphic is a hollow, symmetric matrix with Inline graphic. Assume Inline graphic for some Inline graphic. If Inline graphic for a constant Inline graphic, then for all Inline graphic there exists a constant Inline graphic such that

Lemma A.2
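Lemma A.2 is a spectral-norm concentration statement for the adjacency matrix around its expectation. Its display is missing from this copy; the sketch below assumes the usual form of such results, in which the spectral norm of A minus its mean is of order sqrt(n p_max), and checks numerically that the ratio stays bounded as n grows. The probability range and sizes are arbitrary choices for the simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
ratios = {}
for n in (200, 400, 800):
    p = rng.uniform(0.05, 0.3, size=(n, n))
    p = np.triu(p, k=1)
    p = p + p.T                                   # hollow, symmetric probabilities
    upper = np.triu(rng.random((n, n)) < p, k=1)  # independent edges above the diagonal
    a = (upper + upper.T).astype(float)
    spec = np.linalg.norm(a - p, ord=2)           # spectral norm of the centered matrix
    ratios[n] = spec / np.sqrt(n * p.max())
```

The values in `ratios` should remain roughly constant across the three sizes, consistent with a bound of the assumed order.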

Note first that

graphic file with name M878.gif

with the analogous result holding for Inline graphic. Under the assumptions of Theorem 2.5, Lemma A.2 implies that there exist constants Inline graphic such that

graphic file with name M881.gif

with probability at least Inline graphic, and therefore there exist constants Inline graphic such that

graphic file with name M884.gif

with probability at least Inline graphic.

Next, we apply Lemma A.1 with Inline graphic, Inline graphic (resp., Inline graphic, Inline graphic). With probability at least Inline graphic, there exist constants Inline graphic such that if Inline graphic then (where Inline graphic is as defined in the USVT pseudocode, Algorithm 1)

graphic file with name M894.gif

where the equality follows from the rank assumption on Inline graphic in Theorem 2.5; similarly, for Inline graphic we have

graphic file with name M897.gif

Combining the above, we have that there exists an event Inline graphic such that Inline graphic, and on Inline graphic,

graphic file with name M901.gif

To prove Theorem 2.5, we proceed as follows. Fix Inline graphic so that Inline graphic; i.e., Inline graphic permutes exactly Inline graphic labels. Two simple applications of the triangle inequality yield that

graphic file with name M906.gif

and

graphic file with name M907.gif

Combining the above, we have that

graphic file with name M908.gif

In the proof of Theorem 2.3, if we set Inline graphic in Eq. (A.5) when applying McDiarmid’s inequality, then under the assumptions of Theorem 2.5, there exists an event Inline graphic with Inline graphic such that on Inline graphic,

graphic file with name M913.gif

Next, note that

graphic file with name M914.gif

Hoeffding’s inequality (see, for example, [11]) yields that

graphic file with name M915.gif (A.7)

 

graphic file with name M916.gif (A.8)

with probability at least

graphic file with name M917.gif (A.9)

(where the last inequality follows from the assumptions of the theorem, as under these assumptions Inline graphic). Therefore, there exists an event Inline graphic such that Eqs. (A.7) and (A.8) hold on Inline graphic, and Inline graphic.

Writing

graphic file with name M922.gif

we see that

graphic file with name M923.gif

We see then that on Inline graphic,

graphic file with name M925.gif (A.10)

Let Inline graphic be such that

graphic file with name M927.gif

If Inline graphic for Inline graphic, Inline graphic on Inline graphic, where

graphic file with name M932.gif

Define the event

graphic file with name M933.gif

and note that

graphic file with name M934.gif

Combined, this yields

graphic file with name M935.gif

as desired.□

A.3 Proof of Theorem 2.7

For a given permutation Inline graphic on Inline graphic, we define the permutation Inline graphic uniquely as follows:

graphic file with name M939.gif (A.11)

For example, if Inline graphic, Inline graphic, and

graphic file with name M942.gif

then

graphic file with name M943.gif

For a permutation matrix Inline graphic, we define Inline graphic analogously, where we recall here that Inline graphic is the set of permutation matrices Inline graphic in Inline graphic satisfying Inline graphic (i.e., fixing all core labels). Define Inline graphic. Define the events

graphic file with name M951.gif

Inline graphic is the event that the optimal GMP permutation is not in Inline graphic and the graphs are not core Inline graphic-matchable for Inline graphic. As Inline graphic, we have that Inline graphic.

Suppose that Inline graphic (with corresponding permutation Inline graphic) permutes Inline graphic core labels, where

graphic file with name M961.gif (A.12)

 

graphic file with name M962.gif (A.13)

Applying the results in Appendix A on the bivariate Bernoulli distribution, we see that Inline graphic is a function of Inline graphic independent Bernoulli random variables, where

graphic file with name M965.gif

As in the proof of Theorem 2.3, we next apply Proposition A.1 to bound the probability that Inline graphic provides a better matching than Inline graphic. By the assumption that Inline graphic if either Inline graphic or Inline graphic is a junk vertex, it holds that

graphic file with name M971.gif

To ease notation, we define Inline graphic, so that

graphic file with name M973.gif

To use a union bound, note that the number of permutations Inline graphic with error counts in Eqs. (A.12) and (A.13) given by Inline graphic and Inline graphic is bounded above by Inline graphic. Let Inline graphic be the number of core vertices permuted by Inline graphic. Hence, if Inline graphic and

graphic file with name M981.gif

then we have that

graphic file with name M982.gif

as desired.□

Footnotes

1

These graphs were provided as part of the DARPA XDATA project.

Contributor Information

Vince Lyzinski, Email: vlyzinsk@umd.edu.

Daniel L Sussman, Email: sussman@bu.edu.

References

  • 1. Abbe E. & Sandon C. (2015) Community detection in general stochastic block models: fundamental limits and efficient algorithms for recovery. 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. Washington, DC, USA: IEEE Computer Society, pp. 670–688.
  • 2. Asta D. & Shalizi C. (2015) Geometric network comparisons. Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI’15. Arlington, Virginia, USA: AUAI Press, pp. 102–110.
  • 3. Babai L. (2016) Graph isomorphism in quasipolynomial time. Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. New York, NY, USA: ACM, pp. 684–697.
  • 4. Barak B., Chou C., Lei Z., Schramm T. & Sheng Y. (2019) (Nearly) efficient algorithms for the graph matching problem on correlated random graphs. Advances in Neural Information Processing Systems. 9186–9194
  • 5. Bayati M., Gerritsen M., Gleich D. F., Saberi A. & Wang Y. (2009) Algorithms for large, sparse network alignment problems. 2009 Ninth IEEE International Conference on Data Mining. Miami, FL: IEEE, pp. 705–710.
  • 6. Bollobás B. (2001) Random Graphs. Cambridge University Press.
  • 7. Bougleux S., Brun L., Carletti V., Foggia P., Gaüzère B. & Vento M. (2017) Graph edit distance as a quadratic assignment problem. Pattern Recogn. Lett., 87, 38–46.
  • 8. Cela E. (2011) The Quadratic Assignment Problem: Theory and Algorithms; Combinatorial Optimization, vol. 1. New York, NY: Springer.
  • 9. Chatterjee S. (2015) Matrix estimation by universal singular value thresholding. Ann. Stat., 43, 177–214.
  • 10. Chen L., Vogelstein J. T., Lyzinski V. & Priebe C. E. (2016) A joint graph inference case study: the C. elegans chemical and electrical connectomes. Worm, vol. 5.
  • 11. Chung F. & Lu L. (2006) Concentration inequalities and martingale inequalities: a survey. Internet Math., 3, 79–127.
  • 12. Conte D., Foggia P., Sansone C. & Vento M. (2004) Thirty years of graph matching in pattern recognition. Int. J. Pattern Recogn. Artificial Intell., 18, 265–298.
  • 13. Cullina D. & Kiyavash N. (2016) Improved achievability and converse bounds for Erdős–Rényi graph matching. ACM SIGMETRICS Performance Evaluation Review, vol. 44. New York, NY, USA: ACM, pp. 63–72.
  • 14. Cullina D. & Kiyavash N. (2017) Exact alignment recovery for correlated Erdős–Rényi graphs. arXiv preprint arXiv:1711.06783.
  • 15. Cullina D., Kiyavash N., Mittal P. & Poor H. V. (2018) Partial recovery of Erdős–Rényi graph alignment via k-core alignment. arXiv preprint arXiv:1809.03553.
  • 16. Desikan R. S., Ségonne F., Fischl B., Quinn B. T., Dickerson B. C., Blacker D., Buckner R. L., Dale A. M., Maguire R. P. & Hyman B. T. (2006) An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage, 31, 968–980.
  • 17. Ding J., Ma Z., Wu Y. & Xu J. (2018) Efficient random graph matching via degree profiles. arXiv preprint arXiv:1811.07821.
  • 18. Elmsallati A., Clark C. & Kalita J. (2016) Global alignment of protein-protein interaction networks: a survey. IEEE/ACM Trans. Comput. Biol. Bioinform., 13, 689–705.
  • 19. Emmert-Streib F., Dehmer M. & Shi Y. (2016) Fifty years of graph matching, network alignment and network comparison. Inform. Sci., 346–347, 180–197.
  • 20. Escolano F., Hancock E. R. & Lozano M. (2011) Graph matching through entropic manifold alignment. 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington, DC, USA: IEEE Computer Society, pp. 2417–2424.
  • 21. Fang F., Sussman D. L. & Lyzinski V. (2018) Tractable graph matching via soft seeding. arXiv preprint arXiv:1807.09299.
  • 22. Feldman V., Grigorescu E., Reyzin L., Vempala S. & Xiao Y. (2013) Statistical algorithms and a lower bound for detecting planted cliques. Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC’13. New York, NY, USA: ACM, pp. 655–664.
  • 23. Foggia P., Percannella G. & Vento M. (2014) Graph matching and learning in pattern recognition in the last 10 years. Int. J. Pattern Recogn. Artificial Intell., 28, 1450001.
  • 24. Frank M. & Wolfe P. (1956) An algorithm for quadratic programming. Nav. Res. Logist. Q., 3, 95–110.
  • 25. Heimann M., Shen H., Safavi T. & Koutra D. (2018) REGAL: Representation learning-based graph alignment. Proceedings of the 27th ACM International Conference on Information and Knowledge Management. New York, NY, USA: ACM, pp. 117–126.
  • 26. Hoff P. D., Raftery A. E. & Handcock M. S. (2002) Latent space approaches to social network analysis. J. Amer. Statist. Assoc., 97, 1090–1098.
  • 27. Holland P. W., Laskey K. B. & Leinhardt S. (1983) Stochastic blockmodels: First steps. Soc. Netw., 5, 109–137.
  • 28. Horn R. A. & Johnson C. R. (2012) Matrix Analysis. Cambridge, United Kingdom: Cambridge University Press.
  • 29. Kazemi E., Yartseva L. & Grossglauser M. (2015) When can two unlabeled networks be aligned under partial overlap? 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton). Washington, DC, USA: IEEE Computer Society, pp. 33–42.
  • 30. Klau G. W. (2009) A new graph-based method for pairwise global network alignment. BMC Bioinform., 10, S59.
  • 31. Landman B. A., Huang A. J., Gifford A., Vikram D. S., Lim I. A. L., Farrell J. A. D., Bogovic J. A., Hua J., Chen M. & Jarso S. (2011) Multi-parametric neuroimaging reproducibility: a 3-t resource study. Neuroimage, 54, 2854–2866.
  • 32. Le C. M., Levina E. & Vershynin R. (2017) Concentration and regularization of random graphs. Random Struct. Algor., 51, 538–561.
  • 33. Lee J., Cho M. & Lee K. M. (2010) A graph matching algorithm using data-driven markov chain monte carlo sampling. 2010 20th International Conference on Pattern Recognition (ICPR). Washington, DC, USA: IEEE Computer Society, pp. 2816–2819.
  • 34. Li L. & Campbell W. M. (2015) Matching community structure across online social networks. NIPS Workshop on Networks in the Social and Information Sciences. Montreal, Quebec, Canada.
  • 35. Lin L., Liu X. & Zhu S.-C. (2010) Layered graph matching with composite cluster sampling. IEEE Trans. Pattern Anal. Mach. Intell., 32, 1426–1442.
  • 36. Loiola E. M., de Abreu N. M. M., Boaventura-Netto P. O., Hahn P. & Querido T. (2007) A survey for the quadratic assignment problem. Eur. J. Oper. Res., 176, 657–690.
  • 37. Lyzinski V. (2018) Information recovery in shuffled graphs via graph matching. IEEE Trans. Inform. Theory, 64, 3254–3273.
  • 38. Lyzinski V., Fishkind D. E., Fiori M., Vogelstein J. T., Priebe C. E. & Sapiro G. (2016) Graph matching: relax at your own risk. IEEE Trans. Pattern Anal. Mach. Intell., 38, 60–73.
  • 39. Lyzinski V., Fishkind D. E. & Priebe C. E. (2014) Seeded graph matching for correlated Erdos–Renyi graphs. J. Mach. Learn. Res., 15, 3513–3540.
  • 40. Lyzinski V., Levin K. & Priebe C. E. (2019) On consistent vertex nomination schemes. J. Mach. Learn. Res., 20, 1–39.
  • 41. McDiarmid C. (1989) On the method of bounded differences. Surveys Combin., 141, 148–188.
  • 42. Mossel E., Neeman J. & Sly A. (2014) Belief propagation, robust reconstruction and optimal recovery of block models. Proceedings of The 27th Conference on Learning Theory, pp. 356–370.
  • 43. Onaran E., Garg S. & Erkip E. (2016) Optimal de-anonymization in random graphs with community structure. 2016 50th Asilomar Conference on Signals, Systems and Computers. Washington, DC, USA: IEEE Computer Society, pp. 709–713.
  • 44. Pedarsani P. & Grossglauser M. (2011) On the privacy of anonymized networks. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, pp. 1235–1243.
  • 45. Robles-Kelly A. & Hancock E. R. (2007) A riemannian approach to graph embedding. Pattern Recogn., 40, 1042–1056.
  • 46. Gray W. R., Bogovic J. A., Vogelstein J. T., Landman B. A., Prince J. L. & Vogelstein R. J. (2012) Magnetic resonance connectome automated pipeline: an overview. IEEE Pulse, 3, 42–48.
  • 47. Roncal W. R., Koterba Z. H., Mhembere D., Kleissas D. M., Vogelstein J. T., Burns R., Bowles A. R., Donavos D. K., Ryman S. & Jung R. E. (2013) MIGRAINE: MRI graph reliability analysis and inference for connectomics. IEEE Global Conf. Signal Inform. Process., 313–316.
  • 48. Sang J. & Xu C. (2012) Robust face-name graph matching for movie character identification. IEEE Trans. Multimedia, 14, 586–596.
  • 49. Sussman D. L., Lyzinski V., Park Y. & Priebe C. E. (2019) Matched filters for noisy induced subgraph detection. IEEE Trans. Pattern Anal. Mach. Intell., 1–1.
  • 50. Tang M., Athreya A., Sussman D. L., Lyzinski V., Park Y. & Priebe C. E. (2017) A semiparametric two-sample hypothesis testing problem for random dot product graphs. J. Comput. Graph. Statist., 26, 344–354.
  • 51. Trosset M. W., Priebe C. E., Park Y. & Miller M. I. (2008) Semisupervised learning from dissimilarity data. Comput. Statist. Data Anal., 52, 4643–4657.
  • 52. Udell M. & Townsend A. (2017) Nice latent variable models have log-rank. arXiv preprint arXiv:1705.07474.
  • 53. Umeyama S. (1988) An eigendecomposition approach to weighted graph matching problems. IEEE Trans. Pattern Anal. Mach. Intell., 10, 695–703.
  • 54. Vogelstein J. T., Conroy J. M., Lyzinski V., Podrazik L. J., Kratzer S. G., Harley E. T., Fishkind D. E., Vogelstein R. J. & Priebe C. E. (2014) Fast approximate quadratic programming for graph matching. PLoS ONE, 10(4).
  • 55. Xu J. (2017) Rates of convergence of spectral methods for graphon estimation. arXiv preprint arXiv:1709.03183.
  • 56. Yartseva L. & Grossglauser M. (2013) On the performance of percolation graph matching. Proceedings of the First ACM Conference on Online Social Networks. New York, NY, USA: ACM, pp. 119–130.
  • 57. Young S. & Scheinerman E. (2007) Random dot product graph models for social networks. Proceedings of the 5th International Conference on Algorithms and Models for the Web-Graph. Berlin, Heidelberg: Springer, pp. 138–149.
  • 58. Zaslavskiy M., Bach F. & Vert J. P. (2009) A path following algorithm for the graph matching problem. IEEE Trans. Pattern Anal. Mach. Intell., 31, 2227–2242.
  • 59. Zhang S. & Tong H. (2016) FINAL: Fast attributed network alignment. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, pp. 1345–1354.
  • 60. Zhang Y. (2018) Consistent polynomial-time unseeded graph matching for Lipschitz graphons. arXiv preprint arXiv:1807.11027.
  • 61. Zhang Y. (2018) Unseeded low-rank graph matching by transform-based unsupervised point registration. arXiv preprint arXiv:1807.04680.
  • 62. Zhou F. & De la Torre F. (2012) Factorized graph matching. 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington, DC, USA: IEEE Computer Society, pp. 127–134.
  • 63. Zhu M. & Ghodsi A. (2006) Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comput. Statist. Data Anal., 51, 918–930.

Articles from Information and Inference are provided here courtesy of Oxford University Press
