Author manuscript; available in PMC: 2016 Nov 7.
Published in final edited form as: J Am Stat Assoc. 2015 Nov 7;110(511):1047–1056. doi: 10.1080/01621459.2015.1008697

Testing and Modeling Dependencies Between a Network and Nodal Attributes

Bailey K Fosdick 1, Peter D Hoff 2
PMCID: PMC4734763  NIHMSID: NIHMS657442  PMID: 26848204

Abstract

Network analysis is often focused on characterizing the dependencies between network relations and node-level attributes. Potential relationships are typically explored by modeling the network as a function of the nodal attributes or by modeling the attributes as a function of the network. These methods require specification of the exact nature of the association between the network and attributes, reduce the network data to a small number of summary statistics, and are unable to provide predictions simultaneously for missing attribute and network information. Existing methods that model the attributes and network jointly also assume the data are fully observed. In this article we introduce a unified approach to analysis that addresses these shortcomings. We use a previously developed latent variable model to obtain a low dimensional representation of the network in terms of node-specific network factors. We introduce a novel testing procedure to determine if dependencies exist between the network factors and attributes as a surrogate for a test of dependence between the network and attributes. We also present a joint model for the network relations and attributes, for use if the hypothesis of independence is rejected, that can capture a variety of dependence patterns and be used to make inference and predictions for missing observations.

Keywords: hypothesis test, joint model, latent variable model, prediction, relational data

1 Introduction

A common goal in the analysis of network data is to characterize the dependence between network relations and a set of node-specific attributes. For example, in recent years many studies in the social sciences have examined the relationship between individuals’ friendship networks and their health measures, such as happiness, smoking and drinking behavior, and obesity (e.g. Veenstra et al. (2013) and references therein). Similarly in the biological sciences, scientists are interested in the relationship between how proteins interact and their biological importance (see Butland et al. (2005) for example). In each of these applications, the data consist of two parts: the network relations {yi,j : i, j ∈ {1, …, n}} representing a measure of the directed relationship between each pair of nodes i and j, and p-variate nodal attributes {xi : i ∈ {1, …, n}}. In the case of a social network, the nodes, network relations Y = {yi,j}, and attributes X = {xi} often represent people, their friendships, and their demographic and behavioral characteristics, respectively.

Traditional approaches to describing the dependence between a network and attributes rely on statistical methods that model either the network conditional on the attributes, {Y|X}, or the attributes conditional on the network, {X|Y}. In the social sciences, this first perspective parallels the theory of “social selection”, whereby individuals’ attributes influence the formation of their social relations, and the second perspective is motivated by “social influence” theory whereby individuals’ relations affect their attributes.

Methods that model the network as a function of the attributes, {Y|X}, commonly specify a regression framework for the dependence: the probability of the relation yi,j is a function of βTxi,j where xi,j = f(xi, xj), xi and xj are the attributes for nodes i and j, and β is an unknown parameter vector. The covariate vector xi,j typically includes terms for each attribute of the sender node i and receiver node j, as well as interaction terms measuring the similarity between the sender and receiver attributes. These interaction terms are frequently specified as the absolute difference between the attributes, an indicator of whether an attribute is the same for both the sender and receiver node (in the case of discrete attributes), or the product of the nodes’ attributes. Examples of network models that can accommodate such a regression term are exponentially parameterized random graph models (ERGM) (Frank and Strauss (1986), Wasserman and Pattison (1996), Snijders et al. (2006), Hunter and Handcock (2006)) and latent variable models (Hoff et al. (2002), Hoff (2005)). This latter class of models regresses a function of the network on both attribute terms and node-specific latent variables; Austin et al. (2013) proposed a slight modification to this model class where the network is expressed as a function of nodal latent variables and the latent variables themselves are regressed on the attributes.
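
The construction of the dyadic covariate vector xi,j = f(xi, xj) described above can be sketched as follows. This is only an illustration for a single numeric attribute: the function name `dyadic_covariates` and the coefficient values in `beta` are hypothetical and do not come from any of the cited models.

```python
import numpy as np

def dyadic_covariates(x):
    """Build x_{i,j} = f(x_i, x_j) for one numeric attribute x of length n.

    Returns an (n, n, 4) array holding, for each ordered pair (i, j):
    sender value, receiver value, absolute difference, and product.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    sender = np.repeat(x[:, None], n, axis=1)     # x_i, constant along row i
    receiver = np.repeat(x[None, :], n, axis=0)   # x_j, constant along column j
    return np.stack(
        [sender, receiver, np.abs(sender - receiver), sender * receiver], axis=-1
    )

x = np.array([1.0, 3.0, 6.0])
X_dyad = dyadic_covariates(x)
beta = np.array([0.5, -0.2, -1.0, 0.1])   # hypothetical coefficient vector
linear_predictor = X_dyad @ beta           # (n, n) matrix of beta^T x_{i,j}
```

In a regression-type network model, a link function would then map each entry of `linear_predictor` to the probability or mean of yi,j.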

Methods for assessing the impact of the network on nodal attributes, {X|Y}, often regress each node’s attributes on the attributes of other nodes in the network according to the network relations. For example, Christakis and Fowler (2007) use a logistic regression model to estimate the degree to which an individual’s obesity status can be explained by the obesity status of individuals in their social network (children, neighbors, spouse, etc.) (see Cohen-Cole and Fletcher (2008) and Lyons (2011) for additional discussion of these methods). Other similar models include the auto-regressive network effects models of Erbring and Young (1979) and Marsden and Friedkin (1993) and the p* social influence models of Robins et al. (2001). All of these models are univariate, focusing on a single attribute of interest that is possibly subject to social influence.

While modeling the network and attributes as functions of one another is able to provide some insight into their dependence structure, there are two primary drawbacks to utilizing these methods for analysis. First, neither modeling framework allows for simultaneous inference about the dependencies between and among the network relations and attributes. For example, when analyzing data on an adolescent friendship network and individuals’ health behaviors there may be interest in whether smoking habits and obesity status are conditionally independent given the network. Addressing this question of dependence between attributes conditional on the network is impossible using either of the conditional modeling frameworks. A second limitation of these methods is that they are unable to accommodate, and provide predictions for, datasets that have both missing network and attribute information. In the conditional modeling frameworks either the network or attributes are assumed to be fully observed.

Fellows and Handcock (2012) proposed a new class of models for jointly modeling a network and attributes, called exponential-family random network models, which combine an ERGM and a Gibbs random field. This joint model addresses the first limitation of the conditional models and could potentially (with modification) address the second limitation of imputing missing data. However, these models, like ERGMs, require careful specification to avoid model degeneracy issues (Handcock (2003), Schweinberger (2011)) and require explicit elicitation of the possible relationships between the network and attributes. Kim and Leskovec (2011) and Kim and Leskovec (2012) proposed a simple joint attribute and network model for which mathematical analysis on network connectivity and degree distributions is tractable. A limitation of this model class is that it only accommodates categorical attributes and assumes no missing attribute or network information. Both of these existing joint modeling frameworks lack traditional procedures for testing whether the joint model is appropriate, i.e., if dependencies even exist between the network and attributes.

In this article, we propose a unified approach to the analysis of network and attribute data. Our approach builds off existing network models and inference procedures and makes two key contributions to statistical network methodology. The first contribution is a testing strategy for determining whether dependencies exist between the network and attributes. The second contribution, for use when the test rejects independence, is a joint modeling framework for the network and attributes that allows for predictions for missing information and inference on the dependencies between and within the network and attributes. Neither of these objectives can be addressed with existing network analysis methods. Our proposed methodology can be summarized as follows. Investigating the dependence between network data Y and attribute data X is difficult since network data are often high dimensional, containing relational information on each pair of nodes. For this reason, in Section 2 we propose using a previously developed network model to represent the (n × n) matrix of network relations Y with a low-dimensional structure defined by an (n × r) matrix N of node-specific network factors (r ≪ n). These network factors N are not observed directly and hence are estimated from the observed network Y using a network model.

In Section 3 we introduce a novel testing strategy for evaluating whether dependencies exist between the network Y and attributes X by formally testing for correlation between the estimated network factors N and the attributes X. A conceptual representation of this testing framework is shown in Figure 1. If the network is independent of the attributes, then any functions of the network, specifically the network factors, are also independent of the attributes. A primary advantage of this approach is that the overall relationship between the network and an arbitrary number of attributes can be assessed without needing to construct a complex regression model of the network relations on the attributes or perform variable selection. In Section 4 we show via simulation that there is little loss in power for the test due to not observing the network factors directly.

Figure 1. The primary patterns in the network Y are represented by r node-specific factors N. To determine if dependencies exist between the network Y and the p nodal attributes X, we propose testing for a relationship between the network factors N and attributes X.

In Section 5, we propose a novel joint model for the network and attributes, for use when the hypothesis of independence between the network factors and attributes is rejected, that is able to capture dependence such as that detected by the test. This joint model allows for simultaneous estimation and inference on the dependence between and within the network and attributes, as well as provides methodology for handling and predicting missing network and attribute data. We show that the joint model conditional on the attributes can be viewed as a reduced rank regression of the network relations on attribute interactions. This further motivates the joint model as a mechanism for parsimoniously characterizing attribute and network dependence. In Section 6 our proposed methodology is used to analyze data from the National Longitudinal Study of Adolescent Health. In a cross validation experiment, we demonstrate that predictions of missing attribute data can be improved by basing imputations on both observed network and attribute information rather than attribute data alone. We conclude with a discussion in Section 7.

2 Calculation of node-specific network factors

The latent space network models in Hoff et al. (2002) and latent variable models in Hoff (2005) and Hoff (2009) provide parsimonious representations of the patterns in a network using node-specific latent factors. These models have been shown to capture a variety of network dependence patterns such as homophily, transitivity, reciprocity, and heterogeneity in node sociability and popularity. We consider an extension of the model presented in Hoff (2009) that contains additive and multiplicative latent effects, as well as structure for within dyad correlation. This model is a mosaic composed of common network effects well established in the literature. Ultimately, we use this model to obtain a low-dimensional representation of the network in terms of interpretable node-specific factors. We describe the model for continuous network data, and at the end of this section we briefly discuss how these methods can be extended to model ordinal or binary relations.

Let yi,j represent a continuous measure of the directed relation from node i to node j. For example, in an adolescent friendship network, relation yi,j may represent the aggregate call time of phone calls initiated by person i to person j in the last month. We consider the following model:

$$y_{i,j} = \mu + a_i + b_j + u_i^T \upsilon_j + e_{i,j}, \qquad a_i, b_j \in \mathbb{R}, \quad u_i, \upsilon_j \in \mathbb{R}^k. \tag{1}$$

The overall mean relation is represented by μ and the random error by ei,j. The additive sender effect ai and receiver effect bj are often interpreted as a measure of node i’s sociability (i.e. outgoingness) and node j’s popularity respectively. The multiplicative interaction effect uiTυj can capture higher order dependence, such as network transitivity, balance, and clustering (Hoff (2005)). One interpretation of these effects stems from the concept of an underlying social space (McFarland and Brown (1973), Faust (1988)), whereby nodes that are close to one another in the underlying space exhibit similar network patterns. In this context, the node-specific sender factors ui and receiver factors υi can be interpreted as k-dimensional representations of the underlying outgoing (sending) and incoming (receiving) behaviors of node i. A similar interpretation was used to motivate the latent position models in Hoff et al. (2002).

The random errors are modeled as Gaussian, independent across dyads, and correlated within a dyad:

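$$(e_{i,j}, e_{j,i})^T \overset{\text{iid}}{\sim} \text{normal}_2\left(0, \; \sigma_e^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \tag{2}$$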

The additive and multiplicative node-specific factors are also modeled as Gaussian and independent across nodes:

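$$(a_i, b_i, u_i^T, \upsilon_i^T)^T \overset{\text{iid}}{\sim} \text{normal}_{2+2k}(0, \Sigma_{abu\upsilon}), \qquad \Sigma_{abu\upsilon} = \begin{pmatrix} \Sigma_{ab} & \Sigma_{ab,u\upsilon} \\ \Sigma_{u\upsilon,ab} & \Sigma_{u\upsilon} \end{pmatrix}. \tag{3}$$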

The within dyad correlation ρ is interpreted as a measure of network relation reciprocity and together with the additive effects ai and bj induces the covariance structure from the social relations model (Warner et al. (1979), Wong (1982)).
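
To make the generative structure of (1)-(3) concrete, the following is a minimal simulation sketch. It is not the estimation procedure used in the article. The function name `simulate_network` and all parameter values are illustrative assumptions, and diagonal entries yi,i are generated for convenience even though self-relations would normally be ignored.

```python
import numpy as np

def simulate_network(n, k, mu, sigma_e2, rho, Sigma_abuv, rng):
    """Draw Y from model (1)-(3); returns Y and the factors a, b, U, V."""
    # Node-specific factors (a_i, b_i, u_i, v_i) ~ normal_{2+2k}(0, Sigma_abuv).
    F = rng.multivariate_normal(np.zeros(2 + 2 * k), Sigma_abuv, size=n)
    a, b, U, V = F[:, 0], F[:, 1], F[:, 2:2 + k], F[:, 2 + k:]
    # Errors: e_ij = c1 * z_ij + c2 * z_ji gives Var = sigma_e2 and
    # within-dyad correlation rho, with independence across dyads.
    Z = rng.standard_normal((n, n))
    c1 = (np.sqrt(1 + rho) + np.sqrt(1 - rho)) / 2
    c2 = (np.sqrt(1 + rho) - np.sqrt(1 - rho)) / 2
    E = np.sqrt(sigma_e2) * (c1 * Z + c2 * Z.T)
    # Matrix form of (1): mean + sender effects + receiver effects + U V^T + error.
    Y = mu + a[:, None] + b[None, :] + U @ V.T + E
    return Y, a, b, U, V

rng = np.random.default_rng(0)
k = 2
Sigma = np.eye(2 + 2 * k)   # illustrative choice: independent unit-variance factors
Y, a, b, U, V = simulate_network(30, k, 0.0, 1.0, 0.5, Sigma, rng)
```

Setting off-diagonal blocks of `Sigma` to nonzero values corresponds to allowing dependence between the additive and multiplicative effects.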

A Bayesian estimation procedure for this network model has been implemented in the ‘amen’ package in the open source computing software program R; however the implementation restricts Σab,uυ = 0. Under this restriction the model can capture third-order dependence patterns between relation “cycles”, such as {yi,j, yj,k, yk,i} or {yi,j, yj,k, yi,k} where the edges create a closed loop (ignoring direction), but not between noncyclic relation triples such as {yi,j, yj,i, yk,i}. By allowing the additive and multiplicative effects to be dependent (i.e. Σab,uυ ≠ 0) as in (3), the model is able to capture a larger class of dependencies. Specifically, it can capture correlation among sets of relation triples where each relation in the set shares at least one node with another relation in the set (i.e. dependence between {yi,j, yj,k, yk,l}, but not between {yi,j, yj,k, ym,l}). One justification for allowing such dependence is that latent factors that act additively, affecting node popularity and sociability, plausibly also impact the network in a multiplicative manner. A modified version of the ‘amen’ R package that supports Bayesian parameter estimation for the network model presented here is available at the corresponding author’s website.

Motivation via the singular value decomposition

A key strength of the network model in (1) is its ability to capture a variety of common network phenomena; however, an alternative motivation for the model stems from its relationship to the singular value decomposition (SVD) (see Hoff (2009) and Azari and Airoldi (2012) for discussion of similar motivations for related models). The singular value decomposition is a matrix factorization that is commonly used to obtain an approximation of a matrix M by another matrix which is of reduced rank and contains the main patterns of the original matrix M. The SVD-based approximation is the optimal matrix approximation of its rank with respect to squared error loss. Here we show that the model in (1) is similar to an SVD-based approximation of the network Y, and hence can be viewed as a low dimensional representation of the network that captures the dominant patterns in the relations.

The network model in (1) can be written in matrix form as Y = M + E, where

$$M = \mu 1_n 1_n^T + a 1_n^T + 1_n b^T + U V^T, \tag{4}$$

1n is an (n × 1) vector whose elements all equal one, a and b are (n × 1) vectors of the additive sender and receiver factors, U and V are (n × k) matrices of multiplicative factors, and E is an (n × n) matrix of errors.

The singular value decomposition of an arbitrary (n × n) matrix Y is written $Y = ACB^T$, where A and B are orthogonal (n × n) matrices and C is an (n × n) diagonal matrix with non-negative decreasing diagonal elements. The rank-k matrix that best approximates Y based on squared-error loss is given by $\hat{M} = \hat{A}\hat{C}\hat{B}^T$ where $\hat{A} = A[, 1:k]$, $\hat{C} = C[1:k, 1:k]$ and $\hat{B} = B[, 1:k]$. Absorbing $\hat{C}$ into $\hat{A}$ and/or $\hat{B}$, the best rank-k approximation is written $\hat{M} = \breve{A}\breve{B}^T$. Letting $\mu_A, \mu_B \in \mathbb{R}^k$ contain the column means of $\breve{A}$ and $\breve{B}$ respectively, $\hat{M}$ can be expressed:

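$$\hat{M} = \mu_{AB} 1_n 1_n^T + \tilde{a} 1_n^T + 1_n \tilde{b}^T + \tilde{A}\tilde{B}^T, \tag{5}$$
$$\tilde{A} = (\breve{A} - 1_n \mu_A^T), \quad \tilde{a} = \tilde{A}\mu_B, \quad \mu_{AB} = \mu_A^T \mu_B, \quad \tilde{B} = (\breve{B} - 1_n \mu_B^T), \quad \tilde{b} = \tilde{B}\mu_A,$$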

where $\tilde{A}$ and $\tilde{B}$ represent multiplicative factors with mean-zero columns, $\tilde{a}$ and $\tilde{b}$ represent mean-zero row and column factors, and $\mu_{AB}$ is an overall mean.

Observe that the representation in (5) resembles that in (4) for the network model. This illustrates the additive and multiplicative effects structure in the network model is similar to a rank-k matrix approximation of the network. Note that in the decomposition in (5), there is functional dependence between the overall mean, additive, and multiplicative effects. Since there are no such restrictions in the network model, the latent network effects represent a slightly larger class of approximations than that given by the set of matrices with rank-k.
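
The additive-plus-multiplicative decomposition of the rank-k SVD approximation can be checked numerically. The sketch below uses a random matrix in place of a real network and makes one arbitrary choice that is not specified in the text: the singular values are split evenly (as square roots) between the left and right factors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
Y = rng.standard_normal((n, n))             # stand-in for a network matrix

A, c, Bt = np.linalg.svd(Y)                 # Y = A @ diag(c) @ Bt
A_breve = A[:, :k] * np.sqrt(c[:k])         # absorb sqrt of singular values into left factors
B_breve = Bt[:k, :].T * np.sqrt(c[:k])      # and into right factors
M_hat = A_breve @ B_breve.T                 # best rank-k approximation of Y

ones = np.ones(n)
mu_A, mu_B = A_breve.mean(axis=0), B_breve.mean(axis=0)   # column means
A_tilde = A_breve - np.outer(ones, mu_A)    # mean-zero multiplicative factors
B_tilde = B_breve - np.outer(ones, mu_B)
a_tilde, b_tilde = A_tilde @ mu_B, B_tilde @ mu_A          # additive row/column factors
mu_AB = mu_A @ mu_B                         # overall mean

# Reassemble the four terms: overall mean, row effects, column effects, interaction.
M_from_terms = (mu_AB * np.outer(ones, ones) + np.outer(a_tilde, ones)
                + np.outer(ones, b_tilde) + A_tilde @ B_tilde.T)
```

Here `a_tilde`, `b_tilde`, and `mu_AB` are all deterministic functions of the rank-k factors, which is the functional dependence that the network model does not impose.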

Non-identifiability of multiplicative factors

From the representation in (4), it is evident that the multiplicative network effects (U, V) individually are non-identifiable: the probability model for Y in (1) is the same with multiplicative latent factors (U, V) as it is with factors $(UA^T, VA^{-1})$ for any nonsingular (k × k) matrix A. In other words, all multiplicative factors which satisfy $\{(UA^T, VA^{-1}) : A \; (k \times k) \text{ nonsingular}\}$ are equivalent with respect to the probability model for Y. In the absence of prior information which prefers one parameter over another within the same equivalence class, inference for the relative likeliness of parameter values should be conducted over the equivalence classes. Performing inference over the equivalence classes of (U, V) is equivalent to performing inference on the posterior distribution of $UV^T$. This issue is discussed further in Sections 3 and 5.
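
A small numerical illustration of this non-identifiability follows. The matrices are arbitrary random draws chosen only for demonstration; adding a multiple of the identity to A is just a convenient way to obtain a well-conditioned nonsingular matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))
A = rng.standard_normal((k, k)) + 3 * np.eye(k)   # arbitrary nonsingular (k x k) matrix

U2 = U @ A.T                   # transformed sender factors
V2 = V @ np.linalg.inv(A)      # transformed receiver factors

# The individual factors change, but the multiplicative effect does not.
factors_differ = not np.allclose(U, U2)
same_product = np.allclose(U @ V.T, U2 @ V2.T)
```

Because only the product enters the model for Y, the data cannot distinguish (U, V) from (U2, V2).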

Non-continuous network measures

Observed network information is often not continuous. For example it is common for network data to be binary where yi,j is an indicator of whether the relation between nodes i and j exceeds some threshold, or ordinal where yi,j represents, for instance, the relative rank of node j from the perspective of node i. To model non-continuous relations, the network model in (1) can be generalized by modeling yi,j = ℓ(zi,j) where zi,j is a continuous measure of the pairwise relation and ℓ is a function defining the relationship between zi,j and yi,j. The latent continuous network measure zi,j is then modeled using the network model in (1) in place of yi,j. In the case of binary data, a thresholding function may be appropriate (corresponding to the probit model), and in the ordinal case the ordered probit can be considered. Hoff et al. (2013) discusses additional ℓ functions which account for censoring of binary and ordinal relations when nodes are restricted on the number of relations they can send (i.e. the number of non-zero relations in a row of Y). We employ one of these link functions in Section 6 to analyze fixed rank nomination data from the National Longitudinal Study of Adolescent Health using the proposed testing and joint modeling methodology introduced here.
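
Two simple choices of the function ℓ can be sketched as below. These are illustrative versions of a thresholding link and a within-row ranking link, written under the assumption of no ties. They are not the censored fixed rank nomination functions of Hoff et al. (2013), and the function names are hypothetical.

```python
import numpy as np

def binary_link(Z, threshold=0.0):
    """y_ij = 1 if z_ij exceeds the threshold, else 0."""
    return (np.asarray(Z) > threshold).astype(int)

def row_rank_link(Z):
    """y_ij = rank of z_ij among node i's outgoing relations (1 = smallest).

    Diagonal entries are excluded from the ranking and returned as 0.
    """
    Zm = np.asarray(Z, dtype=float).copy()
    np.fill_diagonal(Zm, -np.inf)                 # push self-relations to the bottom
    ranks = Zm.argsort(axis=1).argsort(axis=1)    # 0 on the diagonal, 1..n-1 elsewhere
    return ranks

Z = np.array([[0.0, 2.0, 1.0],
              [5.0, 0.0, -1.0],
              [0.3, 0.1, 0.0]])
Y_binary = binary_link(Z)
Y_ordinal = row_rank_link(Z)
```

In both cases the latent matrix Z, rather than the observed Y, would be given the additive and multiplicative effects structure.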

3 Testing for dependencies

The goals in an analysis of network and attribute data are often threefold: 1) to determine whether dependencies exist between the network and attributes, 2) to model and estimate these dependencies, and finally 3) to make inference and possibly make predictions for missing data. The first step in any such analysis is to formally test for dependencies between the network and attributes. In this section, we describe a novel, simple approach to performing such a test.

A classical approach to determining whether there is an association between the nodal attributes Xn×p and network relations Yn×n would be to test whether dependencies exist between X and the rows of Y or between X and the columns of Y. This would involve hypothesizing that each attribute is uncorrelated with each node’s outgoing relations (H0: Cov(X[, i], Y [j,]) = 0 for all i,j) or incoming relations (H0: Cov(X[, i], Y [,j]) = 0 for all i,j) and investigating the evidence against these claims. However, conventional multivariate analysis tests are not applicable to these problems since these tests address relationships between p + n variables based on n observations.

We propose an alternative testing approach using the estimated latent network factors Nn×(2k+2) = [a, b, U, V] from the network model in (1). The nodal attributes X = [x1, …, xn]T are independent of the network Y if and only if the attributes are independent of any function of the network. As described in Section 2, the network factors N provide a simplified representation of the network. Thus, we propose testing for dependence between the latent network factors N and attributes X on the basis that rejecting such a test would imply dependence between the network Y and attributes X (see Figure 1). However, the latent network factors N are not observed so in practice they must be estimated from the observed network Y. In this section we present a test for dependence between the estimated network factors and attributes that borrows from classic multivariate testing theory, discuss invariances in the test, and describe an exact likelihood ratio testing procedure. We also discuss alternative interpretations of the test that do not involve distributional assumptions on both the latent factors and attributes.

Suppose the nodal attributes {xi : i ∈ {1, …, n}} are continuous and mean-zero, and let $n_i = (a_i, b_i, u_i^T, \upsilon_i^T)^T$ denote the (estimated) latent network factors for node i. We propose testing for linear dependence between the network factors and attributes using a classical multivariate test based on the assumption that the network factors and attributes are samples from a multivariate normal distribution:

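$$(x_i^T, a_i, b_i, u_i^T, \upsilon_i^T)^T = (x_i^T, n_i^T)^T \overset{\text{iid}}{\sim} \text{normal}_{p+2+2k}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \; \Sigma_{XN} = \begin{pmatrix} \Sigma_X & \Sigma_{X,N} \\ \Sigma_{N,X} & \Sigma_N \end{pmatrix} \right). \tag{6}$$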

The null and alternative hypotheses for this test are

$$H_0: \Sigma_{X,N} = 0 \quad \text{vs.} \quad H_1: \Sigma_{X,N} \neq 0 \quad \text{based on (6)}. \tag{7}$$

Network model and test invariances

As mentioned in Section 2, the network model in (1) is invariant under transformations of the multiplicative latent factors. Formally, this non-identifiability can be expressed as an invariance of the probability model under transformations of network factors {ni : i ∈ {1, …, n}} by elements of group

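$$\mathcal{G}_N = \left\{ G_N = \begin{pmatrix} I_2 & 0 & 0 \\ 0 & A^T & 0 \\ 0 & 0 & A^{-1} \end{pmatrix} : A_{k \times k} \text{ nonsingular} \right\},$$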

which act via multiplication on the left: $n_i = (a_i, b_i, u_i^T, \upsilon_i^T)^T \mapsto G_N n_i = (a_i, b_i, (A^T u_i)^T, (A^{-1}\upsilon_i)^T)^T$. It would be undesirable for the test in (7) to depend on which latent factor estimates in the set

$$\{\{G_N n_i : i \in \{1, \ldots, n\}\} : G_N \in \mathcal{G}_N\} \tag{8}$$

are selected to represent the network. As discussed in Section 2, the probability of the network Y is the same for all latent factor estimates in this set, and hence all elements within a set are equivalent based on the likelihood. Define 𝒢 to be the extension of group 𝒢N to transformations of $(x_i^T, n_i^T)^T$:

$$\mathcal{G} = \left\{ G = \begin{pmatrix} I_p & 0 \\ 0 & G_N \end{pmatrix} : G_N \in \mathcal{G}_N \right\},$$

which acts via left multiplication and leaves xi unchanged. We define 𝒢 in order to relate the invariance in the network model parameterization to the test in (7).

The testing problem in (7) is itself invariant under left multiplication of $(x_i^T, n_i^T)^T$ by elements in the group ℱ, where ℱ is defined

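$$\mathcal{F} = \left\{ F = \begin{pmatrix} B^X & 0 \\ 0 & B^N \end{pmatrix} : B^X_{p \times p}, \; B^N_{(2k+2) \times (2k+2)} \text{ nonsingular} \right\}.$$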

An ℱ-invariant test is a test for (7) that produces the same results for all attributes and network factors that are equivalent under group ℱ. Observe that 𝒢 is a subgroup of ℱ. This implies that an ℱ-invariant test will also respect the 𝒢-invariances in the network relations probability model. In other words, all latent network factor estimates in the same equivalence set given by (8) will generate the same test results for (7) under an ℱ-invariant test.

Likelihood ratio test

There is no uniformly most powerful invariant test for (7); however, the likelihood ratio test is ℱ-invariant, unbiased (Perlman and Olkin (1980)), and generally performs well. Let N = [a, b, U, V] be the (n × (2k + 2)) matrix of network factors. The likelihood ratio test statistic can be written

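$$\Lambda = \frac{\max_{\Sigma_{XN}} L(\Sigma_{XN} \mid N, X)}{\max_{\Sigma_X, \Sigma_N} L_0(\Sigma_X, \Sigma_N \mid N, X)} = \prod_{i=1}^{p \wedge (2k+2)} (1 - r_i^2)^{-n/2} \tag{9}$$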

where L0 and L refer to the likelihood corresponding to the multivariate normal model in (6) with and without restricting ΣN,X = 0. The term $r_i^2$ is the ith eigenvalue of

$$(X^T X)^{-1/2} (X^T N) (N^T N)^{-1} (N^T X) (X^T X)^{-1/2},$$

and its positive square root is commonly referred to as the ith canonical correlation between N and X. This correlation represents the largest correlation obtainable between a linear combination of attributes and a linear combination of the network factors such that the linear combinations are uncorrelated with the respective combinations used to obtain the first i − 1 correlations. The test based on (9) rejects the null hypothesis for large values of Λ and was shown to have monotonically increasing power as a function of each population canonical correlation (Anderson and Gupta (1964)).
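
The squared canonical correlations and the resulting statistic can be computed directly from X and N. The sketch below assumes mean-zero attributes and factors, as in (6), and works with the transformed statistic ∏(1 − ri²), which is a monotone decreasing function of Λ. The function names are illustrative, and the data are random draws rather than estimated network factors. The usage lines also check numerically that the statistic is unchanged under nonsingular linear transformations of the attributes and of the factors.

```python
import numpy as np

def squared_canonical_correlations(X, N):
    """Leading eigenvalues r_i^2 of (X'X)^{-1/2} X'N (N'N)^{-1} N'X (X'X)^{-1/2}."""
    w, Q = np.linalg.eigh(X.T @ X)
    XtX_inv_half = Q @ np.diag(w ** -0.5) @ Q.T        # symmetric inverse square root
    M = XtX_inv_half @ X.T @ N @ np.linalg.solve(N.T @ N, N.T @ X) @ XtX_inv_half
    r2 = np.linalg.eigvalsh((M + M.T) / 2)[::-1]       # sorted largest first
    m = min(X.shape[1], N.shape[1])                    # at most min(p, 2k+2) are nonzero
    return np.clip(r2[:m], 0.0, 1.0)

def wilks_statistic(X, N):
    """prod_i (1 - r_i^2); small values correspond to large values of Lambda."""
    return float(np.prod(1.0 - squared_canonical_correlations(X, N)))

rng = np.random.default_rng(2)
n, p, k = 40, 2, 1
X = rng.standard_normal((n, p))
N = rng.standard_normal((n, 2 + 2 * k))
W = wilks_statistic(X, N)

# Invariance check: x_i -> B_X x_i and n_i -> B_N n_i with arbitrary nonsingular matrices.
B_X = rng.standard_normal((p, p)) + 3 * np.eye(p)
B_N = rng.standard_normal((2 + 2 * k, 2 + 2 * k)) + 5 * np.eye(2 + 2 * k)
W_transformed = wilks_statistic(X @ B_X.T, N @ B_N.T)
```

Since transformations of the multiplicative factors by A are a special case of `B_N`, any equivalent set of factor estimates yields the same value of the statistic.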

Under the null hypothesis, $W = \Lambda^{-2/n}$ has a Wilks’ Lambda U(p, 2k + 2, n − (2k + 2)) distribution, which is equivalent to the product of independent, Beta distributed random variables (Muirhead (1982)):

$$W \sim U(p, 2k+2, n-(2k+2)) = \prod_{i=1}^{p} \text{Beta}\left( \frac{n - (2k+2) - p + i}{2}, \; \frac{2k+2}{2} \right).$$

The α-quantiles for this distribution can be obtained via Monte Carlo estimation and used to perform exact level-α tests for (7).
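
A minimal sketch of the Monte Carlo quantile calculation follows. It draws from the product-of-Betas representation of the null distribution of W. Because large Λ corresponds to small W = Λ^(−2/n), the level-α test rejects when the observed W falls below the α-quantile. The function names, number of draws, and seed are illustrative assumptions.

```python
import numpy as np

def wilks_null_draws(n, p, k, size, rng):
    """Draws from U(p, 2k+2, n-(2k+2)) as a product of p independent Betas."""
    q = 2 * k + 2
    i = np.arange(1, p + 1)
    shape_a = (n - q - p + i) / 2.0                       # first Beta parameter, one per i
    draws = rng.beta(shape_a, q / 2.0, size=(size, p))    # broadcasts over the p columns
    return draws.prod(axis=1)

def wilks_critical_value(n, p, k, alpha=0.05, size=100_000, seed=0):
    """Monte Carlo alpha-quantile of W under H0; reject H0 when observed W is smaller."""
    rng = np.random.default_rng(seed)
    return float(np.quantile(wilks_null_draws(n, p, k, size, rng), alpha))

crit_50 = wilks_critical_value(n=50, p=1, k=1)
crit_100 = wilks_critical_value(n=100, p=1, k=1)
```

The critical value moves toward one as n grows, reflecting that under independence the sample canonical correlations shrink toward zero.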

Alternative interpretation of the test

The test in (7) was derived as the likelihood ratio test for a model where both the network factors and attributes are samples from a normal distribution. However in some cases these assumptions may not be appropriate. Fortunately, alternative interpretations of the test exist that do not rely on such assumptions. The likelihood ratio test in (9) for the test in (7) is the same as the likelihood ratio test to determine whether the coefficients in a linear regression are nonzero, where either the network factors are regressed on the attributes or the attributes are regressed on the network factors. These conditional tests can be expressed

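$$H_0: \beta_{X|N} = 0 \quad \text{vs.} \quad H_1: \beta_{X|N} \neq 0 \quad \text{based on} \quad x_i \mid n_i \overset{\text{iid}}{\sim} \text{normal}(\beta_{X|N} n_i, \Sigma_{X|N}), \quad \text{and} \tag{10}$$
$$H_0: \beta_{N|X} = 0 \quad \text{vs.} \quad H_1: \beta_{N|X} \neq 0 \quad \text{based on} \quad n_i \mid x_i \overset{\text{iid}}{\sim} \text{normal}(\beta_{N|X} x_i, \Sigma_{N|X}), \tag{11}$$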

where βX|N and βN|X are (p × (2 + 2k)) and ((2 + 2k) × p) matrices, respectively. If the nodal attributes were specified as part of the study design or are binary or ordinal, it may be inappropriate to model them as Gaussian as is done in (6). Instead it may be preferable to test for dependence between the attributes and network factors via the conditional formulation in (11), where no distributional assumptions are placed on X. The likelihood ratio tests for (7), (10) and (11) are identical, so the testing framework presented is appropriate if the assumption of normality is reasonable for one or both of the network factors and attributes.
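
The equivalence of the two conditional formulations can be illustrated numerically. The sketch below assumes mean-zero data and regressions without an intercept, and computes the ratio of the determinant of the residual sum-of-squares matrix to that of the total sum-of-squares matrix for each regression. The data and the helper name `wilks_from_regression` are illustrative.

```python
import numpy as np

def wilks_from_regression(Y, Z):
    """det(residual SSCP) / det(total SSCP) for the no-intercept regression of Y on Z."""
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ coef
    return float(np.linalg.det(resid.T @ resid) / np.linalg.det(Y.T @ Y))

rng = np.random.default_rng(3)
n, p, q = 60, 2, 4
N = rng.standard_normal((n, q))                      # stand-in for network factors
X = 0.5 * N[:, :p] + rng.standard_normal((n, p))     # attributes correlated with the factors

W_x_given_n = wilks_from_regression(X, N)   # attributes regressed on factors, as in (10)
W_n_given_x = wilks_from_regression(N, X)   # factors regressed on attributes, as in (11)
```

Both ratios equal the product of (1 − ri²) over the canonical correlations, so the two regressions lead to the same test.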

4 Simulation study

To analyze data with the test outlined in Section 3, the network latent factors N must be estimated from the observed network Y. We expect this to result in a decrease in the power of the test in (7) compared to if the network factors were able to be observed directly. Furthermore, we expect a greater decrease in power when the observed network relations are less informative (i.e. binary rather than continuous). In this section we present a simulation study that quantifies the degree to which power is lost when the network factors are not observed and must be estimated from observed network relations.

Consider the network model in (1) with one multiplicative effect (k = 1), zero mean (μ = 0), and independent standard normal errors (ρ = 0, $\sigma_e^2 = 1$):

$$y_{i,j} = a_i + b_j + u_i \upsilon_j + e_{i,j}, \qquad a_i, b_j, u_i, \upsilon_j, e_{i,j} \sim \text{normal}(0, 1). \tag{12}$$

We consider the case where one nodal attribute is of interest (p = 1) and the attribute and latent network factors have one of the following covariance structures:

  A. Cov[(xi, ai, bi, ui, υi)] = ΣXN = I + γEx,a,

  B. Cov[(xi, ai, bi, ui, υi)] = ΣXN = I + γEx,u.

Ex,a is the (5 × 5) matrix of zeros with a one in the entries corresponding to Cov[x, a] and Cov[a, x], and Ex,u is defined analogously. In scenario A the attribute and each network factor are uncorrelated, except the additive sender factor ai and the attribute xi, which have correlation γ. Similarly, in scenario B correlation γ exists between the sender multiplicative factor ui and the attribute xi.
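
The two covariance structures can be built and sampled from as in the following sketch. The ordering (x, a, b, u, υ) follows the text, while the function names and the value of γ are illustrative.

```python
import numpy as np

LABELS = ["x", "a", "b", "u", "v"]   # ordering of (x_i, a_i, b_i, u_i, v_i)

def scenario_covariance(gamma, partner):
    """Sigma_XN = I + gamma * E_{x,partner}; partner "a" and "u" give the two scenarios."""
    j = LABELS.index(partner)
    Sigma = np.eye(5)
    Sigma[0, j] = Sigma[j, 0] = gamma
    return Sigma

def sample_attribute_and_factors(n, Sigma, rng):
    """n iid rows (x_i, a_i, b_i, u_i, v_i) ~ normal_5(0, Sigma)."""
    return rng.multivariate_normal(np.zeros(5), Sigma, size=n)

gamma = 0.4
Sigma_A = scenario_covariance(gamma, "a")   # attribute correlated with sender effect a
Sigma_B = scenario_covariance(gamma, "u")   # attribute correlated with multiplicative factor u
draws = sample_attribute_and_factors(20_000, Sigma_A, np.random.default_rng(4))
```

Because all variances equal one, the off-diagonal entry γ is both the covariance and the correlation.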

Monte Carlo estimates of the power based on the level-0.05 likelihood ratio test in (9) for the test in (7) were computed for squared correlation values γ² ∈ {0, 0.05, 0.1, 0.15, 0.2}, network sizes n ∈ {25, 50, 100} and three decreasingly informative observations of the network:

  1. N = [a, b, U, V] is observed;

  2. N is estimated from a continuous network Y according to (1);

  3. N is estimated from a binary network Bd, where Bd is defined as Bd = {bi,j : bi,j = 1 if yi,j > yd, 0 otherwise} and yd is chosen such that the proportion of network relations greater than yd (i.e. the network density) is d.

Notice that the binary network Bd is a deterministic function of the continuous network Y. We consider the binary networks with density 0.5 and 0.15. The former case represents a relatively dense binary network with many observed relations, whereas the latter case reflects the more common kind of network seen in survey data, where information about only a small number of each node’s relations is available. For the continuous network Y and binary networks Bd, the Bayesian estimation procedure was used to obtain estimates of the latent network factors. A probit function was specified for the binary networks. The additive factors a and b were estimated by their posterior means, and the multiplicative factors U and V were estimated by the first left and right singular vector of the posterior mean of the multiplicative effect UVT in (4).
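
Two of the steps described above, thresholding Y to a target density and extracting the leading singular vectors of a matrix standing in for the posterior mean of UVT, can be sketched as follows. The Bayesian estimation itself is not reproduced here. Excluding the diagonal when choosing the threshold is an assumption of this sketch, and the function names are illustrative.

```python
import numpy as np

def binarize_at_density(Y, d):
    """B_d: b_ij = 1 if y_ij > y_d, with y_d the (1 - d) quantile of off-diagonal y_ij."""
    off = ~np.eye(Y.shape[0], dtype=bool)
    y_d = np.quantile(Y[off], 1.0 - d)
    B = (Y > y_d).astype(int)
    np.fill_diagonal(B, 0)          # self-relations are set to zero by assumption
    return B

def leading_singular_vectors(UVt_mean):
    """First left and right singular vectors of an (n x n) multiplicative-effect matrix."""
    A, c, Bt = np.linalg.svd(UVt_mean)
    return A[:, 0], Bt[0, :]

rng = np.random.default_rng(7)
n = 20
Y = rng.standard_normal((n, n))
B15 = binarize_at_density(Y, 0.15)

u, v = rng.standard_normal(n), rng.standard_normal(n)
u_hat, v_hat = leading_singular_vectors(np.outer(u, v))   # rank-one stand-in for E[UV^T | Y]
```

The singular vectors recover u and v only up to sign and scale, which does not affect the test because of its invariance to nonsingular linear transformations of the factors.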

Figure 2 shows the power estimates for the two covariance structures A and B and the four network observations (N, Y, B0.5, B0.15) based on 2,000 simulations. A single power curve is shown for the latent network factors N since the correlation structures A and B are equivalent with respect to the invariances of the test in (7). Most notably, Figure 2 illustrates there is relatively little power lost when the network factors are estimated from an observed continuous or binary network, even when the network size is small. The power of the test is slightly larger when dependence exists between the attribute and an additive factor compared to when it exists between the attribute and a multiplicative factor for continuous and binary network observations. This is likely a consequence of the relative ease with which additive effects are estimated compared to interaction effects. As expected, the power of the test decreases as the observed network information becomes less informative (N → Y → B0.5 → B0.15), although for even moderate network sizes the power loss is negligible.
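
For the case where the network factors N are observed directly, the power calculation can be approximated with a short Monte Carlo sketch. This is not the authors' simulation code. It covers only the observed-factor case with p = 1 and k = 1 under scenario A, and the function name, seed, and number of null draws are illustrative assumptions. With a single attribute, W reduces to one minus the R² from the no-intercept regression of x on the four factors, and its null distribution is Beta((n − 4)/2, 2).

```python
import numpy as np

def power_observed_factors(n, gamma, n_sims=2000, alpha=0.05, seed=0):
    """Estimated power of the level-alpha test when N = [a, b, u, v] is observed."""
    rng = np.random.default_rng(seed)
    # Monte Carlo alpha-quantile of the null distribution of W (p = 1, 2k + 2 = 4).
    crit = np.quantile(rng.beta((n - 4) / 2.0, 2.0, size=200_000), alpha)
    Sigma = np.eye(5)
    Sigma[0, 1] = Sigma[1, 0] = gamma              # scenario A: Cov[x, a] = gamma
    rejections = 0
    for _ in range(n_sims):
        D = rng.multivariate_normal(np.zeros(5), Sigma, size=n)
        x, N = D[:, 0], D[:, 1:]
        coef, *_ = np.linalg.lstsq(N, x, rcond=None)
        resid = x - N @ coef
        W = (resid @ resid) / (x @ x)              # equals 1 - r_1^2 when p = 1
        rejections += int(W < crit)                # small W means large Lambda
    return rejections / n_sims

size_at_null = power_observed_factors(50, 0.0, n_sims=1000)
power_at_02 = power_observed_factors(50, np.sqrt(0.2), n_sims=1000)
```

The estimate at γ = 0 should be close to the nominal level, and the estimate should increase with γ².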

Figure 2.

Power when testing for independence between a single attribute xi and network factors {ai, bi, ui, vi} based on four types of network observations (latent network factors N, continuous network Y, binary network B0.5, binary network B0.15).

5 Joint model for the network and nodal attributes

If the test in Section 3 rejects the null hypothesis of independence between the attributes and network factors, there is often interest in estimating and making inference on the dependencies, as well as predicting missing network and attribute information. Addressing such inference objectives requires joint modeling of the network Y and attributes X. We propose jointly modeling the network Y and attributes X via a model composed of the network relations model in (1) and (2), and the latent factor and attribute model in (6). For completeness, we include all components of the joint model below.

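$$y_{i,j} = \mu + a_i + b_j + u_i^T v_j + e_{i,j} \tag{13}$$

$$(e_{i,j}, e_{j,i})^T \overset{iid}{\sim} \text{normal}_2\left(0,\; \sigma_e^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right) \tag{14}$$

$$(x_i^T, a_i, b_i, u_i^T, v_i^T)^T = (x_i^T, n_i^T)^T \overset{iid}{\sim} \text{normal}_{p+2+2k}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \Sigma_{XN} = \begin{pmatrix} \Sigma_X & \Sigma_{X,N} \\ \Sigma_{N,X} & \Sigma_N \end{pmatrix}\right) \tag{15}$$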

Inference for the dependence and conditional dependencies between the attributes and network is based on the covariance matrix ΣXN.
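To make the generative structure of (13), (14), and (15) concrete, the following sketch simulates one network and attribute matrix from the joint model. All numerical values (n, p, k, μ, σe, ρ, and the covariance matrix) are hypothetical choices for illustration only, and the variable names are not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 30, 2, 2                  # nodes, attributes, multiplicative dimension
mu, sigma_e, rho = 0.0, 1.0, 0.5    # hypothetical parameter values
dim = p + 2 + 2 * k

# Hypothetical positive-definite Sigma_XN (illustration only)
A = rng.normal(size=(dim, dim))
Sigma_XN = A @ A.T / dim + np.eye(dim)

# (15): each row is (x_i, a_i, b_i, u_i, v_i), drawn iid across nodes
Z = rng.multivariate_normal(np.zeros(dim), Sigma_XN, size=n)
X = Z[:, :p]
a, b = Z[:, p], Z[:, p + 1]
U = Z[:, p + 2:p + 2 + k]
V = Z[:, p + 2 + k:]

# (14): errors within a dyad, (e_ij, e_ji), are correlated with correlation rho
E = np.zeros((n, n))
upper = np.triu_indices(n, 1)
cov_e = sigma_e**2 * np.array([[1.0, rho], [rho, 1.0]])
pairs = rng.multivariate_normal([0.0, 0.0], cov_e, size=len(upper[0]))
E[upper] = pairs[:, 0]
E[(upper[1], upper[0])] = pairs[:, 1]

# (13): sender effect + receiver effect + multiplicative effect + error
Y = mu + a[:, None] + b[None, :] + U @ V.T + E
np.fill_diagonal(Y, np.nan)         # self-relations are undefined
```

Dependence between X and Y enters only through the off-diagonal blocks of the covariance matrix in (15); setting those blocks to zero would simulate a network that is independent of the attributes.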

Simplified parameterization

The non-identifiability of the latent factors discussed in Sections 2 and 3 translates to non-identifiability of portions of the covariance matrix ΣXN. However, by restricting the covariance matrix to have specific structure, the 𝒢-invariance of the network model due to the multiplicative latent factors can be removed.

We propose reparameterizing the model for the latent factors and attributes in (15) by

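$$(x_i^T, a_i, b_i, u_i^T, v_i^T) \overset{iid}{\sim} \text{normal}_{p+2+2k}\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},\; \Sigma_{XN} = \begin{pmatrix} \Sigma_{Xab} & \Sigma_{Xab,U} & \Sigma_{Xab,V} \\ \Sigma_{U,Xab} & D & \Sigma_{U,V} \\ \Sigma_{V,Xab} & \Sigma_{V,U} & D \end{pmatrix}\right), \tag{16}$$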

where D is a diagonal matrix with decreasing elements along the diagonal. The joint model defined by (13), (14), and (16) is not invariant to transformations of the network factors and attributes by elements in the group 𝒢; however, it continues to possess non-identifiability with respect to the signs of the entries in U and V. Specifically, the probability of the observed network Y and attributes X is the same with parameters {U, V, ΣXN} as it is with parameters

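$$\left\{ US,\; VS,\; \begin{pmatrix} I_{p+2} & 0 & 0 \\ 0 & S & 0 \\ 0 & 0 & S \end{pmatrix} \Sigma_{XN} \begin{pmatrix} I_{p+2} & 0 & 0 \\ 0 & S & 0 \\ 0 & 0 & S \end{pmatrix} \right\}, \tag{17}$$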

where S is any k × k diagonal matrix with ±1’s along the diagonal. Similar to the estimation of the multiplicative network factors in Section 2, inference for the parameters’ relative likeliness in this model should be over equivalence classes defined by the relations in (17).
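The sign non-identifiability can be checked numerically. The sketch below uses hypothetical dimensions and randomly generated values; it verifies that flipping the signs of matching columns of U and V leaves the multiplicative effect UVT unchanged, and that the corresponding transformation of the covariance matrix leaves the normal quadratic form and determinant (and hence the density of the factors and attributes) unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 20, 3, 2                      # hypothetical sizes
dim = p + 2 + 2 * k

U = rng.normal(size=(n, k))
V = rng.normal(size=(n, k))
S = np.diag([1.0, -1.0])                # one possible sign matrix

# (US)(VS)^T = U S S V^T = U V^T because S S = I
same_effect = np.allclose(U @ V.T, (U @ S) @ (V @ S).T)

# Block-diagonal transformation diag(I_{p+2}, S, S)
T = np.eye(dim)
T[p + 2:p + 2 + k, p + 2:p + 2 + k] = S
T[p + 2 + k:, p + 2 + k:] = S

A = rng.normal(size=(dim, dim))
Sigma = A @ A.T + np.eye(dim)           # hypothetical covariance matrix
Sigma_flipped = T @ Sigma @ T

# A vector (x_i, a_i, b_i, u_i, v_i) and its sign-flipped counterpart
z = rng.normal(size=dim)
quad = z @ np.linalg.solve(Sigma, z)
quad_flipped = (T @ z) @ np.linalg.solve(Sigma_flipped, T @ z)
same_logdet = np.isclose(np.linalg.slogdet(Sigma)[1],
                         np.linalg.slogdet(Sigma_flipped)[1])
print(same_effect, np.isclose(quad, quad_flipped), same_logdet)
```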

Relation to reduced rank regression

The expectation of the network relations conditional on the attributes based on (13) resembles that of a reduced rank regression on (multiplicative) attribute interaction effects. This is noteworthy as the motivations underlying reduced rank regression parallel many of the motivations for this network modeling framework.

The expectation of the network factors conditional on the attributes can be written

$$E[(a_i, b_i, u_i^T, v_i^T)^T \mid x_i] = \left(\beta_{a|X} x_i,\; \beta_{b|X} x_i,\; (\beta_{U|X} x_i)^T,\; (\beta_{V|X} x_i)^T\right)^T,$$

where βa|X, βb|X are (p × 1) vectors and βU|X and βV|X are (k × p) matrices of coefficients based on ΣXN. Since the latent factors for different nodes are modeled as independent, the expectation of the network relations in (13) conditional on the attributes is

$$E[y_{i,j} \mid x_i, x_j] = \mu + \beta_{a|X} x_i + \beta_{b|X} x_j + x_i^T \beta_{U|X}^T \beta_{V|X} x_j.$$

The interaction term xiTβU|XTβV|Xxj represents a linear combination of all possible pairwise products between the p sender and p receiver attributes, resulting in p² linear effects. The coefficients on these linear effects are given by the (p × p) matrix βU|XTβV|X, whose rank is at most equal to the minimum of p and k. Therefore, if the number of attributes is greater than the number of multiplicative network factors (p > k), linear constraints will exist among the p² effect coefficients. In reduced rank regression the coefficient matrix corresponding to the regression of a multivariate outcome on a multivariate predictor is restricted to be reduced rank (Anderson (1951), see Reinsel and Velu (1998) for a comprehensive review). This approach to parameter dimension reduction is motivated by improvement in parameter estimation and interpretation. A similar goal exists in network modeling and is achieved here using the latent network factors. Modeling dependencies between the latent network factors N and attributes X instead of between the network relations Y and attributes X directly allows us to parsimoniously estimate and characterize complex (multiplicative) dependencies without defining a complicated regression model for the network relations. This approach is especially advantageous when the number of attributes is large and/or it is likely that at most a small number of attribute pairs are related to the network.
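The rank constraint can be illustrated numerically. In the sketch below the coefficient matrices are random placeholders, taken to have k rows and p columns so that each maps an attribute vector to a k-dimensional factor mean; the sizes p = 5 and k = 2 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 5, 2                              # hypothetical sizes with p > k

beta_U = rng.normal(size=(k, p))         # placeholder coefficients for u_i
beta_V = rng.normal(size=(k, p))         # placeholder coefficients for v_i

# Coefficients on the p^2 pairwise sender-by-receiver attribute products
C = beta_U.T @ beta_V
rank_C = np.linalg.matrix_rank(C)

# The bilinear form equals the sum over all pairwise products
x_i = rng.normal(size=p)
x_j = rng.normal(size=p)
interaction = x_i @ C @ x_j
pairwise_sum = sum(C[s, t] * x_i[s] * x_j[t]
                   for s in range(p) for t in range(p))

print(C.shape, rank_C)
```

With these sizes, the 25 interaction coefficients are determined by the 2 × 5 × 2 = 20 entries of the two coefficient matrices, and the product has rank 2 rather than 5, which is the sense in which the latent factors impose a reduced rank structure.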

Estimation

Estimation of the parameters in the joint network and attribute model is straightforward in a Bayesian context, where inference is based on the joint posterior distribution of the network factors {a, b, U, V} and parameters {σe2, ρ, ΣXN} given the data {X, Y}. Since an analytic expression of the posterior distribution is not available, it is approximated by samples generated from a Markov chain Monte Carlo (MCMC) algorithm. The MCMC procedure implemented in the R package ‘amen’ for the model described in Section 2, where the additive and multiplicative factors are uncorrelated, was adapted for the joint model presented here. Details regarding the families of prior distributions considered and the corresponding MCMC algorithm are included in the appendix. Code is provided at the corresponding author’s website.

6 Analysis of AddHealth data

We consider data from a survey of 389 high-school students from the National Longitudinal Study of Adolescent Health (AddHealth) (Harris et al. (2009)) and investigate whether evidence exists that student friendships are related to student health behaviors and grade point average (GPA). The data we use includes same-sex friendship nomination data, whereby students identified the top five friends of their sex, in addition to demographic and behavioral information. The data considered here can be described as follows:

  • network information - R = {ri,j}: ri,j is the rank of student j in student i’s listing of friends (5 = highest, 1 = lowest) or 0 if student i did not list student j;

  • nodal attributes - X = [xexercise, xdrink, xgpa]: standardized measures of exercise frequency, drinking frequency, and grade point average;

  • nodal covariate - W = [wgrade]: student grade level (9, 10, 11, or 12).

Students in the same grade and adjacent grades are more likely to be friends than students many grades apart. For this reason, we refine our question of interest to be whether students’ attributes (exercise, drinking, and GPA) are associated with their network relations while controlling for their grade.

We use the fixed rank nomination likelihood introduced in Hoff et al. (2013) to model the observed network ranks and the restriction that at most five friends could be listed on the survey. This likelihood assumes each observed network relation ri,j is a function of an underlying (latent) continuous measure zi,j such that the following relation consistencies are satisfied:

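$$\begin{aligned} r_{i,j} > 0 &\;\Rightarrow\; z_{i,j} > 0, \\ r_{i,j} > r_{i,k} &\;\Rightarrow\; z_{i,j} > z_{i,k}, \\ r_{i,j} = 0 \text{ and student } i \text{ listed} < 5 \text{ friends} &\;\Rightarrow\; z_{i,j} \le 0. \end{aligned} \tag{18}$$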

The first association is the function used in probit regression which assumes that if a friendship is reported, the latent friendship value must exceed a given threshold (in this case 0). The second relation assures consistency of the ranks with the latent friendship measures. Finally, the last association posits that friendships between a given student and all students he/she did not list as a friend must be below the friendship threshold if the nominating student listed fewer than five friends.
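The three consistency relations can be expressed as a simple check for one nominating student. This is an illustrative sketch only: the function name `consistent_with_ranks` is hypothetical, the inputs are assumed to exclude the student's relation to himself or herself, and the check is not part of the estimation code described in the article.

```python
def consistent_with_ranks(r_i, z_i, max_friends=5):
    """Return True if latent values z_i agree with observed ranks r_i.

    r_i[j] is the rank student i gave student j (0 if not listed) and
    z_i[j] is the corresponding latent friendship value.
    """
    m = len(r_i)
    listed = [j for j in range(m) if r_i[j] > 0]

    # Listed friends must have latent values above the threshold 0
    if any(z_i[j] <= 0 for j in listed):
        return False

    # A higher rank must correspond to a larger latent value
    for j in range(m):
        for l in range(m):
            if r_i[j] > r_i[l] and not z_i[j] > z_i[l]:
                return False

    # If fewer than max_friends were listed, unlisted students must be
    # at or below the threshold
    if len(listed) < max_friends:
        if any(z_i[j] > 0 for j in range(m) if r_i[j] == 0):
            return False
    return True


ok = consistent_with_ranks([5, 4, 0, 0], [2.0, 1.0, -0.5, -1.0])
uncensored_violation = consistent_with_ranks([5, 4, 0, 0], [2.0, 1.0, 0.3, -1.0])
censored_ok = consistent_with_ranks([2, 1, 0, 0], [2.0, 1.0, 0.3, -1.0], max_friends=2)
order_violation = consistent_with_ranks([5, 4, 0, 0], [1.0, 2.0, -0.5, -1.0])
```

The third call shows the role of the nomination limit: when a student has used every available nomination, an unlisted student may still have a positive latent value, provided it is smaller than the values of all listed friends.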

The network model for the latent relations zi,j is that given in (1) with additional regression terms for whether students are in the same grade, $w^s_{i,j}$, and whether they are in adjacent grades, $w^a_{i,j}$:

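$$z_{i,j} = \mu + \beta_s w^s_{i,j} + \beta_a w^a_{i,j} + a_i + b_j + u_i^T v_j + e_{i,j}, \qquad a_i, b_i \in \mathbb{R}, \quad u_i, v_i \in \mathbb{R}^k. \tag{19}$$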

Selection of factor dimension k

The multiplicative factor dimensions k for the male and female networks were determined using a method analogous to the scree plot method, which is commonly used in factor analysis and principal components analysis. The network model in (18) and (19) was fit to each gender network with k = 8. Let M denote the posterior mean estimate of the multiplicative network effect UVT, and let $\hat{M}$ represent the rank eight matrix approximation of M based on the singular value decomposition. The total variation in $\hat{M}$ is equal to the sum of the squared singular values: $\|\hat{M}\|_F^2 = \sum_{\ell=1}^{8} \lambda_\ell^2$, where ‖·‖F denotes the Frobenius norm and λi is the ith singular value. Figure 3 shows the proportion of the total variation in $\hat{M}$ attributed to each multiplicative effect (i.e. $\lambda_i^2 / \sum_{\ell=1}^{8} \lambda_\ell^2$). For both the male and female networks, the large majority of the variation in the network relations explained by the eight multiplicative effects is associated with the first three effects. Thus, the multiplicative effect dimension was selected to be three for both networks.

Figure 3.

Proportion of variation in the posterior mean eight factor multiplicative effect that is explained by each multiplicative effect.
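The proportions plotted in Figure 3 can be computed from a posterior mean matrix with a singular value decomposition. The sketch below is a hedged illustration: the function name is hypothetical and the matrix M is simulated with a dominant rank three structure, standing in for a posterior mean estimate that would in practice come from the fitted model.

```python
import numpy as np

def multiplicative_effect_proportions(M, k_max=8):
    """Share of the squared Frobenius norm of the rank-k_max approximation
    of M attributed to each of its k_max leading singular values."""
    s = np.linalg.svd(M, compute_uv=False)[:k_max]
    return s**2 / np.sum(s**2)

# Simulated stand-in for the posterior mean of U V^T (not AddHealth data)
rng = np.random.default_rng(4)
n = 40
M = rng.normal(size=(n, 3)) @ rng.normal(size=(n, 3)).T
M += 0.05 * rng.normal(size=(n, n))      # small full-rank perturbation

props = multiplicative_effect_proportions(M)
print(np.round(props, 3))
```

A sharp drop in these proportions after the first few effects, as in Figure 3, suggests that a small factor dimension captures most of the structure in the multiplicative effect.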

Testing for dependence

As discussed in the Introduction, a traditional approach to modeling dependence between the network relations and nodal attributes would be to include regression terms in the form of sender, receiver, and interaction effects for the three attributes. Including all such effects using this approach would require 15 regression terms. However, by performing the test of independence proposed in Section 3 based on the latent network factors, we are able to assess the evidence for any relationship between the attributes and network without creating a potentially unnecessarily complex network model or performing any model selection.

The latent network factors for the model in (18) and (19) with k = 3 were estimated for the male and female networks. The additive factors a and b were estimated by their posterior means, and the multiplicative factors U and V were estimated by the first three left and right singular vectors of the posterior mean of the multiplicative effect UVT. The tests of independence between the network factors and the three nodal attributes for the female and male networks both resulted in p-values < 0.001. Therefore, based on a 0.05 level test, we reject the null hypothesis of independence between the student attributes and their network relations after accounting for grade structure.

Jointly modeling the network and attributes

The rejections of the tests of independence between the attributes and network suggest the network factors are informative for nodal attribute data. To investigate this claim we performed a 20-fold cross validation on each sex dataset in which 5% of the data for each attribute was treated as missing in each experiment. We compared predictions for the missing attributes based on the observed attributes alone to predictions based on both the network and observed attributes. The predictions based solely on the attributes were the fitted values from a regression of each attribute on all other attributes. The predictions based on the network and attributes were the posterior mean estimates from the Bayesian estimation procedure for the joint network and attribute model introduced in Section 5. For each sex dataset, a Markov chain was run for 500 iterations of burn-in followed by an additional 500,000 iterations, and samples were thinned to every 25th iteration, resulting in 20,000 simulated values for each missing element. The average effective sample size was 2,607 for the male network and 734 for the female network.
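The attribute-only baseline can be sketched as a cross-validated linear regression of one attribute on the others. This is an assumption-laden illustration rather than the authors' procedure: the function name is hypothetical, held-out nodes are assumed to have their other attributes observed, an intercept is included, and the data are simulated.

```python
import numpy as np

def cv_mse_attribute_regression(X, target, n_folds=20, seed=0):
    """Mean over folds of the squared prediction error when column `target`
    of X is held out for one fold and predicted by least squares regression
    on the remaining columns."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    others = [c for c in range(X.shape[1]) if c != target]
    fold_mse = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        D_train = np.column_stack([np.ones(len(train)), X[train][:, others]])
        coef, *_ = np.linalg.lstsq(D_train, X[train, target], rcond=None)
        D_test = np.column_stack([np.ones(len(held_out)), X[held_out][:, others]])
        pred = D_test @ coef
        fold_mse.append(np.mean((X[held_out, target] - pred) ** 2))
    return float(np.mean(fold_mse))

# Simulated attributes in which column 0 depends linearly on columns 1 and 2
rng = np.random.default_rng(5)
X_sim = rng.normal(size=(389, 3))
X_sim[:, 0] = 2.0 * X_sim[:, 1] - X_sim[:, 2] + 0.1 * rng.normal(size=389)
mse_sim = cv_mse_attribute_regression(X_sim, target=0)
print(round(mse_sim, 4))
```

With 20 folds, each fold holds out roughly 5% of the nodes, which matches the proportion of each attribute treated as missing in each experiment.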

Table 1 shows the mean squared error over the 20 cross validations for each attribute and each sex dataset. The predictions based on the network and attributes improved upon the predictions based on the attributes alone for both sexes and all attributes. The improvement was greatest for male drinking frequency and female GPA, where prediction mean squared error was reduced by 17.0% and 15.7%, respectively. This illustrates that when dependence exists between the network and attributes, improvements in the predictions of missing values can be obtained by using both the network and attribute information.

Table 1.

Mean squared error for predictions from 20-fold cross validation.

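| Method | Males: Exercise | Males: Drinking | Males: GPA | Females: Exercise | Females: Drinking | Females: GPA |
|---|---|---|---|---|---|---|
| Regression (attributes only) | 1.89 | 3.24 | 2.38 | 1.67 | 2.38 | 2.29 |
| Joint model (attributes & network) | 1.75 | 2.69 | 2.18 | 1.61 | 2.17 | 1.93 |
| % improvement | 7.4 | 17.0 | 8.4 | 3.6 | 8.8 | 15.7 |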
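The percentage improvements in Table 1 follow directly from the two rows of mean squared errors, as the relative reduction 100 × (regression MSE − joint model MSE) / regression MSE. The short check below reproduces them from the reported values; the dictionary keys are labels chosen here for illustration.

```python
regression_mse = {
    "male exercise": 1.89, "male drinking": 3.24, "male GPA": 2.38,
    "female exercise": 1.67, "female drinking": 2.38, "female GPA": 2.29,
}
joint_mse = {
    "male exercise": 1.75, "male drinking": 2.69, "male GPA": 2.18,
    "female exercise": 1.61, "female drinking": 2.17, "female GPA": 1.93,
}

# Relative reduction in mean squared error, in percent, rounded to one decimal
improvement = {
    key: round(100 * (regression_mse[key] - joint_mse[key]) / regression_mse[key], 1)
    for key in regression_mse
}
for key, value in improvement.items():
    print(f"{key}: {value}%")
```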

7 Discussion

In this article we introduced an original approach to testing whether dependencies exist between a network and attribute data, and a joint modeling framework for the network and attributes, which relies on a simplified representation of the network in terms of latent node-specific factors. As discussed in the Introduction, many others have proposed methodology for investigating the relationship between network and attribute data. The most common methods involve regressing either the network on functions of the nodal attributes or each node’s attributes on functions of the attributes of the node’s neighbors in the network. Frequently, final models are settled upon after some, often undocumented, model selection procedure, which not only alters the interpretation of the results but also adds an additional element of subjectivity to the analysis. The key distinction between the previous methods and those presented here is that our method does not involve any model selection procedures and simultaneously tests and estimates first and second order dependencies between the network and attributes.

Although the methods presented here are able to address questions about association between attributes and network relations based on a single cross-sectional dataset, the ultimate goal of much research in the social and biological sciences is to determine causal relationships using temporal network and attribute data. For example, sociologists are interested in determining whether students’ friendships impact their health behaviors or whether health behaviors cause changes in their friendship networks (Bauman and Ennett (1996)). Snijders et al. (2007) and Steglich et al. (2010) propose methods for modeling the co-evolution of networks and behaviors using actor-oriented processes whereby actors dictate changes in their behaviors and outgoing ties. These methods focus on binary data, rely on method of moments estimation procedures, and are not applicable to cross-sectional datasets since they model network and attribute changes conditional on the first observation. A key advantage of the latent variable model approach here is that it permits a formal testing procedure devoid of any model selection procedures. Extending these methods to allow for testing of causal relationships between a network and attributes is a future research area of the authors.

A historically difficult problem not addressed here is how to select the number of multiplicative factors in the network model. In Section 6 we illustrated a procedure similar to the scree plot method used frequently to choose the number of factors in factor analysis and the number of eigenvectors in principal component analysis. An alternative approach would be to incorporate the dimension selection into the model by placing a prior on the number of factors, similar to that proposed in Hoff (2007) for the singular value decomposition, and simultaneously estimate the multiplicative latent factor dimension along with the other parameters. However, this would greatly increase the complexity of the model and the computation time of estimation.

Acknowledgments

This work was partially supported by NICHD grant R01 HD-067509. The authors thank Alex Volfovsky for his helpful comments and discussion.

Appendix

A Bayesian estimation procedure

In this section we outline the Bayesian estimation procedure used to obtain parameter estimates for the joint attribute and network model in (15). This procedure is extremely similar to that implemented in the ‘amen’ package in the statistical computing program R. We present the simple case here where the observed network Y is continuous, there are no regression terms in the network model, and there is no missing data. For details on accommodating non-continuous network data see Hoff et al. (2013) and for including regression terms see Hoff (2005).

Model -

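$$\begin{aligned} y_{i,j} &= \mu + a_i + b_j + u_i^T v_j + e_{i,j}, \\ (e_{i,j}, e_{j,i})^T &\overset{iid}{\sim} \text{normal}_2\left(0,\; \sigma_e^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right), \\ (x_i^T, a_i, b_i, u_i^T, v_i^T)^T &\overset{iid}{\sim} \text{normal}_{p+2+2k}(0, \Sigma_{XN}) \end{aligned}$$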

Prior distributions -

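$$\begin{aligned} \sigma_e^{-2} &\sim \text{gamma}(1/2, 1/2), \\ \rho &\sim \text{uniform}(-1, 1), \\ \Sigma_{XN}^{-1} &\sim \text{Wishart}\left(p + 2 + 2k + 1,\; \begin{pmatrix} \Sigma_{X0}^{-1} & 0 \\ 0 & I_{2+2k} \end{pmatrix}\right) \end{aligned}$$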

Markov chain Monte Carlo algorithm -

Given initial values of all latent variables {a, b, U, V} and parameters {ΣXN, ρ, σe2}, the algorithm proceeds as follows:

  1. Sample a, b|Y, X, U, V, ΣXN, ρ, σe2 (normal).

  2. Sample ΣXN|Y, X, a, b, U, V, ρ, σe2 (inverse-Wishart).

  3. Update ρ using a Metropolis-Hastings step with proposal ρ*|ρ ~ truncated normal[−1,1] (ρ, σe2).

  4. Sample σe2|Y, X, a, b, U, V, ρ, ΣXN (inverse-gamma).

  5. For each latent factor i:
    • Sample U[, i]|Y, X, a, b, U[, −i], V, ρ, σe2, ΣXN (normal);
    • Sample V [, i]|Y, X, a, b, U, V [, −i], ρ, σe2, ΣXN (normal).

Although the estimation algorithm is not constructed based on the unique parameterization of the model, each sample of network factors from the posterior distribution can be transformed using the covariance matrix ΣXN sample to represent a sample from (16). Inference for the relative likeliness of parameter values is based on the posterior distribution over the parameter equivalence classes associated with representations congruent with (17).
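As an illustration of the covariance matrix update in the algorithm above, the sketch below draws ΣXN from a conjugate full conditional. It is not the authors' implementation: it assumes a prior of the form Σ⁻¹ ~ Wishart(ν0, S0⁻¹) with integer ν0, under which Σ⁻¹ given the stacked rows zi = (xi, ai, bi, ui, vi) is Wishart(ν0 + n, (S0 + ZTZ)⁻¹). The names `sample_sigma_xn`, `nu0`, and `S0` are hypothetical, and the data are simulated.

```python
import numpy as np

def sample_sigma_xn(Z, nu0, S0, rng):
    """One draw of Sigma_XN given Z (n rows of attributes and factors).

    Assumes Sigma^{-1} ~ Wishart(nu0, S0^{-1}) a priori, so that
    Sigma^{-1} | Z ~ Wishart(nu0 + n, (S0 + Z^T Z)^{-1}).
    """
    n, dim = Z.shape
    scale = np.linalg.inv(S0 + Z.T @ Z)
    scale = (scale + scale.T) / 2.0       # guard against numerical asymmetry
    # With integer degrees of freedom, a Wishart draw is a sum of outer
    # products of independent normal vectors with covariance `scale`
    G = rng.multivariate_normal(np.zeros(dim), scale, size=nu0 + n)
    precision = G.T @ G
    Sigma = np.linalg.inv(precision)
    return (Sigma + Sigma.T) / 2.0

rng = np.random.default_rng(7)
p, k, n = 2, 1, 200                       # hypothetical sizes
dim = p + 2 + 2 * k
Z = rng.normal(size=(n, dim))             # simulated attributes and factors
Sigma_draw = sample_sigma_xn(Z, nu0=dim + 1, S0=np.eye(dim), rng=rng)
print(np.round(np.diag(Sigma_draw), 2))
```

In a full sampler this draw would alternate with the updates of the additive factors, ρ, σe2, and the columns of U and V, each conditional on the current values of the others.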

Contributor Information

Bailey K. Fosdick, Department of Statistics, Colorado State University

Peter D. Hoff, Department of Statistics and Biostatistics, University of Washington

References

  1. Anderson TW. Estimating linear restrictions on regression coefficients for multivariate normal distributions. The Annals of Mathematical Statistics. 1951;22(3):327–351.
  2. Anderson TW, Gupta SD. A monotonicity property of the power functions of some tests of the equality of two covariance matrices. Annals of Mathematical Statistics. 1964;35(3):1059–1063.
  3. Austin A, Linkletter C, Wu Z. Covariate-defined latent space random effects model. Social Networks. 2013;35(3):338–346.
  4. Azari H, Airoldi E. Graphlets decomposition of a weighted network. Journal of Machine Learning Research, W&CP, 22 (Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, AISTATS); 2012. pp. 54–63.
  5. Bauman KE, Ennett ST. On the importance of peer influence for adolescent drug use: commonly neglected considerations. Addiction. 1996;91(2):185–198.
  6. Butland G, Peregrín-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, et al. Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature. 2005;433(7025):531–537. doi: 10.1038/nature03239.
  7. Christakis NA, Fowler JH. The spread of obesity in a large social network over 32 years. New England Journal of Medicine. 2007;357(4):370–379. doi: 10.1056/NEJMsa066082.
  8. Cohen-Cole E, Fletcher JM. Is obesity contagious? social networks vs. environmental factors in the obesity epidemic. Journal of Health Economics. 2008;27(5):1382–1387. doi: 10.1016/j.jhealeco.2008.04.005.
  9. Erbring L, Young AA. Individuals and social structure: Contextual effects as endogenous feedback. Sociological Methods & Research. 1979;7(4):396–430.
  10. Faust K. Comparison of methods for positional analysis: Structural and general equivalence. Social Networks. 1988;10(4):313–341.
  11. Fellows I, Handcock MS. Exponential-family Random Network Models. arXiv preprint:1208.0121. 2012.
  12. Frank O, Strauss D. Markov graphs. Journal of the American Statistical Association. 1986;81(395):832–842.
  13. Handcock MS. Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers. National Academies Press; 2003. Statistical models for social networks: Inference and degeneracy; p. 229.
  14. Harris K, Halpern C, Whitsel E, Hussey J, Tabor J, Entzel P, Udry J. The national longitudinal study of adolescent health: Research design. 2009 URL: http://www.cpc.unc.edu/projects/addhealth/design.
  15. Hoff PD. Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association. 2005;100(469):286–295.
  16. Hoff PD. Model averaging and dimension selection for the singular value decomposition. Journal of the American Statistical Association. 2007;102(478):674–685.
  17. Hoff PD. Multiplicative latent factor models for description and prediction of social networks. Computational & Mathematical Organization Theory. 2009;15(4):261–272.
  18. Hoff PD, Fosdick BK, Volfovsky A, Stovel K. Likelihoods for fixed rank nomination networks. To appear in Network Science. 2013 doi: 10.1017/nws.2013.17.
  19. Hoff PD, Raftery AE, Handcock MS. Latent space approaches to social network analysis. Journal of the American Statistical Association. 2002;97(460):1090–1098.
  20. Hunter DR, Handcock MS. Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics. 2006;15(3):565–583.
  21. Kim M, Leskovec J. Modeling social networks with node attributes using the multiplicative attribute graph model. arXiv preprint:1106.5053. 2011.
  22. Kim M, Leskovec J. Multiplicative attribute graph model of real-world networks. Internet Mathematics. 2012;8(1–2):113–160.
  23. Lyons R. The spread of evidence-poor medicine via flawed social-network analysis. Statistics, Politics, and Policy. 2011;2(1).
  24. Marsden PV, Friedkin NE. Network studies of social influence. Sociological Methods & Research. 1993;22(1):127–151.
  25. McFarland DD, Brown DJ. EO Laumann. Bonds of Pluralism: The Form and Substance of Urban Social Networks. New York: John Wiley; 1973. Social distance as metric: A systematic introduction to smallest space analysis; pp. 213–252.
  26. Muirhead R. Aspects of multivariate statistical theory. Wiley; 1982.
  27. Perlman MD, Olkin I. Unbiasedness of invariant tests for manova and other multivariate problems. Annals of Statistics. 1980;8(6):1326–1341.
  28. Reinsel G, Velu R. Lecture Notes in Statistics. Springer; 1998. Multivariate Reduced-Rank Regression: Theory and Applications.
  29. Robins G, Pattison P, Elliott P. Network models for social influence processes. Psychometrika. 2001;66(2):161–189.
  30. Schweinberger M. Instability, sensitivity, and degeneracy of discrete exponential families. Journal of the American Statistical Association. 2011;106(496):1361–1370. doi: 10.1198/jasa.2011.tm10747.
  31. Snijders TA, Steglich CE, Schweinberger M. Modeling the co-evolution of networks and behavior. In: van Montfort K, Oud H, Satorra A, editors. Longitudinal models in the behavioral and related sciences. Mahwah, NJ: Lawrence Erlbaum; 2007. pp. 41–71.
  32. Snijders TAB, Pattison PE, Robins GL, Handcock MS. New specifications for exponential random graph models. Sociological Methodology. 2006;36(1):99–153.
  33. Steglich C, Snijders TA, Pearson M. Dynamic networks and behavior: Separating selection from influence. Sociological Methodology. 2010;40(1):329–393.
  34. Veenstra R, Dijkstra JK, Steglich C, Van Zalk MH. Network-behavior dynamics. Journal of Research on Adolescence. 2013;23(3):399–412.
  35. Warner R, Kenny D, Stoto M. A new round-robin analysis of variance for social interaction data. Journal of Personality and Social Psychology. 1979;37(10):1742–1757.
  36. Wasserman S, Pattison P. Logit models and logistic regressions for social networks: I. An introduction to markov graphs and p*. Psychometrika. 1996;61(3):401–425.
  37. Wong GY. Round-robin analysis of variance via maximum likelihood. Journal of the American Statistical Association. 1982;77(380):714–724.
