Abstract
We describe semiparametric estimation and inference for causal effects using observational data from a single social network. Our asymptotic results are the first to allow for dependence of each observation on a growing number of other units as sample size increases. In addition, while previous methods have implicitly permitted only one of two possible sources of dependence among social network observations, we allow for both dependence due to transmission of information across network ties and for dependence due to latent similarities among nodes sharing ties. We propose new causal effects that are specifically of interest in social network settings, such as interventions on network ties and network structure. We use our methods to reanalyze an influential and controversial study that estimated causal peer effects of obesity using social network data from the Framingham Heart Study; after accounting for network structure we find no evidence for causal peer effects.
Keywords: Statistical dependence, Causal inference, Social networks, Semiparametric inference
1. Introduction
There is increasing interest in identifying and estimating causal effects in the context of social networks, that is, causal effects that one individual’s behavior, treatment assignment, beliefs, or health outcome could have on their social contacts’ behaviors, exposures, beliefs, or health statuses. But methodology has not kept pace with interest in causal inference using data from individuals connected in a social network, and many researchers have resorted to using inappropriate statistical methods to analyze this new type of data. There have been a number of high-profile articles that use standard methods like generalized linear models (GLM) and generalized estimating equations (GEE) to attempt to infer causal peer effects from network data (e.g. Christakis and Fowler, 2007, 2008, 2010), and this work has inspired several research programs that study peer effects using the same statistical methods (Ali and Dwyer, 2010; Cacioppo et al., 2009; Madan et al., 2010; Rosenquist et al., 2010; Wasserman, 2013). These methods have come under considerable criticism from the statistical community (Cohen-Cole and Fletcher, 2008; Lyons, 2011; Shalizi and Thomas, 2011), in part because these statistical models are not equipped to deal with dependence across individuals (Ogburn and VanderWeele, 2014).
Recently, many researchers have developed methods specifically designed to deal with interference–the effect that one subject’s treatment may have on others’ outcomes–and other forms of causal dependence among subjects. However, the methods developed in this context generally require observing multiple independent groups of units, i.e. multiple independent networks, or else they require treatment to be randomized. Ideally, we would like to be able to perform inference even when all observations are sampled from a single social network and when exposures cannot be randomized. Aside from van der Laan (2014), upon which our work builds, Tchetgen Tchetgen et al. (2020) is the only other proposed solution to this problem of which we are aware. The approach of Tchetgen Tchetgen et al. (2020) is quite different from ours, primarily because it assumes that the outcomes of interest comprise a single realization of a specific type of Markov random field over the network. This corresponds to certain types of equilibrium distributions and is incompatible with the traditional causal data generating mechanisms that we work with in this paper, namely causal structural equation models and directed acyclic graph (DAG) models (for a discussion of these compatibility issues see Lauritzen and Richardson, 2002; Ogburn et al., 2020).
We build upon recent work by van der Laan (2014) that proposed estimators for traditional causal estimands using data from a single collection of interconnected units in which each unit is independent of all but a small number of other units, with asymptotic results relying on the number of dependent units being fixed as the total number of units goes to infinity and with dependence due solely to direct transmission. We prove a new central limit theorem for network dependent data, an implication of which is that the estimators proposed by van der Laan (2014) are consistent and asymptotically normal under more realistic forms of dependence and more general asymptotic regimes. As far as we are aware our methods are the first to accommodate latent, unstructured dependence; previous methods (including van der Laan, 2014 and Tchetgen Tchetgen et al., 2020) are appropriate when all dependence is due to observable direct transmission of information, treatments, or outcomes along network ties. This is a crucial contribution as latent variable dependence is likely to be present in all but the most pathological social network settings. We also introduce novel causal estimands that are specifically of interest in network settings and propose the first conditions that permit causal inference for interventions on network structure. Together these contributions comprise a broad framework for causal inference using observational data from a single social network. In this paper we focus on the theory and only briefly touch on estimation; details regarding the implementation and computation of our estimation procedure can be found in a series of companion papers (Sofrygin et al., 2017, 2018; Sofrygin and van der Laan, 2017). An R package is also available (Sofrygin and van der Laan, 2015).
2. Background and setting
2.1. Motivating example
In order to demonstrate the importance of principled methods designed to handle the complexity of observational social network data, in Section 6 we reanalyze the Framingham Heart Study (FHS) data used in Christakis and Fowler (2007), which purported to find evidence that obesity is socially contagious. The FHS is an epidemiologic cohort study that was originally designed to study cardiovascular epidemiology and comprises over 15,000 participants from the town of Framingham, Massachusetts and neighboring communities. In addition to its important role in epidemiology, the FHS plays a uniquely influential role in the study of social networks and peer effects. In the early 2000s, researchers Christakis and Fowler (CF) discovered an untapped resource buried in the FHS data collection tracking sheets: information on social ties that, combined with existing data on connections among the FHS participants, allowed them to reconstruct the (partial) social network underlying the cohort. They then leveraged this social network data to study peer effects for obesity (Christakis and Fowler, 2007), smoking (Christakis and Fowler, 2008), and happiness (Fowler and Christakis, 2008). Researchers have since used the same methods as Christakis and Fowler to study peer effects in the FHS and in many other social network settings (e.g. Trogdon et al., 2008; Fowler and Christakis, 2008; Rosenquist et al., 2010).
We will refer to a toy example based on FHS throughout. For the purposes of our example we will assume that Framingham is a closed community and that we have data on all members of the community and knowledge of their connections with one another. Suppose, as has been suggested by some researchers (Fowler and Christakis, 2008), that happiness exhibits peer effects, and let Y be a continuous measure of happiness, measured simultaneously on each of the n individuals in the FHS network. Below we will describe causal estimands of interest in the context of this motivating example.
2.2. Networks and causal estimands
A network is a collection of units, or nodes, and information about the presence or absence of pairwise ties between them. The presence of a tie between two units indicates that the units share some kind of a relationship; for example in the FHS ties include familial relatedness, friendship, and shared place of work. Some types of relationships, like familial relatedness, are mutual; others, like friendship, may go in only one direction. For simplicity we will assume all networks are undirected in what follows, but our methods are equally applicable to directed networks. In an undirected network, the degree of a node is the number of ties it has. The alters of node i are the nodes with which i shares ties. Let Aij ≡ I {subjects i and j share a tie} and Aii = 1. The matrix A with entries Aij is the adjacency matrix for the network. We will assume throughout that each node is associated with a vector of random variables, Oi, including an outcome Yi, covariates Ci, and an exposure or treatment variable Xi, all possibly indexed by time t. Throughout we will use bold letters to represent n-vectors of random variables, e.g. Y denotes (Y1, …, Yn).
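The network notation above can be made concrete with a small sketch. The four-node network, the helper names `alters` and `degrees`, and the choice to exclude the self-entry Aii when counting ties are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def alters(A, i):
    """Return the indices j != i that share a tie with node i."""
    return [j for j in range(A.shape[0]) if j != i and A[i, j] == 1]

def degrees(A):
    """Number of ties per node, excluding the self-entry A[i, i]."""
    off_diag = A - np.diag(np.diag(A))
    return off_diag.sum(axis=1)

# Hypothetical undirected network on 4 nodes with ties 0-1, 0-2, 2-3.
# Following the convention in the text, the diagonal is set to 1.
A = np.eye(4, dtype=int)
for i, j in [(0, 1), (0, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1

print(degrees(A))    # degree of each node
print(alters(A, 0))  # alters of node 0
```

Because the network is undirected, the adjacency matrix is symmetric; a directed network would simply drop the `A[j, i]` assignment.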
In applications across the social, political, and health sciences, researchers are interested in ascertaining the presence of and estimating causal interactions across alter-ego pairs. Is there interference, i.e. does the treatment of subject i have a causal effect on the outcome of subject j when i and j share a network tie? Are there peer effects, i.e. does the outcome of subject i have a causal effect on its alters’ future outcomes? Implicit in many of these causal inquiries is the notion that causal effects only transmit along network ties. It is crucial that the network be completely and accurately measured, since missing ties can result in unmeasured confounding or missing exposure variables. For example, Ci, e.g. i’s baseline attitude about mental health, may be a confounder not just for the effect of Xi on Yi but also for the effect of Xj on Yj if it affects j’s treatment uptake and j’s happiness. If the existence of a tie between units i and j is missing in our data, we may fail to control for confounders Ci of causal effects on Yj. We could also fail to account for elements Xi of the exposure affecting Yj, resulting in exposure misclassification.
Let Yi(x*) denote the potential or counterfactual outcome of individual i in a hypothetical world in which the exposure vector X was set to x*, e.g. the happiness outcome we would have observed for individual i if all FHS subjects had received meditation training. We index potential outcomes with the vector of exposures because, in contrast with i.i.d. settings, Yi may be causally affected by the exposures of other nodes. Additionally, note that these potential outcomes may not be identically distributed. We will focus throughout on network-wide estimands of the form $E[\bar{Y}_n(\mathbf{x}^*)]$, where $\bar{Y}_n(\mathbf{x}^*) = \frac{1}{n}\sum_{i=1}^{n} Y_i(\mathbf{x}^*)$: the expected potential outcome, averaged across network nodes, in the hypothetical world where X is set to x*. The distribution of Yi(x*) depends on which nodes j are causally related to i – that is, it depends on the observed adjacency matrix A and, through that, on n. Therefore, all of our estimands and estimators condition on the adjacency matrix A and on n. This same conditioning is implicit in other work on causal inference for data from a single network, since no methods currently exist to jointly model the network structure given by A and the random variables associated with the nodes, but this is rarely made explicit. Estimands of the form $E[\bar{Y}_n(\mathbf{x}^*)]$ are data-dependent or data-adaptive (Hubbard et al., 2016), meaning that their value depends on the observed data sample through A and n. This changes but does not undermine their interpretation. Consider a hypothetical intervention to make FHS participants happier. Using data from FHS we may not be able to learn about the expected average potential outcome under this intervention for the U.S. population (unless the effect of the intervention does not depend on n or A), but we may be able to learn about the expected average potential outcome under this intervention for other mid-sized, middle class, New England suburbs.
We will also, at times, discuss causal estimands conditional on covariates C, e.g. $E[\bar{Y}_n(\mathbf{x}^*) \mid \mathbf{C}]$. As n → ∞, $E[\bar{Y}_n(\mathbf{x}^*) \mid \mathbf{C}] - E[\bar{Y}_n(\mathbf{x}^*)] \to 0$; however, for finite n conditional estimators have smaller variance and inference about the conditional estimand cannot be interpreted as inference about the marginal estimand. Inference about conditional estimands could be informative about other, similar networks that share a common distribution of C. For example, if C is age, then we might be able to use inference about $E[\bar{Y}_n(\mathbf{x}^*) \mid \mathbf{C}]$ to learn about the expected average potential outcome of the intervention in other mid-sized, middle class New England suburbs with a similar age distribution to Framingham. If C is not an effect modifier, then causal effects are the same conditional on or marginalized over C and conditioning on C does not limit our ability to generalize causal effects.
We consider three families of causal interventions in this paper: static, dynamic, and stochastic. Static interventions set Xi to a user-specified, fixed value x* that does not depend on unit i’s characteristics. The literature on causal inference with interference has largely focused on static interventions to date. This family includes, for example, the average potential outcome in a world in which everyone received exposure X = 1 compared to a world in which everyone received exposure X = 0. In the interference literature, this effect is known as an overall effect (Hudgens and Halloran, 2008; Tchetgen Tchetgen and VanderWeele, 2012). It also includes versions of direct and indirect effects, which quantify the effect of a unit’s own treatment holding other treatments constant and the effect of others’ treatments holding a unit’s own treatment constant, respectively, though the most common definition of direct and indirect effects requires the presence of multiple independent groups of individuals and will not apply in our single network setting. A dynamic intervention assigns exposures as a user-specified, deterministic function of covariates C, e.g. an FHS happiness intervention that assigns subjects to talk therapy if their baseline happiness is below a certain threshold. Exposure is deterministically specified conditional on covariates but is allowed to depend “dynamically” on covariates. A stochastic intervention (Haneuse and Rotnitzky, 2013; Muñoz and van der Laan, 2012; Young et al., 2014) assigns exposures as a user-specified, random function. This allows exposure to be a stochastic function of covariates, e.g. an FHS happiness intervention that assigns subjects to talk therapy with a probability that depends on their baseline happiness. In Section 4 we use stochastic interventions to define some kinds of interventions on network structure.
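The three families of interventions can be sketched as exposure-assignment rules. The function names, the threshold value, and the logistic form of the assignment probability are hypothetical choices made only for illustration; the covariate c stands in for baseline happiness in the toy example.

```python
import numpy as np

def static_intervention(c, x_star=1):
    """Static: every unit receives the same fixed exposure x_star."""
    return np.full(len(c), x_star)

def dynamic_intervention(c, threshold):
    """Dynamic: exposure is a deterministic function of the covariate.
    Here a unit is treated if its baseline happiness is below threshold."""
    return (c < threshold).astype(int)

def stochastic_intervention(c, rng):
    """Stochastic: exposure is drawn at random with a covariate-dependent
    probability. The logistic form below is an assumed example."""
    p = 1.0 / (1.0 + np.exp(c))  # lower baseline happiness -> higher probability
    return rng.binomial(1, p)

rng = np.random.default_rng(0)
c = np.array([-1.0, 0.5, 2.0, -0.3])  # hypothetical baseline happiness scores

x_static = static_intervention(c)
x_dynamic = dynamic_intervention(c, threshold=0.0)
x_stochastic = stochastic_intervention(c, rng)
```

Under the static and dynamic rules the exposure vector is fully determined once c is known, whereas under the stochastic rule only its distribution is known by design.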
2.3. Inferential and asymptotic framework
Throughout we assume that we have observed a complete network of size n. Our methods are not appropriate for a random sample from a superpopulation because such a random sample would miss crucial confounders and exposures, as described in Section 2.2 above. The data may exhibit latent variable dependence – unstructured dependence resulting from latent traits that are more similar for observations that are close in the network than for distant observations – up to a distance of two network ties. Latent variable dependence will be present in data sampled from a network whenever observations from nodes that are close to one another are more likely to share unmeasured traits than are observations from distant nodes. Homophily, or the tendency of people who share similar traits to form network ties, is a paradigmatic example of latent variable dependence. Additionally, in networks, ties present opportunities to transmit causal effects or information, and dependence due to direct transmission will be present whenever a subject’s treatments, outcomes, or covariates may affect their alters’ treatments, outcomes, or covariates.
We will show in Section 3.2 that, despite this dependence, the causal estimands of interest are identified by functionals of the observed data distribution P(C, X, Y), which depends on n and A. This identifying functional is our target of inference. It is a parameter of the observed data distribution for a network of size n, i.e. a parameter of the data generating distribution that gave rise to the data at hand. It is an unknown parameter rather than an observed quantity because the data we see comprise a single, random draw from P(C, X, Y). In i.i.d. settings inference licenses extrapolation to populations for which the data are representative, in the sense of being drawn from (nearly) the same underlying data generating distribution. The same principle applies here: we might want to estimate a parameter of the FHS data generating distribution because we think that it is informative about the social networks found in other mid-sized, middle class New England suburbs. In i.i.d. settings, if a causal effect does not depend on a particular aspect of the data-generating distribution, then we are licensed to extrapolate to other populations that differ from the data with respect to that aspect. Similarly, here, if we have reason to believe that the causal effect of interest is homogeneous in n and relevant features of A, then we might be justified in using an analysis of the FHS data to extrapolate to other small, mid-sized, and large middle-class New England suburbs. All of our estimands and estimators condition on n and A, but this only matters for interpretation if the estimands do in fact vary with n and A.
In theory, we could learn about the parameter of interest from an infinite number of draws (each of size n) from the underlying distribution. This is the inferential framework that corresponds to most previous work on causal inference with interference and/or social network data, in which it is typically assumed that multiple i.i.d. groups or networks are observed and asymptotics are in the number of groups. In this paper we tackle the more challenging and more realistic setting of data from a single interconnected network, and we perform inference using data from a single draw of size n. We will use n → ∞ asymptotics in the service of finite sample inference. That is, we will prove that our estimators are asymptotically normal under general and relatively realistic conditions; when these conditions are met and for large enough n, this licenses the use of asymptotic approximations to perform inference about the parameters indexing the data generating distribution for finite n. This, too, is analogous to the i.i.d. setting, where asymptotic results are used to license finite sample approximations for large enough n.
Research indicates that social networks often have the small-world property (sometimes referred to as the “six degrees of separation” property), meaning that the average distance between two nodes is small (Watts and Strogatz, 1998). Therefore distances in real-world networks may grow slowly (or not at all!) with sample size, making it difficult to apply existing dependent data theory and methods to this setting and necessitating a new central limit theorem that we prove in the Appendix. Formalizing asymptotic growth of network-generating models as n goes to infinity is an active area of research, especially for networks that, like social networks, have sparse limits (Caron and Fox, 2017) and is beyond the scope of this paper. Because we condition on the graph itself through the adjacency matrix A, we do not need to include the graph-generating mechanism in our asymptotic regime, obviating many of the complex issues surrounding asymptotic growth of networks. We take for granted a sequence of networks with increasing n, represented by adjacency matrices An, such that key features of the network topology, e.g. degree distribution or clustering, are preserved. The two crucial features of this sequence are (1) that the same structural equation model holds for each network in the sequence and (2) that the maximum degree grows slower than rate $n^{1/2}$. This asymptotic regime is consistent with realistic social networks. In particular, social networks tend to have heavy-tailed degree distributions, with most nodes having low degree but a non-trivial proportion of nodes having high degree, with the maximum degree dependent on the size of the network, resulting in asymptotically sparse networks (Newman and Park, 2003). This is incompatible with the assumption of bounded degrees, which has been used in previous methods for inference about observations sampled from a single social network. See Section 4.4 for a discussion of inference in the presence of highly connected “hub” nodes.
2.4. Review of influence functions for non-i.i.d. data and data-dependent estimands
In order to employ semiparametric theory in this dependent data setting, we draw on classical semiparametric theory as presented in Bickel et al. (1998), and on work extending these results to non-i.i.d. settings (Hubbard et al., 2016; Janssen and Ostrovski, 2013; McNeney and Wellner, 2000; van der Laan, 2014; van der Vaart and Wellner, 1996a). Many aspects of the classical theory, though developed and presented for i.i.d. data, are agnostic to the i.i.d. assumption. Key differences are that we index our parameters with n; we allow rates of convergence to be any function Cn of n, possibly slower than $\sqrt{n}$; and we do not assume an i.i.d. CLT.
Let P be the true distribution of the observed data O = (O1, …, On) and let ℳ ∋ P be a model that may place some restrictions on P(O); ℳ could be nonparametric, semiparametric, or parametric. Let $\{P_{\epsilon,h} : \epsilon\}_{h \in H} \subset \mathcal{M}$ be a rich class of one-dimensional parametric submodels, indexed by a direction h across a set H, such that $P_{0,h} = P$ for all h. Let $S_h(O) = \frac{d}{d\epsilon} \log p_{\epsilon,h}(O) \big|_{\epsilon = 0}$, where $p_{\epsilon,h}$ is the density of $P_{\epsilon,h}$, be the corresponding score function of the h-specific path. Then the pathwise derivative of parameter ψn at P along the path defined by $P_{\epsilon,h}$ is given by $\frac{d}{d\epsilon} \psi_n(P_{\epsilon,h})$ at ϵ = 0, and by the Riesz representation theorem $\frac{d}{d\epsilon} \psi_n(P_{\epsilon,h}) \big|_{\epsilon = 0} = E_P[\varphi_n(O) S_h(O)]$ for some mean zero function φn(O). We say that ψn is pathwise differentiable at P if this equation holds for all submodels in the class defined above, and then φn(O) is a gradient for ψn. Pathwise differentiability requires only that the pathwise derivative be a bounded linear operator from the tangent space (i.e. the closure of the linear span of the scores Sh) to the real line, and therefore does not rely on data being i.i.d. In general each gradient for ψn is the influence function of an estimator of ψn, and in subsequent sections we will use these terms interchangeably. In a nonparametric model ℳ that places no restrictions on P(O), there can be no more than one gradient for any parameter ψn, but in semiparametric or parametric models ψn will have many gradients. If φn is a gradient for ψn in a model ℳ it will also be a gradient for ψn in any submodel of ℳ, and in particular the nonparametric gradient will be a gradient in any semiparametric or parametric model for P(O).
We say that an estimator of ψn has influence function φn if , where {Oi} represents a subset of O but may include variables indexed by j for j ≠ i in addition to those that are indexed by i. The influence function of an estimator describes its first order behavior; the limiting distribution of an asymptotically linear estimator is given by the limiting distribution of . Under a central limit theorem, where the asymptotic variance σ2 of an estimator is given by the asymptotic variance of . Note that although the parameter of interest can depend on n, its limiting distribution does not. Because an influence function depends on ψn and has mean 0 at the truth, it can often be used as an estimating function; an estimator that solves will have influence function φn (provided that estimates of any unknown nuisance parameters satisfy appropriate regularity conditions).
3. Methods
The estimating procedure that we describe in this section is based on van der Laan (2014), but we generalize the results to a broader class of causal effects, to more general and pervasive forms of dependence among observations, and to more realistic asymptotic regimes. We focus throughout on single time-point treatments. Longitudinal interventions are also possible under the theory introduced here, in particular because the longitudinal extension of our structural equation model implies that, for each time point and after conditioning on the past, the data exhibit no more dependence than the single-time point case. We leave the details for future work. We state our results under the assumption that all variables take values in discrete sets. Analogous results are valid for other types of random variables: it is straightforward to extend our notation and central limit theorem to continuous covariates and outcomes, but continuous treatments, which are not standard in causal inference, are more challenging (see van der Laan, 2014). For estimands that condition on C, our theoretical results hold immediately if C is discrete but may require additional assumptions for continuous C.
3.1. Structural equation model
We use a causal structural equation model (SEM) to define the causal effects of interest but note that analogous definitions may be achieved within the potential outcome framework (Pearl, 2012).
Let Ki be the degree of node i. The degree of i and the degrees of i’s alters may be included in the covariate vector Ci. Define Y = (Y1, …, Yn) and C and X analogously. We assume that the data are generated by sequentially evaluating the following set of equations:
$$
\begin{aligned}
C_i &= f_C(\varepsilon_{C_i}), \\
X_i &= f_X(\{C_j : A_{ij} = 1\}, \varepsilon_{X_i}), \\
Y_i &= f_Y(\{X_j : A_{ij} = 1\}, \{C_j : A_{ij} = 1\}, \varepsilon_{Y_i}), \qquad i = 1, \ldots, n,
\end{aligned}
\tag{1}
$$
where fC, fX, and fY are unknown and unspecified functions that may depend on Ki and $\varepsilon_i = (\varepsilon_{C_i}, \varepsilon_{X_i}, \varepsilon_{Y_i})$ is a vector of exogenous, unobserved errors for individual i. The errors may be correlated across units, as described below. This set of equations represents an observational setting; all our results apply equally to experimental settings in which fX is constant or does not depend on C. Time ordering is a fundamental component of a structural causal model. We assume that C is first drawn for all units, so that, in addition to Ci, the other components of the vector C corresponding to i’s social contacts may affect the value of Xi. Similarly, X is drawn for all units before Y, so that the exposures of i’s social contacts may affect the value of Yi.
In addition, we make the following assumptions on the error terms from the SEM:
| (A1) |
| (A2a) |
| (A2b) |
| (A3a) |
| (A3b) |
A DAG corresponding to the SEM in (1) is given in Figure 1(a). Each node represents an n-vector; independence relationships across individuals (i.e. assumptions (A2) and (A3)) are not encoded in this DAG.
Fig. 1. (a) DAG corresponding to the SEM in Equation (1); (b) DAG corresponding to the SEM in Equation (A6). The red arrows are deterministic.
This SEM or DAG model encodes the assumption that C suffices to control for confounding of the effect of X on Y; this is a version of the conditional ignorability or no unmeasured confounding assumption that is typically made in i.i.d. settings.
In particular, any latent variable dependence affects X and Y separately; in general a latent variable that affects X and Y jointly is an unmeasured confounder and constitutes a violation of this assumption. Assumptions (A2b) and (A3b) relax the typical assumption of independent errors to permit latent variable dependence up to a distance of two network ties. Assumption (A3) can be omitted if attention is restricted to causal effects conditional on C.
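To illustrate what "up to a distance of two network ties" means for a given adjacency matrix, the following sketch flags the pairs of nodes that either share a tie or share a common alter. It is only a reading aid: the function name `within_two_ties` and the path network are assumptions, and the precise independence statements are those given in assumptions (A2b) and (A3b).

```python
import numpy as np

def within_two_ties(A):
    """Boolean matrix whose (i, j) entry is True when i == j, when i and j
    share a tie, or when i and j have at least one alter in common."""
    B = A.copy()
    np.fill_diagonal(B, 0)      # drop self-entries before counting paths
    two_step = B @ B            # (i, j) entry counts shared alters of i and j
    reach = (B + two_step) > 0
    np.fill_diagonal(reach, True)
    return reach

# Hypothetical path network 0 - 1 - 2 - 3
A = np.eye(4, dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1

reach = within_two_ties(A)
print(reach[0, 2])  # nodes 0 and 2 share the alter 1
print(reach[0, 3])  # nodes 0 and 3 are three ties apart
```

In this example nodes 0 and 3 are the only pair more than two ties apart, so they are the only pair for which this reading would not permit latent variable dependence.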
Although our main result, given in Theorem 1 below, holds for inference under assumptions (A2a)–(A3b), some estimation strategies are available only when stronger versions of assumptions (A2b) and (A3b) hold. We therefore introduce alternative assumptions
| (A4) |
| (A5) |
These assumptions, which were made in van der Laan (2014), are consistent with dependence due to direct transmission but not latent variable dependence.
In principle, the models defined by assumptions (A1)–(A3b) or by assumptions (A1), (A4), and (A5) suffice to nonparametrically identify many causal estimands of interest. However, in practice (and in order to facilitate the definition and identification of some kinds of dynamic and stochastic estimands) we may need to make simplifying assumptions on the forms of fX and fY. This is done by considering summary functions sC and sX and random variables Wi = sC, i({Cj : Aij = 1}) and Vi = sX, i({Xj : Aij = 1}) such that the model may be written as
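$$
\begin{aligned}
X_i &= f_X(W_i, \varepsilon_{X_i}), \\
Y_i &= f_Y(V_i, W_i, \varepsilon_{Y_i}).
\end{aligned}
\tag{A6}
$$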
For example, implies that the exposure and outcome of node i only depend on i’s own covariate value and on the sum of the covariate values of i’s alters. Analogously, is an example of a summary function for X. It may be the case that fX and fY depend on different summary functions of C, SC, X and SC, Y respectively. In this case, without loss of generality, we define sC = W to be equal to (SC, X, SC, Y). The natural choice of summary functions may not have the same length for all i. For example, it could be natural to define the summary function sC to be Ci for units with no alters and for units with alters. In order to enforce that SC, i and SX, i have the same length for all i, we set the length of each SC, i and SX, i to its maximum over i, and fill in any empty entries with the value undefined. For convenience we use the notation SC, i(C) and SX, i(X) below; however, this notation should not undermine the important fact that Wi can only depend on the subset {Cj : Aij = 1} of C and Vi can only depend on the subset {Xj : Aij = 1} of X, as these are the only components of C and X that are parents of X and Y, respectively, in the network-as-structural-causal-model. For notational convenience, in what follows we augment the observed data random vector with Vi and Wi, recognizing that these are deterministic functionals of Ci and Xi, defined by SX,i and SC,i, and are therefore technically redundant. A DAG corresponding to the SEM in (A6) is given in Figure 1(b). Although this SEM implies that we can use W and V in place of C and X for identification and estimation of causal effects, for generality and clarity we will continue to use C and X wherever we do not require the assumption of fixed dimensionality.
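As a concrete illustration of a fixed-dimension summary function, the sketch below computes, for each node, its own value together with the sum of its alters' values, the kind of summary described in the text. The function name `sum_summary` and the three-node network are hypothetical, and the same function is applied to C to obtain W and to X to obtain V purely for convenience.

```python
import numpy as np

def sum_summary(A, z):
    """Return an (n, 2) array whose i-th row is (z_i, sum of z_j over alters j of i)."""
    B = A.copy()
    np.fill_diagonal(B, 0)  # alters exclude the node itself
    return np.column_stack([z, B @ z])

# Hypothetical path network 0 - 1 - 2 with a scalar covariate and binary exposure
A = np.eye(3, dtype=int)
for i, j in [(0, 1), (1, 2)]:
    A[i, j] = A[j, i] = 1

c = np.array([1, 2, 3])
x = np.array([1, 0, 1])

W = sum_summary(A, c)  # W_i = (C_i, sum of alters' C)
V = sum_summary(A, x)  # V_i = (X_i, sum of alters' X)
```

Each row has length two regardless of a node's degree, which is what allows W and V to play the role of fixed-dimension inputs to fX and fY.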
Note that, although the variance-covariance structure of the SEM given in (1) is affected by the dependence allowed in (A2b) and (A3b), the mean structure is unaltered by the choice of assumptions (A2) and (A3) or (A4) and (A5), ruling out the possibility that any latent sources of dependence introduce confounding. In particular, while the SEM allows limited forms of homophily to induce dependence it rules out confounding due to homophily, where latent similarities affect both the exposure and the outcome and thereby induce confounding. Ruling this out is a strong and often unrealistic assumption (Shalizi and Thomas, 2011). Because the mean structure is unaffected by the dependence permitted by assumptions (A2) and (A3), any estimator that is unbiased under (A4) and (A5) will remain unbiased when these are relaxed to (A2) and (A3). In Section 3.2 we discuss nonparametric identification of causal parameters, which relies on the assumption of no unmeasured confounding but is agnostic to the choice of the weaker or stronger independence assumptions.
3.2. Definition and nonparametric identification of causal effects
We first define notation that we will use throughout the remainder of the paper for functionals of the distribution of O. Let pY(y | v, w) = P(Y = y | V = v, W = w) and pY, i(y | v, w) = P(Yi = y | Vi = v, Wi = w). Let pC(c) = P(C = c). We will use g to denote the propensity score distributions: g(x | w) = P(X = x | W = w) and gi(x | w) = P(Xi = x | Wi = w), and h for the corresponding distributions of W and V: hi(v | w) = P(Vi = v | Wi = w) and hi(v, w) = P(Vi = v, Wi = w). Let h* denote this distribution under the intervention of interest: and . This will be degenerate for static and deterministic interventions but not for stochastic interventions. Note that are determined by g, pC, and the user-specified intervention and is therefore an observed data quantity. Finally, is the conditional expectation of Y given V = v, W = w.
A hypothetical intervention on X replaces g(x | c) with a new, user-specified function g*; under Assumption (A6) this is equivalent to replacing h with h*. Equivalently, the intervention replaces fX in the SEM with a new, user-specified function. For example, a deterministic intervention that sets Xi to a user-given value for i = 1, …, n is given by
where $V_i^* = s_{X,i}(x^*)$. Here $Y_i(x^*)$ denotes the potential or counterfactual outcome of individual i in a hypothetical world in which P(X = x*) = 1. Analogously, $V_i^*$ is a counterfactual variable in a hypothetical world in which P(X = x*) = 1. Note that, although $V_i^*$ is counterfactual, its value is determined by the user-specified value x*, and it is therefore known. In the case of a stochastic intervention, X* and V* are random rather than fixed variables but their distributions are still known by design. In particular the distribution of V* is given by h*. Although any intervention on X induces an intervention on V, not all conceivable interventions on V will be compatible with the observed network. For example, suppose Vi is the average of Xi and Xj for Aij = 1, with X binary. Even if 0 and 1 are both in the support of V, it would not be possible to simultaneously assign $V_i^* = 0$ and $V_j^* = 1$ for i, j s.t. Aij = 1. In order to avoid such contradictions we recommend defining interventions in terms of X and g prior to expressing the induced intervention on V and h. The causal parameter of interest is the expected average potential outcome defined in Section 2.2.
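The compatibility problem described above can be checked by brute force on a toy network. The following is a minimal sketch, assuming the summary measure from the example (the average of a node's own exposure and its alters' exposures); the function names `induced_summary` and `is_compatible` are hypothetical and are not part of the authors' software.

```python
from itertools import product


def induced_summary(x, adjacency):
    """Return V_i = average of X_i and X_j over all j with A_ij = 1.

    This is the illustrative summary function from the text, not a
    general-purpose s_X.
    """
    n = len(x)
    v = []
    for i in range(n):
        group = [x[i]] + [x[j] for j in range(n) if adjacency[i][j] == 1]
        v.append(sum(group) / len(group))
    return v


def is_compatible(v_target, adjacency):
    """Check by enumeration whether some binary exposure vector x induces v_target."""
    n = len(adjacency)
    return any(
        induced_summary(list(x), adjacency) == v_target
        for x in product([0, 1], repeat=n)
    )


# Two nodes sharing a tie.
A = [[0, 1],
     [1, 0]]

print(induced_summary([1, 0], A))    # both nodes see the average of 1 and 0
print(is_compatible([0.0, 1.0], A))  # V_1 = 0 and V_2 = 1 cannot hold together
```

Enumeration is exponential in n and is only meant to illustrate why interventions are best specified on X first, with the intervention on V derived from it.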
In addition to the assumption of no unmeasured confounding, identification of relies on the positivity assumption that, for all i,
| (A7) |
This assumption states that, within levels of C, the values of V determined by the hypothetical intervention have positive probability under the observed data generating distribution. Now the causal parameter for a static intervention is nonparametrically identified as follows:
| (2) |
where the third equality follows from the assumption of conditional unconfoundedness. Assumption (A7) ensures that the conditioning event has positive probability and therefore that the conditional expectation is well-defined. Assumption (A6) is not required for nonparametric identification but we have used it above to simplify notation (and we will rely on it for estimation and inference). This identification result is equivalent to
| (3) |
From (2), it is clear that the conditional parameter is identified by .
Throughout, we will denote the functional of the observed data that identifies a causal estimand of interest as ψn. This is the statistical parameter about which we would like to perform inference.
3.3. Estimation
Estimation and inference for ψn require a statistical model ℳ for the distribution of the observed data P(O). That is, ℳ is a collection of distributions over O of which one element is the true data-generating distribution. The only restriction placed on the observed data distribution by Assumptions (A1)–(A3b) is that, for i, j such that Aij = 0 and ∄k with Aik = Akj = 1, we have that Ci ⊥ Cj, Xi ⊥ Xj | c, and Yi ⊥ Yj | c, x; Assumptions (A4) and (A5), and previous work such as van der Laan (2014) and Sofrygin and van der Laan (2017), restrict these conditional independences to hold for all i, j. Assumption (A6) implies that Y ⊥ C, X | W, V and X ⊥ C | W. Under assumption (A6) the probability distribution of the observed data may be factorized as
$$P(O = o) = p_C(c)\, g(x \mid w)\, p_Y(y \mid v, w) \tag{4}$$
suggesting that ℳ requires three components: a model for pC, a model for g, and a model for pY. Furthermore, the identification results in (2) and (3) indicate that the identifying functional ψn depends on pY only through m, and under assumption (A6) it depends on g only through h. The empirical distribution can be used throughout to nonparametrically average with respect to pC, but, when C is high-dimensional, h and m may not be nonparametrically estimable at rates of convergence that are fast enough to satisfy the regularity conditions of Theorem 1 (see Appendix). Therefore, we will specify a statistical model ℳ = ℳh × ℳm, where ℳh is a collection of conditional distributions for V given W such that the true conditional distribution is a member, and ℳm is a collection of conditional expectations of Y given V and W such that the true conditional expectation of Y is a member. Because P(C = c), g(x | w), and pY(y | v, w) are all variation independent, ℳh does not restrict P(C = c) or pY(y | v, w) and ℳm does not restrict P(C = c) or g(x | w). In principle we do not need to be able to estimate h and m at parametric rates in order for our estimator to achieve asymptotic normality at parametric rates (see Appendix for specific rate conditions). Nonparametric estimation procedures for dependent data are an active area of research (Bibaut et al., 2021) but may still be difficult to implement under latent variable dependence, and in our data analysis and simulations we rely on parametric models for h and m.
Under assumptions (A4)–(A7) an influence function for ψn, evaluated at a fixed value o of O, was derived by van der Laan (2014) and its sample average is given by
| (5) |
where , and . For a deterministic intervention, V* is not random; we discuss the case of random V* in Section 4. Dn(o) has expected value equal to 0 at the true ψn; this fact can be used to generate unbiased estimating equations for ψn. van der Laan (2014) showed that estimators with this influence function are doubly robust: the right hand side of Equation (5) has expected value equal to 0 if m(·) is replaced with an arbitrary functional of V or if h(·) is replaced with an arbitrary functional of W, as long as one of the two remains correctly specified. This implies that an estimating equation based on Equation (5) will be unbiased for ψn if either model ℳm for m(·) or model ℳh for h(·) is correctly specified, i.e. contains the truth, even if the other is not. See Section 6 of van der Laan (2014) for the derivation of the influence function and the proof of double robustness.
Although van der Laan (2014) derived this influence function under a model defined by assumptions (A4) and (A5) (in addition to (A6) and (A7), which we assume throughout), the same derivation holds under assumptions (A2) and (A3). This is because, as we argued above, the same functional ψn identifies the causal parameter under either set of assumptions, and furthermore any estimator that is (asymptotically) unbiased under (A4) and (A5) will remain so when these are relaxed to (A2) and (A3), implying that any influence function under the model defined by (A4) and (A5) is also an influence function under (A2) and (A3). In Theorem 1 we will prove that the resulting estimator is consistent and asymptotically normal (CAN) under assumptions (A2) and (A3).
Below we propose a targeted maximum loss-based estimator (TMLE) of ψn. All of the results that follow are equally applicable to a standard estimating equation approach in which estimating equations for the parameters indexing a model for m and a model for h are stacked with the influence function estimating equation for ψn (see, e.g., Kennedy, 2016). More details about implementation can be found in companion papers focused on implementation and computation (Sofrygin et al., 2017, 2018; Sofrygin and van der Laan, 2017) and an R package is available (Sofrygin and van der Laan, 2015).
TMLE is a general template for estimation of smooth parameters in semi- and nonparametric models. The estimation algorithm is constructed to solve an influence function estimating equation (hence the asymptotic equivalence with the estimating equation approach). In our setting, a TMLE is constructed using three elements: (i) a valid loss function L for the outcome regression model m, (ii) initial working estimators of m and and of h, and (iii) a parametric submodel mϵ of ℳ, the score of which corresponds to a particular component of the score based on the influence function Dn(o) and such that mϵ=0 = m(·). The TMLE is then defined by an iterative procedure that, at each step, estimates ϵ by minimizing the empirical risk of the loss function L at mϵ. An updated estimate is then computed as , and the process is repeated until convergence. The TMLE is the estimator obtained in the final step of the iteration. As a result of this iterative procedure, the influence function estimating equation is solved at the final step. For more details about targeted maximum likelihood estimation, see van der Laan and Rose (2011). In the present setting, the TMLE for ψn based on Dn(o) requires only one iteration for convergence (van der Laan and Rose, 2011). Initial parametric estimators of m and h may be found through maximum likelihood or loss-based estimation methods like standard regression models. (The proof of Theorem 1 also suffices to prove that an m-estimator for either of the nuisance models will be CAN for its expectation.) Alternatively, under a conditional independence structure analogous to that implied by assumptions (A1), (A4), and (A5), Benkeser et al. (2018) showed that super learning (van der Laan et al., 2007) can be used to nonparametrically estimate the nuisance models. The empirical distribution is used to marginalize with respect to pC.
We propose using a direct estimate of that optimizes the log likelihood function as if the pooled sample (Vi, Wi) were i.i.d. It can be shown that this results in a valid loss function for , even for dependent observations (Vi, Wi) (Sofrygin and van der Laan, 2017; van der Laan, 2014). Similarly, one can construct a direct estimator of by first creating a sample and then directly optimizing the log likelihood function , as if the pooled sample were i.i.d. We perform estimation of the conditional mixture density using a conditional histogram approach, previously described for i.i.d. data in Munoz and van der Laan (2011). The approach relies on fitting the conditional hazards of individual bins from the support of Vi (given Wi) using separate parametric logistic regression models. In our highly-dependent network settings, the operational characteristics of the direct estimator of are unclear. However, we believe that the enormous computational advantages offered by this direct estimation route, along with the encouraging results obtained from our extensive simulations, merit the description of this estimator. We also realize that more theoretical work is needed to justify and improve upon this direct approach. For additional details and simulation results that demonstrate the performance of the direct estimation approach for mixture density , we refer to Sofrygin et al. (2017, 2018); Sofrygin and van der Laan (2017).
Now the TMLE of ψn is computed as follows:
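- Define the auxiliary weights Hi as the ratio of the estimated densities of (V*, W) and (V, W), evaluated at the observed values (Vi, Wi):
$$H_i = \frac{\hat{h}^*(V_i, W_i)}{\hat{h}(V_i, W_i)}.$$
- Compute initial predicted outcome values $\hat{m}(V_i, W_i)$ and predicted potential outcome values $\hat{m}(V_i^*, W_i)$ evaluated at the counterfactual value $V_i^*$. Under a static intervention $V_i^*$ is the degenerate random variable sX,i(x*).
- Construct a TMLE model update of $\hat{m}$ by running a weighted intercept-only logistic regression model with the weights Hi defined in the first step, Yi as the outcome, and $\operatorname{logit} \hat{m}(V_i, W_i)$ as an offset. That is, define $\hat{\epsilon}$ as the estimate of the intercept parameter ϵ from the following weighted logistic regression model
$$\operatorname{logit} m_\epsilon(V_i, W_i) = \operatorname{logit} \hat{m}(V_i, W_i) + \epsilon,$$
where $\operatorname{logit}(p) = \log\{p / (1 - p)\}$.
- Compute updated predicted potential outcomes as the fitted values of the regression from the previous step, evaluated at v* rather than v (that is, at $V_i^*$ instead of $V_i$):
$$\hat{m}^*(V_i^*, W_i) = \operatorname{expit}\{\operatorname{logit} \hat{m}(V_i^*, W_i) + \hat{\epsilon}\},$$
where $\operatorname{expit}(z) = 1 / (1 + e^{-z})$, i.e. the inverse of the logit function.
- Compute the TMLE $\hat{\psi}_n$ as
$$\hat{\psi}_n = \frac{1}{n} \sum_{i=1}^{n} \hat{m}^*(V_i^*, W_i).$$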
The TMLE is doubly robust: it will be consistent for ψn if either the working model for h (in this case comprised of models for and ) or the working model for m is correctly specified. This resulting estimator remains CAN for ψn under assumptions (A2) and (A3) or (A4) and (A5), and the same procedure can be used to estimate the parameter conditional on C.
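The targeting step described above (a weighted intercept-only logistic regression with an offset, followed by averaging the updated predictions) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the outcome is assumed to lie in [0, 1], the initial predictions are assumed to lie strictly between 0 and 1, the intercept is fitted by a plain Newton iteration, and the names `fit_epsilon` and `tmle_estimate` are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_epsilon(y, m_obs, weights, n_iter=50, tol=1e-10):
    """Fit the intercept of a weighted logistic regression with offset logit(m_obs).

    y       : observed outcomes Y_i
    m_obs   : initial predictions at the observed (V_i, W_i), each in (0, 1)
    weights : auxiliary weights H_i
    """
    offsets = [logit(p) for p in m_obs]
    eps = 0.0
    for _ in range(n_iter):
        fitted = [expit(o + eps) for o in offsets]
        # Weighted score and information for the intercept.
        score = sum(h * (yi - p) for h, yi, p in zip(weights, y, fitted))
        info = sum(h * p * (1.0 - p) for h, p in zip(weights, fitted))
        if info == 0.0:
            break
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps


def tmle_estimate(y, m_obs, m_star, weights):
    """Update the predictions at the counterfactual values and average them.

    m_star : initial predictions at the counterfactual (V_i*, W_i), each in (0, 1)
    """
    eps = fit_epsilon(y, m_obs, weights)
    updated = [expit(logit(p) + eps) for p in m_star]
    return sum(updated) / len(updated), eps
```

At the fitted intercept the weighted score is zero, which is the sense in which the update solves the influence function estimating equation for the outcome-regression component.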
3.4. Asymptotic normality
We consider an asymptotic regime in which Ki may grow as n → ∞ and prove that converges to a normal limiting distribution under assumptions (A2) and (A3), which implies the same result under Assumptions (A4) and (A5). The proof relies on an analysis of the influence function of and is therefore agnostic to estimating procedure (i.e. it holds for the TMLE and estimating equation approaches). However, it requires that one of the models for m and h be correctly specified and converge to the truth at rate or that both models converge to the truth such that the product of their rates of convergence is , e.g. both converge to the truth at rate (see the regularity conditions in the Appendix).
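Theorem 1: Suppose that $K_{\max,n}^2 / n \to 0$ as n → ∞, where Kmax,n = maxi{Ki} for network size n. Under assumptions (A2), (A3), (A6), (A7), and regularity conditions (see Appendix),
$$\sqrt{C_n}\,(\hat{\psi}_n - \psi_n) \xrightarrow{d} N(0, \sigma^2)$$
for some finite σ² and for some Cn such that $n / K_{\max,n}^2 \le C_n \le n$.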
The asymptotic variance σ² of $\hat{\psi}_n$ is given by the asymptotic variance of Dn(o), the sample average of the influence function of the estimator. The proof of Theorem 1 is in the Appendix. Broadly, the proof has two parts: first, to show that the second order terms in the expansion of $\hat{\psi}_n - \psi_n$ are stochastically less than $1/\sqrt{C_n}$, and second, to show that the first order terms converge to a normal distribution when scaled by a factor of order $\sqrt{C_n}$. The proof that the second order terms are stochastically less than $1/\sqrt{C_n}$ is an extension of the empirical process theory of van der Vaart and Wellner (1996b) and follows from the proof in van der Laan (2014). For the proof that the first order terms converge to a normal distribution, we rely on Stein’s method of central limit theorem proof (Stein, 1972). Stein’s method allows us to derive a bound on the distance between our first order term (properly scaled) and a standard normal distribution; this bound depends on the degree distribution K1, …, Kn. We show that this bound converges to 0 as n → ∞ under regularity conditions and our running assumption that $K_{\max,n}^2 / n \to 0$.
When all nodes have the same number of ties, i.e. Ki = Kmax,n for all i, then the rate of convergence will be given by $\sqrt{n / K_{\max,n}^2}$. When Kmax,n is bounded above as n → ∞, as in van der Laan (2014), the rate of convergence will be $\sqrt{n}$. When Kmax,n → ∞ but some nodes have fewer than Kmax,n ties, the exact rate of convergence is between $\sqrt{n / K_{\max,n}^2}$ and $\sqrt{n}$ but is difficult or impossible to determine analytically, as it may depend intricately on the structure of the network. The inferential procedures that we describe below do not require knowledge of the rate of convergence.
In Section 4.4, below, we discuss settings in which the conditions for this theorem fail to hold, and ways to recover valid inference for conditional estimands in some of these settings.
3.5. Inference
An asymptotically valid 95% confidence interval for ψn is given by $\hat{\psi}_n \pm 1.96\,\sigma / \sqrt{C_n}$. In practice neither σ nor Cn is likely to be known, but available variance estimation methods estimate the variance of $\hat{\psi}_n$ directly, incorporating the rate of convergence without requiring it to be known a priori.
In principle, an estimate of the variance of $\hat{\psi}_n$ can always be obtained by the plug-in estimator of the variance of the influence function, which depends on the observed data only through m(·), g(·), and pC(c) (see Appendix). When dependence is due to direct transmission, that is under assumptions (A1), (A4), and (A5), C is i.i.d. and pC(c) can be estimated with the empirical distribution of C. We prove in the Appendix that the plug-in estimator using the empirical average of the square of the influence function, substituting $\hat{\psi}_n$ for ψn and the fitted values from the working models for h and m, is consistent under correct specification of models for both m and h. Although the estimator $\hat{\psi}_n$ is doubly robust, this variance estimator is not, and it may be anticonservative if one, but not both, of the models for m and h is correctly specified. Using flexible or nonparametric specifications for these models increases opportunities to estimate both consistently; this is feasible using i.i.d. methods when dependence is due to direct transmission because the dependent variables in each of the two nuisance models are conditionally i.i.d. For a detailed discussion of how to implement this variance estimator, see Sofrygin and van der Laan (2017).
An alternative approach to estimate the variance of $\hat{\psi}_n$ under assumptions (A1), (A4), and (A5) is to employ the following version of a parametric bootstrap, which might offer improvements in finite-sample performance over the previously described approach. For each of B bootstrap iterations, indexed by b = 1, …, B, first n covariates Cb are sampled with replacement, then the existing exposure model fit is used to sample n exposures Xb, followed by a sample of n outcomes based on the existing outcome model fit. The corresponding bootstrap summaries, for i = 1, …, n, are constructed by applying the summary functions sC and sX to Cb and Xb, respectively. This bootstrap sample is then used to obtain the predicted values from the existing auxiliary covariate fit, for i = 1, …, n, followed by a bootstrap-based fitting of ϵ and, finally, evaluation of the bootstrap TMLE. Note that the TMLE model update is the only model fitting step needed at each iteration of the bootstrap, which significantly lowers the computational burden of this procedure. The variance estimate is then obtained by taking the empirical variance of the B bootstrap TMLE estimates. Because the parametric bootstrap relies on known or assumed independences, and because only the TMLE model (i.e. not the full likelihood) is fit at each iteration, this procedure consistently estimates the variance of the first order terms in the expansion of $\hat{\psi}_n - \psi_n$, and we prove in the Appendix that the higher order terms are asymptotically negligible. However, due to dependence across observations, one must be judicious with applications of the bootstrap. For example, the parametric bootstrap procedure described above requires conditional independence of Xi given Wi and Yi given (Vi, Wi), along with the consistent modeling of the corresponding factors of the likelihood.
It may seem natural to sample Vi directly from its corresponding auxiliary model fit, but this is likely to result in anti-conservative variance estimates, since the conditional independence structure assumed for V is unlikely to hold by virtue of its construction as a summary measure of the network.
When latent variable dependence is present, that is under assumptions (A1) through (A3), the empirical distribution may not consistently estimate functionals of pC(c) at the required rate. This is a problem for both methods of variance estimation; we describe three possible strategies for overcoming this challenge but acknowledge that future research is needed to devise better solutions. The first option is to use a version of the block bootstrap to estimate the required functionals of pC(c). Bootstrap methods for network data are an area of active research and beyond the scope of this paper but we note that, under our dependence assumptions, {CI} ⊥ {CJ} for sets of indices I and J such that no node in I is connected to any node in J by a path of fewer than 3 ties. This is an m-dependence structure and well-established methods for block bootstrap for m-dependent data are immediately applicable (Lahiri, 2003). However, unlike standard m-dependence settings where the underlying topology is Euclidean, it could be computationally challenging to identify independent blocks in network data. We leave implementation of this procedure for future research. A second option is to postulate a parametric model for the required functionals of pC(c). Correctly specifying the joint distribution of C may be challenging in many settings. Finally, the third option is to restrict attention to conditional estimands, for which variance estimation does not require estimating functionals of pC(c). A simple plug-in estimator is available for the variance of the conditional influence function (see the Appendix and van der Laan, 2014) and this is the approach that we take in our simulations and data analysis. However, our theoretical results may require additional assumptions when C is continuous.
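The independence condition for blocks of nodes (no path of fewer than 3 ties between any node in I and any node in J) can be verified with a breadth-first search. The sketch below is only an illustration of that check, not a block bootstrap implementation; the function names and the `min_distance` parameter are hypothetical, and the adjacency matrix is assumed to be symmetric with a zero diagonal.

```python
from collections import deque


def distances_from(source, adjacency):
    """Breadth-first search: shortest path length (in ties) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, tie in enumerate(adjacency[u]):
            if tie == 1 and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def blocks_independent(block_i, block_j, adjacency, min_distance=3):
    """Return True if every node in block_i is at least min_distance ties from every node in block_j."""
    for i in block_i:
        dist = distances_from(i, adjacency)
        for j in block_j:
            if j in dist and dist[j] < min_distance:
                return False
    return True


# Path network 0 - 1 - 2 - 3 - 4.
A = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]

print(blocks_independent([0], [3], A))  # nodes 0 and 3 are three ties apart
print(blocks_independent([0], [2], A))  # nodes 0 and 2 share the mutual alter 1
```

Each search costs time proportional to the size of the adjacency matrix, so checking many candidate blocks in a large network is where the computational difficulty mentioned above arises.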
4. Extensions
In this section we extend the estimation procedure to two causal effects of great interest in the context of social networks: social contagion, or peer effects, and interventions on the network structure itself, i.e. interventions on A = [Aij : i, j ∈ {1, …, n}] where, as above, Aij ≡ I{subject i and j share a tie}.
4.1. Dynamic and stochastic interventions
A dynamic intervention assigns exposures as a user-specified, deterministic function of covariates. We operationalize this as the replacement of h(v | w) with a new, user-specified function h* that depends on w but is nonrandom. A stochastic intervention assigns exposures as a user-specified, random function. We operationalize this as the replacement of h(v | w) with a new, user-specified, random distribution h* that may depend on w. In contrast, a static intervention replaces h with a constant function.
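The three intervention types can be contrasted with a small sketch written at the level of exposure assignment for a single unit. Everything here is illustrative: the scalar covariate summary `w`, the threshold rule, and the treatment probability are hypothetical choices, not quantities taken from the analysis.

```python
import random


def static_intervention(w, x_star=1):
    """Static: every unit receives the same fixed exposure, regardless of w."""
    return x_star


def dynamic_intervention(w, threshold=2):
    """Dynamic: exposure is a deterministic function of the covariate summary w."""
    return 1 if w >= threshold else 0


def stochastic_intervention(w, prob=0.35, rng=random):
    """Stochastic: exposure is a random draw from a user-specified distribution.

    Here the distribution does not depend on w, but it could.
    """
    return 1 if rng.random() < prob else 0


rng = random.Random(0)
covariates = [0, 1, 2, 3]
print([static_intervention(w) for w in covariates])
print([dynamic_intervention(w) for w in covariates])
print([stochastic_intervention(w, rng=rng) for w in covariates])
```

In each case the induced distribution h* of the exposure summary V is known by design once the rule for X is fixed.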
It is sometimes more natural to think about dynamic and stochastic interventions in terms of an intervention SEM that modifies fX, replacing it with a user-specified function of W. As long as the intervention SEM adheres to Assumption (A6), an intervention on fX induces an intervention distribution h* – this induced distribution is required for our estimation results to hold. Alternatively, we can imagine intervening on the functional form of SX. In particular, we can define two different summary functions: , a user-specified functional describing the hypothetical dependence of X on C, and , a user-specified functional describing the hypothetical dependence of Y on X. They are denoted by an asterisk because they index hypothetical interventions rather than realized data-generating mechanisms. Let and . Then the intervention SEM for this stochastic intervention is given by
| (6) |
This can be interpreted as an intervention where, for each x* in the support of X and for i = 1, …, n, Xi is set to x* with probability and Vi is set to deterministically for each possible realization x*. Because Y depends on X only through V, this is equivalent to an intervention that sets Vi to v with probability where . That is, it is equivalent to an intervention on h with the intervention distribution given by . Potential outcomes under dynamic and stochastic interventions are identified under the same no unmeasured confounding and positivity assumptions as deterministic interventions. The conditional support of V* must be included in the conditional support of V in order for the intervention to be supported by the data and the positivity assumption to hold. For dynamic interventions, identification follows (2) and (3); the only difference is that is determined by wi rather than being specified directly. Under a stochastic intervention, and are random and potential outcomes are identified as follows:
This generalizes from Equation (3) which reflects the fact that, for a static intervention, only puts mass on the single value . Letting ψn denote this observed data functional, a sample average influence function is given by Dn(o) in equation (5). The only piece of this IF that depends on the nature of the intervention is the form of h*.
The TMLE of ψn is computed according to the steps outlined in Section 3; the fact that X* and V* may be random does not affect the estimation algorithm. (Additional details and examples can be found in Sofrygin and van der Laan, 2017.) Theorem 1 is agnostic about whether or not h* is degenerate; therefore the same asymptotic results hold for the estimation of potential outcomes under static, dynamic, and stochastic interventions.
4.2. Peer effects
Define $Y_j^0$ to be the outcome variable measured at a time previous to the primary outcome measurement Yj. Peer effects are the class of causal effects of $Y_j^0$ on Yi for Aij = 1: the effects of alters’ outcomes on the subsequent outcome of an ego. If we let $X_j = Y_j^0$, we can use the framework above to define, identify, and estimate static, dynamic, and stochastic peer effects.
In order to maintain the identifying assumptions (A2b) and (A3b), the time elapsed between $Y^0$ and Y must permit transmission only between nodes and their immediate alters. Otherwise, if the outcome could have spread contagiously more broadly, there will be more dependence present than our methods can account for, and also possible confounding of the effect of $Y_i^0$ on Yj for Aij = 1 due to mutual connections.
4.3. Interventions on network structure
As a special case of interventions on sX(·) or on sC,X(·) we consider interventions determined by changes to the network itself, i.e. interventions that add, remove, or relocate ties in the network. Consider an intervention that modifies sX by replacing the observed adjacency matrix A with a user-specified adjacency matrix A*. This intervention replaces sX,i(X) with . The intervention SEM differs from the data-generating SEM only in that Yi depends on the counterfactual treatments for the individuals with whom i shares ties in the intervention adjacency matrix A*. Similarly, an intervention on sC,X(·) would result in a dynamic intervention where depending on the covariate values for the individuals with whom i shares ties in the intervention adjacency matrix A*.
Interventions on summary features of the adjacency matrix can be operationalized as stochastic interventions. Instead of replacing A with a user-specified A*, an intervention on features of the network structure might replace A with a random draw from a class 𝒜* of n × n adjacency matrices that share the intervention features, stochastically according to some probability distribution over 𝒜*. For example, we might be interested in an intervention on sX that constrains the degree distribution of the network, e.g. fixing the maximum degree to be smaller than some D. We might specify , giving equal weight to each realization in the class 𝒜*. This kind of intervention sets Vi to v with probability . Note that defining a feasible intervention with respect to changes to the adjacency matrix is possible due to assumption (A6). This is a strong assumption; if network structure can affect Y via mechanisms of interest that do not operate through sX(·) then estimating these effects is more challenging (Ogburn et al., 2014; Toulis et al., 2018).
As with the stochastic interventions discussed in Section 4.1, positivity is a crucial assumption for identifying interventions on A: the support of V* must be contained in the support of V. If replacing A with A* assigns to unit i a value of V that is not observed in the real data for a unit in the same W stratum as i, then the effect of the intervention that replaces A with A* is not identified for unit i. In general it may be possible to identify interventions on local but not global features of network structure. Examples of local features of network structure include the degree of subject i and local clustering around subject i: they depend on A only through subject i and subject i’s immediate contacts. A local clustering coefficient for node i can be defined as the proportion of potential triangles that include i as one vertex and that are completed, or the number of pairs of neighbors of i who are connected divided by the total number of pairs of neighbors of i (Newman, 2009). This measure of triangle completion captures the extent to which “the friend of my friend is also my friend”: triangle completion is high whenever two subjects who share a mutual contact are more likely to themselves share a tie than are two subjects chosen at random from the network. Positivity could hold if, within each level of W, subjects were observed to have a wide range of degrees and of triangle completion among their contacts. In contrast with degree and local clustering, network centrality is a node-specific attribute that nevertheless depends on the entire network structure. It captures the intuitive notion that some nodes are central and some nodes are fringe in any given network.
It can be measured in many different ways, based, for example, on the number of network paths that intersect node i, on the probability that a random walk on the network will intersect node i, or on the mean distance between node i and the other nodes in the network (see Chapter 7 of Newman (2009) for a comprehensive discussion of these and other centrality measures). Centrality is given by a univariate measure for each node in a network, but each node’s measure depends crucially on the entire graph. In reality it is not generally possible to intervene on centrality without altering the entire adjacency matrix A, and the positivity assumption is unlikely to hold.
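The two local features mentioned above, degree and the local clustering coefficient, depend only on a node's own row of A and on the ties among its neighbors. A minimal sketch follows, assuming a symmetric 0/1 adjacency matrix with a zero diagonal; returning 0.0 for nodes with fewer than two neighbors is a convention chosen here for illustration, and the function names are hypothetical.

```python
def degree(i, adjacency):
    """Number of ties of node i (assumes a zero diagonal)."""
    return sum(adjacency[i])


def local_clustering(i, adjacency):
    """Connected pairs of neighbors of i divided by all pairs of neighbors of i."""
    neighbours = [j for j, tie in enumerate(adjacency[i]) if tie == 1 and j != i]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no pairs of neighbors exist; convention assumed here
    closed = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if adjacency[neighbours[a]][neighbours[b]] == 1
    )
    return closed / (k * (k - 1) / 2)


# Triangle on nodes 0, 1, 2 with a pendant node 3 attached to node 0.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]

for node in range(4):
    print(node, degree(node, A), local_clustering(node, A))
```

Tabulating these two quantities within levels of W is one informal way to see whether the observed data cover the values an intervention on A would assign.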
4.4. Too many friends, too much influence
The conditions of Theorem 1 will be violated for any asymptotic regime in which the degree of one or more nodes grows at a rate equal to or faster than $\sqrt{n}$. This is problematic because social networks frequently have a small number of “hubs”, that is, nodes with very high degree (Newman, 2009). When a small number of individuals wield influence over a significant portion of the rest of the population, two problems arise for statistical inference. First, the number of hubs may stay small as n increases. If the hubs are systematically different from the rest of the population, then a fixed or slowly growing number of hubs would not allow for consistent inference about this distinct subpopulation. Second, and more importantly, the sweeping influence of hubs creates dependence among all of the influenced nodes that undermines inference. Our methods rely on the independence of Yi and Yj whenever nodes i and j do not share a tie or a mutual alter. When hubs are present, a significant proportion of nodes will share a connection to one of these hubs, undermining our methods.
We can recover valid inference using our methods if we condition on the hubs, treating them as features of the background network environment rather than as observations. This results in different causal effects or statistical estimands, as all of our inference is conditional on the identity and characteristics of the hubs. Imagine a social network comprised of the residents of a city in which a cultural or political leader is connected to almost all of the other nodes. It may be impossible to disentangle the influence of this leader, which affects every other node, from other processes simultaneously occurring among the other residents of the city. It will certainly be impossible to statistically learn about the hub, as the sample size for the hub subgroup is 1. But it may make sense to consider the hub as a feature of the city rather than a member of the network. We could then learn about other processes occurring among the other residents of the city, conditional on the behavior and characteristics of the leader. For example, we could evaluate the effect of a public health initiative encouraging residents to talk to their friends about the importance of exercise, but we could not evaluate a similar program targeting the leader’s communication about exercise.
Practically speaking, this implies that the methods we have proposed are inappropriate for networks in which the degree is large, compared to n, for one or more nodes. If many nodes are connected to a significant fraction of other nodes, this problem is intractable. However, if only a small number of nodes are highly connected we can condition on them to recover approximately valid inference using our methods for conditional estimands. There is a theoretical tradeoff between the rate of convergence of our estimators and the order of K relative to n that, in finite samples, becomes a practical tradeoff between generality and variance. Increasing the number of nodes classified as hubs will increase the rate of convergence by decreasing the size of K for the remaining, non-hub nodes (assuming that the number of hubs remains small compared to n so that the sample size does not decrease significantly when we exclude hubs from the analysis). On the other hand, classifying more nodes as hubs results in analyses that are increasingly specific: conditioning on a single hub may preserve generalizability to other networks (similar cities with similar leaders), but conditioning on many hubs is likely to limit the generalizability of the resulting inference.
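The tradeoff above can be explored by classifying nodes as hubs at different degree cutoffs and inspecting the maximum degree among the remaining nodes. The sketch below assumes that conditioning on hubs amounts to dropping their rows and columns from the adjacency matrix before recomputing degrees; the cutoff `max_degree` and the function name `split_hubs` are hypothetical.

```python
def split_hubs(adjacency, max_degree):
    """Classify nodes with degree above max_degree as hubs and return the reduced network.

    Returns (hubs, kept_nodes, reduced_adjacency, k_max_remaining).
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    hubs = [i for i in range(n) if degrees[i] > max_degree]
    keep = [i for i in range(n) if degrees[i] <= max_degree]
    # Ties to hubs are removed: hubs are treated as part of the environment.
    reduced = [[adjacency[i][j] for j in keep] for i in keep]
    k_max = max((sum(row) for row in reduced), default=0)
    return hubs, keep, reduced, k_max


# Node 0 is tied to everyone; nodes 1 and 2 also share a tie.
A = [[0, 1, 1, 1, 1],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 0, 0, 0, 0]]

hubs, keep, reduced, k_max = split_hubs(A, max_degree=2)
print(hubs, keep, k_max)
```

Lowering the cutoff removes more nodes and shrinks the remaining maximum degree, at the cost of conditioning on a larger and more network-specific set of hubs.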
5. Simulations
We conducted a simulation study that evaluated the finite sample and asymptotic behavior of the TMLE procedure described in Section 3.3. We generated social networks of size n = 500, n = 1,000, and n = 10,000 according to the preferential attachment model (Barabási and Albert, 1999), where the node degree (number of alters) distribution followed a power law with α = 0.5. We generated data with two different types of dependence: first with dependence due to direct transmission only, and second with both latent variable dependence and dependence due to direct transmission. Details of the simulations, along with results for networks generated under the small world model (Watts and Strogatz, 1998), are in the Appendix.
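For readers unfamiliar with preferential attachment, the following is a minimal sketch of the standard Barabási and Albert growth mechanism. It is an assumption-laden illustration rather than the generator used in our simulations: the function name, the number of ties added per new node (`m`), and the seed graph are hypothetical choices, and the sketch does not reproduce the power-law parameterization with α = 0.5 described above.

```python
import random

def preferential_attachment(n, m=2, seed=0):
    """Grow an undirected network in which new nodes attach to existing
    nodes with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Seed graph (assumption): a complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `stubs` once per incident tie, so a uniform
    # draw from `stubs` is a degree-proportional draw over nodes.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

edges = preferential_attachment(500, m=2, seed=1)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
```

Early nodes accumulate ties as the network grows, which produces the small number of highly connected nodes that motivates the discussion of hubs in Section 4.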
Our simulations mimicked a hypothetical study designed to increase the level of physical activity in a population composed of members of a social network. For each community member indexed by i = 1, …, n, the study collected data on i’s baseline covariates, denoted Ci, which included an indicator of being physically active, denoted PAi, and on i’s network of alters (or friends), denoted Fi. The exposure or treatment, Xi, was assigned randomly to 25% of the community. For example, one can imagine a study where treated individuals received various economic incentives to attend a local gym. The outcome Yi was a binary indicator of maintaining gym membership for a pre-determined follow-up period. We estimated the average of the mean counterfactual outcomes under various hypothetical interventions h* on such a community. First, we considered a stochastic intervention that assigned each individual to treatment with a constant probability of 0.35; this differs from the observed allocation of treatment to 25% of the community members. We also considered a scenario in which the economic incentive was resource constrained and could be allocated to at most 10% of community members. We estimated the effects of various targeted approaches to allocating the exposure. For example, we considered an intervention that targeted only the top 10% most connected members of the community, as such a targeted intervention would be expected to have a higher impact on the overall average probability of maintaining gym membership in the community than purely random assignment of the exposure to 10% of the community. Another hypothetical intervention assigned an additional physically active friend to individuals with fewer than 10 friends. This is an intervention on the structure of the social network itself. Finally, we estimated the combined effect of simultaneously implementing an economic-incentive intervention and the network-based intervention on the same community.
For simplicity, we report the expected outcome under each of these interventions; causal effects defined as contrasts of these interventions can be easily estimated using the same methods.
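The three kinds of hypothetical interventions described above (stochastic, degree-targeted, and network-modifying) can be sketched as simple assignment rules. This is an illustrative sketch only: the function names are hypothetical, ties in degree are broken by node index, and the added friend is drawn uniformly from physically active non-friends, none of which is specified in the text.

```python
import random

def stochastic_assignment(n, p, rng):
    """Treat each individual independently with constant probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def top_degree_assignment(alters, fraction=0.10):
    """Treat the `fraction` most connected individuals (ties broken by index)."""
    n = len(alters)
    n_treated = round(fraction * n)
    ranked = sorted(range(n), key=lambda i: (-len(alters[i]), i))
    treated = set(ranked[:n_treated])
    return [1 if i in treated else 0 for i in range(n)]

def add_active_friend(alters, active, rng, max_friends=10):
    """Give one extra physically active alter to anyone with fewer than
    max_friends alters. Assumption: the new alter is chosen uniformly at
    random among active individuals who are not already alters."""
    new_alters = [set(a) for a in alters]
    for i, friends in enumerate(alters):
        if len(friends) < max_friends:
            candidates = [j for j in range(len(alters))
                          if active[j] and j != i and j not in friends]
            if candidates:
                new_alters[i].add(rng.choice(candidates))
    return new_alters

# Toy community: individual 0 lists everyone as an alter; everyone else lists only 0.
rng = random.Random(0)
alters = [set(range(1, 20))] + [{0} for _ in range(19)]
active = [i % 2 == 0 for i in range(20)]
x_random = stochastic_assignment(10_000, 0.35, rng)
x_targeted = top_degree_assignment(alters, fraction=0.10)
alters_new = add_active_friend(alters, active, rng)
```

The first two rules change who receives the exposure while holding the network fixed; the third leaves the exposure alone and changes the network itself, and the two can be applied together for a combined intervention.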
The results from the simulations with dependence due to direct transmission are shown in the left panel of Figure 2. We estimated the marginal parameter and compared three different estimators of the asymptotic variance and the coverage of the corresponding confidence intervals. First, we looked at the naive plug-in estimator of the variance of the influence curve, which treated observations as if they were i.i.d. (“i.i.d. Var”). Second, we used the plug-in variance estimator based on the influence function, which adjusted for the correlated observations (“dependent IC Var”) (Sofrygin and van der Laan, 2017). Finally, we used the parametric bootstrap variance estimator (“bootstrap Var”) described in Section 3.5. The results from the simulations with latent variable dependence are in the right panel of Figure 2. We estimated the conditional parameter and compared two plug-in variance estimators based on the conditional influence function: one that assumes conditionally i.i.d. outcomes (conditional on X and C), which would be true if all dependence were due to direct transmission but is violated in the presence of latent variable dependence (“i.i.d. Var”), and one that does not make this assumption (“dependent IC Var”). In the Appendix we compare histograms of the estimates to the predicted normal limiting distribution.
Fig. 2. Mean 95% CI length and coverage for the TMLE in preferential attachment networks with dependence due to direct transmission (left panel) and with latent variable dependence (right panel), by sample size, intervention and CI type.
One of the lessons of our simulation study is that by leveraging the structure of the network it might be possible to achieve a larger overall intervention effect at the population level (Harling et al., 2016). For example, the results in the left panel of Figure 2 show that targeting the exposure assignment to highly connected and physically active individuals increases the mean probability of sustaining gym membership compared to a similar level of untargeted coverage of the exposure. We also demonstrated the feasibility of estimating effects of interventions on the observed network structure itself, such as the intervention that assigns an additional physically active friend, which can also be combined with economic incentives, as in our combined hypothetical intervention. These combined interventions could be particularly useful in resource constrained environments, since they may result in larger community level effects at lower coverage of the exposure assignment.
Results from simulations with dependence due to direct transmission show that conducting inference while ignoring the nature of the dependence in such datasets generally results in anti-conservative variance estimates and under-coverage of CIs, which can be as low as 50% even for very large sample sizes (“i.i.d. Var” in the left panel of Figure 2). The CIs based on the dependent variance estimates (“dependent IC Var”) obtain nearly nominal coverage of 95% for large enough sample sizes, but can suffer in smaller sample sizes due to lack of asymptotic normality and near-positivity violations. Notably, the CIs based on the parametric bootstrap variance estimates provide the most robust coverage for smaller sample sizes, while attaining the nominal 95% coverage in large sample sizes for nearly all of the simulation scenarios (“bootstrap Var”). The apparent robustness of the parametric bootstrap method for inference in small sample sizes, even as low as n = 500, was one of the surprising findings of this simulation study. Similarly, in the simulations with latent variable dependence the variance estimates that assume conditionally i.i.d. outcomes, i.e. that dependence may be due to direct transmission but not to latent variables, are anti-conservative.
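The contrast between the naive and dependence-adjusted plug-in variance estimators can be sketched as follows. This is a simplified illustration, not the estimator implemented in our code: it takes estimated influence curve values as given, and it assumes (for illustration only) that two units are potentially dependent when they are the same unit, share a tie, or share an alter. All function names are hypothetical.

```python
import math

def dependence_sets(alters):
    """For each unit i, the set of units j treated as potentially dependent
    with i (assumption: same unit, tied, or sharing at least one alter)."""
    n = len(alters)
    closed = [set(a) | {i} for i, a in enumerate(alters)]
    return [{j for j in range(n) if closed[i] & closed[j]} for i in range(n)]

def iid_variance(ic):
    """Naive variance of the estimator: (1/n^2) * sum_i ic_i^2 after centering."""
    n = len(ic)
    mean = sum(ic) / n
    return sum((v - mean) ** 2 for v in ic) / n ** 2

def dependent_ic_variance(ic, dep):
    """Adds cross-products ic_i * ic_j for every potentially dependent pair."""
    n = len(ic)
    mean = sum(ic) / n
    c = [v - mean for v in ic]
    return sum(c[i] * c[j] for i in range(n) for j in dep[i]) / n ** 2

def wald_ci(estimate, variance, z=1.96):
    # Guard against a negative plug-in value, which can occur in finite samples.
    half = z * math.sqrt(max(variance, 0.0))
    return estimate - half, estimate + half

# Toy example: two pairs of friends whose influence curve values move together.
alters = [{1}, {0}, {3}, {2}]
ic = [1.0, 1.0, -1.0, -1.0]
dep = dependence_sets(alters)
var_naive = iid_variance(ic)
var_dep = dependent_ic_variance(ic, dep)
```

In this toy example the positively correlated pairs double the variance relative to the naive estimate, so a confidence interval built from the naive estimate would be too narrow, in line with the under-coverage seen for “i.i.d. Var”.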
Code for all simulations is available in a GitHub repository (github.com/osofr/Ogburn_etal_simulations).
6. Data Analysis
We reanalyze an influential study that used the partially reconstructed social network of FHS study participants in order to study peer effects for obesity (Christakis and Fowler, 2007; hereafter CF). To assess peer influence for obesity using FHS data, CF fit longitudinal logistic regression models of each individual’s obesity status at exam k = 2, 3, 4, 5, 6, 7 onto each of the individual’s social contacts’ obesity statuses at exams k and k − 1, with a separate entry into the model for each contact, controlling for individual covariates and for the node’s own obesity status at exam k − 1. They used generalized estimating equations to account for correlation within individuals over time, but their model assumes independence across individuals. CF fit this model separately for ten different types of social connections, including siblings, spouses, and immediate neighbors, with estimates of the increased risk of obesity ranging from 27% to 171%, many of which were statistically significant. In contrast with our approach, in which each ego is treated as a single (possibly dependent) observation, the pairwise approach that treats each network tie as an independent observation can result in incoherent models for the full network (Lyons, 2011; Ogburn and VanderWeele, 2014). Furthermore, Lee and Ogburn (2020) found evidence of significant network dependence across observations, suggesting that even if the model were coherent the analysis would be invalid due to unaccounted statistical dependence. However, until now no method has been available to reanalyze these data taking into account the network structure and corresponding causal and statistical dependence.
We reanalyzed data from the first two exams, using all ten types of social connections simultaneously (n = 3766). The full R code for this analysis is available in a github repository (github.com/osofr/Ogburn_etal_simulations); public versions of FHS data through 2008 are available from the dbGaP database. Instead of specifying pairwise models and treating each pair (i.e. each network tie) as an independent observation, our methods account for the entire social network structure and allow for considerable causal and statistical dependence among subjects. For each subject i we specified m(Vi, Wi) to be the regression model used in CF (2007), but with proportion of obese alters replacing the indicator that a single friend is obese at each visit. That is, we specified that the expected probability of obesity for subject i at visit 2 is a function of the proportion of i’s alters who were obese at visit 2 (this is the exposure of interest), subject i’s obesity at visit 1, the proportion of i’s alters who were obese at visit 1, and subject i-specific covariates age, sex, and education. CF argue that controlling for alters’ obesity status at visit 1 controls for confounding due to homophily. It is more likely that confounding due to homophily cannot be controlled using these data (Cohen-Cole and Fletcher, 2008; Noel and Nyhan, 2011; Shalizi and Thomas, 2011) and we do not purport to be estimating a true, unconfounded causal effect. However, under CF’s assumption of unconfoundedness, we can estimate the expected proportion of subjects who would be obese at visit 2 under various hypothetical interventions on each subject’s alters’ obesity statuses.
The pairwise parameter that CF estimated is not well-defined in a model that accounts for more than one tie simultaneously. Instead we estimated the expected probability of obesity at visit 2 under a hypothetical intervention to increase the number of each subject’s obese alters by 1; this dynamic intervention is well-defined under the assumption that a node’s alters’ obesity has an effect only through a parsimonious summary measure sX. This intervention is similar to CF’s pairwise parameter in that it estimates the effect of a single alter’s change in obesity status. The observed empirical probability of obesity at visit 2 was 0.137. The predicted outcome under intervention was identical up to three decimal places with a 95% parametric bootstrap confidence interval of (0.127, 0.147). We re-ran the CF analysis using only outcome data from the second visit in order to more closely track our own analysis. We then used the single time point CF logistic regression to predict the outcome under an intervention to set the alter in each ego-alter pair to have obesity status 1; this corresponds more closely (though not directly) to our causal estimand. Compared to the observed empirical probability of obesity of 0.137, the predicted outcome under intervention was 0.147 with a 95% bootstrap confidence interval of (0.138, 0.156). Like the original CF analyses, this seems to support the claim of peer effects for obesity. The non-null point estimate could be due in part to spurious associations due to dependence (Lee and Ogburn, 2020), and the slightly shorter confidence interval compared to our analysis could be due to underestimated variance.
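To make the dynamic intervention concrete, the sketch below shifts each subject’s number of obese alters up by one, recomputes the proportion of obese alters, and averages predicted probabilities from a logistic model. It is a plug-in illustration under stated assumptions, not the TMLE used in our analysis: the coefficient values, field names, and the handling of subjects with no alters or with all alters already obese are hypothetical choices made only for this example.

```python
import math

def shifted_proportion(n_obese_alters, n_alters):
    """Proportion of obese alters after adding one obese alter.
    Assumptions: capped at all alters obese; subjects with no alters unchanged."""
    if n_alters == 0:
        return 0.0
    return min(n_obese_alters + 1, n_alters) / n_alters

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_predicted_outcome(rows, coef, shift=False):
    """Average predicted probability of obesity at visit 2, with or without
    the hypothetical shift in the exposure summary."""
    total = 0.0
    for r in rows:
        if shift:
            prop = shifted_proportion(r["obese_alters"], r["alters"])
        else:
            prop = r["obese_alters"] / r["alters"] if r["alters"] else 0.0
        linear = (coef["intercept"]
                  + coef["prop_obese_alters"] * prop
                  + coef["own_obesity_visit1"] * r["obese_visit1"])
        total += expit(linear)
    return total / len(rows)

# Hypothetical toy data and coefficients (not FHS data or fitted values).
rows = [
    {"obese_alters": 1, "alters": 4, "obese_visit1": 0},
    {"obese_alters": 0, "alters": 0, "obese_visit1": 1},
    {"obese_alters": 2, "alters": 2, "obese_visit1": 0},
]
null_coef = {"intercept": -2.0, "prop_obese_alters": 0.0, "own_obesity_visit1": 3.0}
peer_coef = {"intercept": -2.0, "prop_obese_alters": 1.0, "own_obesity_visit1": 3.0}
```

When the coefficient on the proportion of obese alters is zero, the shifted and unshifted averages coincide, which is the pattern our reanalysis found (a predicted outcome matching the observed 0.137); a positive coefficient yields a higher average under the intervention.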
In summary, our analysis is consistent with the hypothesis that the significant results of CF are spurious, due to dependence and/or model misspecification rather than true associations or effects. We caution against interpreting our estimates as true causal effects, both because of unobserved confounding in the FHS data and because the exposure was measured at the same time as the outcome. However, this is still an instructive comparison between our methods and the naive methods that are currently in common use. Accounting for the interdependence of the subjects in the FHS data undermines the findings of strong contagion effects for obesity.
7. Conclusion
We proposed new methods that allow for causal and statistical inference using data from a single interconnected social network, with causal and statistical dependence informed by network ties. In contrast to existing methods, our methods do not require randomization of an exogenous treatment and they have proven performance under asymptotic regimes in which the number of network ties grows (slowly) with sample size. In the absence of appropriate methods for assessing peer effects researchers have routinely relied on naive methods developed for independent units, and our analysis of peer effects for obesity in the Framingham Heart Study illustrates the dangers of that approach and the importance of new methods like ours.
Supplementary Material
Acknowledgements
The authors are grateful to Caleb Miles, Eric Tchetgen Tchetgen and Victor De Gruttola for helpful comments. Elizabeth L. Ogburn was supported by ONR grants N000141512343 and N000141812760. Oleg Sofrygin and Mark van der Laan were supported by NIH grant R01 AI074345-07.
References
- Ali MM and Dwyer DS “Social network effects in alcohol consumption among adolescents.” Addictive Behaviors, 35(4):337–342 (2010).
- Barabási A-L and Albert R “Emergence of scaling in random networks.” Science, 286(5439):509–512 (1999).
- Benkeser D, Ju C, Lendle S, and van der Laan M “Online cross-validation-based ensemble learning.” Statistics in Medicine, 37(2):249–260 (2018).
- Bibaut A, Petersen M, Vlassis N, Dimakopoulou M, and van der Laan M “Sequential causal inference in a single world of connected units.” arXiv preprint arXiv:2101.07380 (2021).
- Bickel PJ, Klaassen CA, Ritov Y, and Wellner JA Efficient and adaptive estimation for semiparametric models, volume 2. Springer; New York: (1998).
- Cacioppo JT, Fowler JH, and Christakis NA “Alone in the crowd: the structure and spread of loneliness in a large social network.” Journal of Personality and Social Psychology, 97(6):977 (2009).
- Caron F and Fox EB “Sparse graphs using exchangeable random measures.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1295–1366 (2017).
- Christakis N and Fowler J “The spread of obesity in a large social network over 32 years.” New England Journal of Medicine, 357(4):370–379 (2007).
- —. “The collective dynamics of smoking in a large social network.” New England Journal of Medicine, 358(21):2249–2258 (2008).
- —. “Social network sensors for early detection of contagious outbreaks.” PLoS ONE, 5(9):e12948 (2010).
- Cohen-Cole E and Fletcher J “Is obesity contagious? Social networks vs. environmental factors in the obesity epidemic.” Journal of Health Economics, 27(5):1382–1387 (2008).
- Fowler JH and Christakis NA “Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study.” BMJ, 337:a2338 (2008).
- Haneuse S and Rotnitzky A “Estimation of the effect of interventions that modify the received treatment.” Statistics in Medicine, 32(30):5260–5277 (2013).
- Harling G, Wang R, Onnela J-P, and DeGruttola V “Leveraging Contact Network Structure in the Design of Cluster Randomized Trials.” Harvard University Biostatistics Working Paper Series, (Working Paper 199) (2016).
- Hubbard AE, Kherad-Pajouh S, and van der Laan MJ “Statistical inference for data adaptive target parameters.” The International Journal of Biostatistics, 12(1):3–19 (2016).
- Hudgens M and Halloran M “Toward causal inference with interference.” Journal of the American Statistical Association, 103(482):832–842 (2008).
- Janssen A and Ostrovski V “The convolution theorem of Hájek and Le Cam-revisited.” arXiv preprint arXiv:1309.4984 (2013).
- Kennedy EH “Semiparametric theory and empirical processes in causal inference.” In Statistical causal inferences and their applications in public health research, 141–167. Springer; (2016).
- Lahiri SN Resampling methods for dependent data. New York: Springer; (2003).
- Lauritzen SL and Richardson TS “Chain graph models and their causal interpretations.” Journal of the Royal Statistical Society: Series B, 64(3):321–348 (2002).
- Lee Y and Ogburn EL “Network dependence can lead to spurious associations and invalid inference.” Journal of the American Statistical Association, 1–15 (2020).
- Lyons R “The spread of evidence-poor medicine via flawed social-network analysis.” Statistics, Politics, and Policy, 2(1) (2011).
- Madan A, Moturu ST, Lazer D, and Pentland AS “Social sensing: obesity, unhealthy eating and exercise in face-to-face networks.” In Wireless Health 2010, 104–110. ACM; (2010).
- McNeney B and Wellner JA “Application of convolution theorems in semiparametric models with non-iid data.” Journal of Statistical Planning and Inference, 91(2):441–480 (2000).
- Muñoz ID and van der Laan M “Population intervention causal effects based on stochastic interventions.” Biometrics, 68(2):541–549 (2012).
- Muñoz ID and van der Laan MJ “Super learner based conditional density estimation with application to marginal structural models.” The International Journal of Biostatistics, 7(1):1–20 (2011).
- Newman M Networks: an introduction. Oxford: Oxford University Press; (2009).
- Newman ME and Park J “Why social networks are different from other types of networks.” Physical Review E, 68(3):036122 (2003).
- Noel H and Nyhan B “The “unfriending” problem: The consequences of homophily in friendship retention for causal estimates of social influence.” Social Networks, 33(3):211–218 (2011).
- Ogburn EL, Shpitser I, and Lee Y “Causal inference, social networks and chain graphs.” Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(4):1659–1676 (2020).
- Ogburn EL and VanderWeele TJ “Vaccines, Contagion, and Social Networks.” arXiv preprint arXiv:1403.1241 (2014).
- Ogburn EL, VanderWeele TJ, et al. “Causal diagrams for interference.” Statistical Science, 29(4):559–578 (2014).
- Pearl J “The causal foundations of structural equation modeling.” Technical report, University of California, Los Angeles, Department of Computer Science (2012).
- Rosenquist JN, Murabito J, Fowler JH, and Christakis NA “The spread of alcohol consumption behavior in a large social network.” Annals of Internal Medicine, 152(7):426–433 (2010).
- Shalizi C and Thomas A “Homophily and contagion are generically confounded in observational social network studies.” Sociological Methods & Research, 40(2):211–239 (2011).
- Sofrygin O, Neugebauer R, and van der Laan MJ “Conducting Simulations in Causal Inference with Networks-Based Structural Equation Models.” arXiv preprint arXiv:1705.10376 (2017).
- Sofrygin O, Ogburn EL, and van der Laan MJ “Single Time Point Interventions in Network-Dependent Data.” In Targeted Learning in Data Science, 373–396. Springer; (2018).
- Sofrygin O and van der Laan MJ “tmlenet: Targeted Maximum Likelihood Estimation for Network Data.” R package version 0.1.0 (2015).
- —. “Semi-parametric estimation and inference for the mean outcome of the single time-point intervention in a causally connected population.” Journal of Causal Inference, 5(1) (2017).
- Stein C “A bound for the error in the normal approximation to the distribution of a sum of dependent random variables.” In Proc. Sixth Berkeley Symp. Math. Stat. Prob, 583–602 (1972).
- Tchetgen Tchetgen EJ, Fulcher IR, and Shpitser I “Auto-g-computation of causal effects on a network.” Journal of the American Statistical Association, 1–12 (2020).
- Tchetgen Tchetgen EJ and VanderWeele T “On causal inference in the presence of interference.” Statistical Methods in Medical Research, 21(1):55–75 (2012).
- Toulis P, Volfovsky A, and Airoldi EM “Propensity score methodology in the presence of network entanglement between treatments.” arXiv preprint arXiv:1801.07310 (2018).
- Trogdon JG, Nonnemaker J, and Pais J “Peer effects in adolescent overweight.” Journal of Health Economics, 27(5):1388–1399 (2008).
- van der Laan MJ “Causal Inference for a Population of Causally Connected Units.” Journal of Causal Inference, 0(0):2193–3677 (2014).
- van der Laan MJ and Rose S Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media; (2011).
- van der Laan MJ, Polley EC, et al. “Super learner.” Statistical Applications in Genetics and Molecular Biology, 6(1):1–23 (2007).
- van der Vaart AW and Wellner JA “Convolution and Minimax Theorems.” In Weak Convergence and Empirical Processes, 412–422. Springer; (1996a).
- —. “Weak Convergence.” In Weak Convergence and Empirical Processes, 16–28. Springer; (1996b).
- Wasserman S “Comment on “Social contagion theory: Examining dynamic social networks and human behavior” by Nicholas Christakis and James Fowler.” Statistics in Medicine, 32(4):578–580 (2013).
- Watts DJ and Strogatz SH “Collective dynamics of small-world networks.” Nature, 393(6684):440–442 (1998).
- Young JG, Hernán MA, and Robins JM “Identification, Estimation and Approximation of Risk under Interventions that Depend on the Natural Value of Treatment Using Observational Data.” Epidemiologic Methods, 3(1):1–19 (2014).