Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm

Hadi Daneshmand; Manuel Gomez-Rodriguez; Le Song; Bernhard Schölkopf

Author manuscript; available in PMC: 2015 Apr 28.

Published in final edited form as: JMLR Workshop Conf Proc. 2014 Jun;32(2):793–801.

PMCID: PMC4412853 NIHMSID: NIHMS680553 PMID: 25932466

Abstract

Information spreads across social and technological networks, but often the network structures are hidden from us and we only observe the traces left by the diffusion processes, called cascades. Can we recover the hidden network structures from these observed cascades? What kind of cascades and how many cascades do we need? Are there some network structures which are more difficult than others to recover? Can we design efficient inference algorithms with provable guarantees?

Despite the increasing availability of cascade data and methods for inferring networks from these data, a thorough theoretical understanding of the above questions is still largely missing from the literature. In this paper, we investigate the network structure inference problem for a general family of continuous-time diffusion models using an $ℓ_{1}$-regularized likelihood maximization framework. We show that, as long as the cascade sampling process satisfies a natural incoherence condition, our framework can recover the correct network structure with high probability if we observe O(d³ log N) cascades, where d is the maximum number of parents of a node and N is the total number of nodes. Moreover, we develop a simple and efficient soft-thresholding inference algorithm, which we use to illustrate the consequences of our theoretical results, and show that our framework outperforms other alternatives in practice.

1. Introduction

Diffusion of information, behaviors, diseases, or more generally, contagions can be naturally modeled as a stochastic process that occurs over the edges of an underlying network (Rogers, 1995). In this scenario, we often observe the temporal traces that the diffusion generates, called cascades, but the edges of the network that gave rise to the diffusion remain unobservable (Adar & Adamic, 2005). For example, blogs or media sites often publish a new piece of information without explicitly citing their sources. Marketers may note when a social media user decides to adopt a new behavior but cannot tell which neighbor in the social network influenced them to do so. Epidemiologists observe when a person gets sick but usually cannot tell who infected her. In all these cases, given a set of cascades and a diffusion model, the network inference problem consists of inferring the edges (and model parameters) of the unobserved underlying network (Gomez-Rodriguez, 2013).

The network inference problem has attracted significant attention in recent years (Saito et al., 2009; Gomez-Rodriguez et al., 2010; 2011; Snowsill et al., 2011; Du et al., 2012a), since it is essential to reconstruct and predict the paths over which information can spread, and to maximize sales of a product or stop infections. Most previous work has focused on developing network inference algorithms and evaluating their performance experimentally on different synthetic and real networks, and a rigorous theoretical analysis of the problem has been missing. However, such analysis is of outstanding interest since it would enable us to answer many fundamental open questions. For example, which conditions are sufficient to guarantee that we can recover a network given a large number of cascades? If these conditions are satisfied, how many cascades are sufficient to infer the network with high probability? Until recently, there has been a paucity of work along this direction (Netrapalli & Sanghavi, 2012; Abrahao et al., 2013), and the existing works provide only partial views of the problem. None of them identifies the recovery condition relating to the interaction between the network structure and the cascade sampling process, which we will make precise in our paper.

Overview of results

We consider the network inference problem under the continuous-time diffusion model recently introduced by Gomez-Rodriguez et al. (2011). We identify a natural incoherence condition for such a model which depends on the network structure, the diffusion parameters and the sampling process of the cascades. This condition captures the intuition that we can recover the network structure if the co-occurrence of a node and its non-parent nodes is small in the cascades. Furthermore, we show that, if this condition holds for the population case, we can recover the network structure using an $ℓ_{1}$-regularized maximum likelihood estimator and O(d³ log N) cascades, and the probability of success approaches 1 at a rate exponential in the number of cascades. Importantly, if this condition also holds for the finite sample case, then the guarantee can be improved to O(d² log N) cascades. Beyond theoretical results, we also propose a new, efficient and simple proximal gradient algorithm to solve the $ℓ_{1}$-regularized maximum likelihood estimation. The algorithm is especially well-suited for our problem since it is highly scalable and naturally finds sparse estimators, as desired, by using soft-thresholding. Using this algorithm, we perform various experiments illustrating the consequences of our theoretical results and demonstrating that it typically outperforms other state-of-the-art algorithms.

Related work

Netrapalli & Sanghavi (2012) propose a maximum likelihood network inference method for a variation of the discrete-time independent cascade model (Kempe et al., 2003) and show that, for general networks satisfying a correlation decay, the estimator recovers the network structure given O(d² log N) cascades, and the probability of success approaches 1 at a rate exponential in the number of cascades. The rate they obtained is on a par with our results. However, their discrete diffusion model is less realistic in practice, and the correlation decay condition is rather restrictive: essentially, on average each node can only infect one single node per cascade. Instead, we use a general continuous-time diffusion model (Gomez-Rodriguez et al., 2011), which has been extensively validated on real diffusion data and extended in various ways by different authors (Wang et al., 2012; Du et al., 2012a;b).

Abrahao et al. (2013) propose a simple network inference method, First-Edge, for a slightly different continuous-time independent cascade model (Gomez-Rodriguez et al., 2010), and show that, for general networks, if the cascade sources are chosen uniformly at random, the algorithm needs O(Nd log N) cascades to recover the network structure and the probability of success approaches 1 only at a rate polynomial in the number of cascades. Additionally, they study trees and bounded-degree networks and show that, if the cascade sources are chosen uniformly at random, the error decreases polynomially as long as O(log N) and (d⁹ log² d log N) cascades are recorded, respectively. In our work, we show that, for general networks satisfying a natural incoherence condition, our method outperforms the First-Edge algorithm and the algorithm for bounded-degree networks in terms of rate and sample complexity.

Gripon & Rabbat (2013) propose a network inference method for unordered cascades, in which nodes that are infected together in the same cascade are connected by a path containing exactly the nodes in the trace, and give necessary and sufficient conditions for network inference. However, they consider a restrictive, unrealistic scenario in which cascades are all three nodes long.

2. Continuous-Time Diffusion Model

In this section, we revisit the continuous-time generative model for cascade data introduced by Gomez-Rodriguez et al. (2011). The model associates each edge j → i with a transmission function, f(t_i|t_j; α_ji) = f(t_i − t_j; α_ji), a density over time parameterized by α_ji. This is in contrast to previous discrete-time models, which associate each edge with a fixed infection probability (Kempe et al., 2003). Moreover, it also differs from discrete-time models in the sense that events in a cascade are not generated iteratively in rounds; instead, event timings are sampled directly from the transmission functions in the continuous-time model.

2.1. Cascade generative process

Given a directed contact network, $G = (V, E)$ with N nodes, the process begins with an infected source node, s, drawn from a source distribution $P (s)$, which initially adopts a certain contagion at time zero. The contagion is transmitted from the source along her outgoing edges to her direct neighbors. Each transmission through an edge j → i entails a random transmission time, τ = t_i − t_j, drawn from an associated transmission function f(τ; α_ji). We assume transmission times are independent, possibly distributed differently across edges, and, in some cases, can be arbitrarily large, τ → ∞. Then, the infected neighbors transmit the contagion to their respective neighbors, and the process continues. We assume that an infected node remains infected for the entire diffusion process. Thus, if a node i is infected by multiple neighbors, only the neighbor that first infects node i will be the true parent. Figure 1 illustrates the process.

Figure 1. The diffusion network structure (left) is unknown and we only observe cascades, which are N-dimensional vectors recording the times when nodes get infected by contagions that spread (right). Cascade 1 is (t_a, t_b, t_c, ∞, ∞, ∞), where t_a < t_c < t_b, and cascade 2 is (∞, t_b, ∞, t_d, t_e, t_f), where t_b < t_d < t_e < t_f. Each cascade contains a source node (dark red), drawn from a source distribution $P (s)$, as well as infected (light red) and uninfected (white) nodes, and it provides information on black and dark gray edges but not on light gray edges.

Observations from the model are recorded as a set Cⁿ of cascades {t¹, . . . , tⁿ}. Each cascade t^c is an N-dimensional vector $t^{c} ≔ (t_{1}^{c}, \dots, t_{N}^{c})$ recording when nodes are infected, $t_{k}^{c} \in [0, T^{c}] \cup \{\infty\}$. The symbol ∞ labels nodes that are not infected during the observation window [0, T^c]; it does not imply that they are never infected. The 'clock' is reset to 0 at the start of each cascade. We assume T^c = T for all cascades; the results generalize trivially.
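The generative process of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: transmission times are taken to be exponential, the adjacency structure `rates` and the function name `simulate_cascade` are hypothetical, and the three-node chain is a made-up example. Because each node keeps the earliest infection time over all incoming transmissions, sampling a cascade amounts to a shortest-path computation with randomly drawn edge lengths.

```python
import heapq
import math
import random


def simulate_cascade(rates, source, T, rng):
    """Sample one cascade; return infection times (math.inf if not infected by T).

    rates[j] maps each child i of node j to an exponential rate alpha_ji > 0
    (an assumed transmission function; the model allows other densities).
    """
    n = len(rates)
    times = [math.inf] * n
    times[source] = 0.0
    finalized = [False] * n
    queue = [(0.0, source)]
    while queue:
        t_j, j = heapq.heappop(queue)
        if finalized[j]:
            continue  # stale queue entry for an already processed node
        finalized[j] = True
        for i, alpha_ji in rates[j].items():
            if finalized[i]:
                continue  # i was infected earlier by another node
            # Each edge is visited once, so it gets one independent draw.
            t_i = t_j + rng.expovariate(alpha_ji)
            if t_i < times[i]:
                times[i] = t_i
                heapq.heappush(queue, (t_i, i))
    # Nodes infected after the observation window are recorded as uninfected.
    return [t if t <= T else math.inf for t in times]


rng = random.Random(0)
chain = [{1: 1.0}, {2: 2.0}, {}]  # hypothetical chain 0 -> 1 -> 2
print(simulate_cascade(chain, source=0, T=10.0, rng=rng))
```

Drawing the source from $P (s)$ and calling the function n times would produce a cascade set Cⁿ in the format described above.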

2.2. Likelihood of a cascade

Gomez-Rodriguez et al. (2011) showed that the likelihood of a cascade t under the continuous-time independent cascade model is

f (t; A) = \prod_{t_{i} \leq T} \prod_{t_{m} > T} S (T ∣ t_{i}; α_{i m}) \times \prod_{k : t_{k} < t_{i}} S (t_{i} ∣ t_{k}; α_{k i}) \sum_{j : t_{j} < t_{i}} H (t_{i} ∣ t_{j}; α_{j i}),

(1)

where A = {α_ji} denotes the collection of parameters, $S (t_{i} ∣ t_{j}; α_{j i}) = 1 - \int_{t_{j}}^{t_{i}} f (t ∣ t_{j}; α_{j i}) \, dt$ is the survival function and H(t_i|t_j; α_ji) = f(t_i|t_j; α_ji)/S(t_i|t_j; α_ji) is the hazard function. The survival terms in the first product account for the probability that the uninfected nodes survive all infected nodes in the cascade up to T, and the survival and hazard terms in the remaining factors account for the likelihood of the infected nodes. Then, assuming cascades are sampled independently, the likelihood of a set of cascades is the product of the likelihoods of individual cascades given by Eq. 1. For notational simplicity, we define y(t_i | t_k; α_ki) := log S(t_i|t_k; α_ki), and $h (t; α_{i}) ≔ \sum_{k : t_{k} \leq t_{i}} H (t_{i} ∣ t_{k}; α_{k i})$ if t_i ≤ T and 0 otherwise.

3. Network Inference Problem

Consider an instance of the continuous-time diffusion model defined above with a contact network $G^{*} = (V^{*}, E^{*})$ and associated parameters $\{α_{i j}^{*}\}$. We denote the set of parents of node i as $N^{-} (i) = \{j \in V^{*} : α_{j i}^{*} > 0\}$ with cardinality $d_{i} = ∣ N^{-} (i) ∣$ and the minimum positive transmission rate as $α_{\min, i}^{*} = \min_{j : α_{j i}^{*} > 0} α_{j i}^{*}$. Let Cⁿ be a set of n cascades sampled from the model, where the source $s \in V^{*}$ of each cascade is drawn from a source distribution $P (s)$. Then, the network inference problem consists of finding the directed edges and the associated parameters using only the temporal information from the set of cascades Cⁿ.

This problem has been cast as a maximum likelihood estimation problem (Gomez-Rodriguez et al., 2011)

\begin{matrix} {minimize}_{A} - \frac{1}{n} \sum_{c \in C^{n}} \log f (t^{c}; A) \\ subject to α_{j i} \geq 0, i, j = 1, \dots, N, i \neq j, \end{matrix}

(2)

where the inferred edges in the network correspond to those pairs of nodes with non-zero parameters, i.e. ${\hat{α}}_{j i} > 0$ .

In fact, the problem in Eq. 2 decouples into a set of independent smaller subproblems, one per node, where we infer the parents of each node and the parameters associated with these incoming edges. Without loss of generality, for a particular node i, we solve the problem

\begin{matrix} {minimize}_{α_{i}} ℓ^{n} (α_{i}) \\ subject to α_{j i} \geq 0, j = 1, \dots, N, i \neq j, \end{matrix}

(3)

where α_i := {α_ji | j = 1, . . . , N, i ≠ j} are the relevant variables, and $ℓ^{n} (α_{i}) = - \frac{1}{n} \sum_{c \in C^{n}} g_{i} (t^{c}; α_{i})$ corresponds to the terms in Eq. 2 involving α_i (see also Table 1 for the definition of g_i(· ; α_i)). In this subproblem, we only need to consider a super-neighborhood $V_{i} = R_{i} \cup U_{i}$ of i with cardinality $p_{i} = ∣ V_{i} ∣ \leq N$, where $R_{i}$ is the set of upstream nodes from which i is reachable and $U_{i}$ is the set of nodes which are reachable from at least one node $j \in R_{i}$. Here, we consider a node i to be reachable from a node j if and only if there is a directed path from j to i. We can skip all nodes in $V \setminus V_{i}$ from our analysis because they will never be infected in a cascade before i, and thus the maximum likelihood estimates of the associated transmission rates will always be zero (and correct).

Table 1.

Functions.

Function	Infected node (t_i < T)	Uninfected node (t_i > T)
g_i(t; α)	log h(t; α) + ∑_{j: t_j < t_i} y(t_i|t_j; α_j)	∑_{j: t_j < T} y(T|t_j; α_j)
[∇y(t; α)]_k	−y′(t_i|t_k; α_k)	−y′(T|t_k; α_k)
[D(t; α)]_kk	−y″(t_i|t_k; α_k) − h(t; α)⁻¹ H″(t_i|t_k; α_k)	−y″(T|t_k; α_k)
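As a concrete instance of the first row of Table 1, the sketch below evaluates g_i for the exponential transmission model, for which S(t_i|t_k; α_k) = exp(−α_k (t_i − t_k)), so y = −α_k (t_i − t_k) and H = α_k. The function name `g_exponential` and the toy inputs are hypothetical, and the choice to return −∞ when an infected node has zero total hazard (for example when i is itself the source) is an assumption of this sketch rather than something the paper specifies.

```python
import math


def g_exponential(times, i, alpha, T):
    """g_i(t; alpha) from Table 1, specialised to exponential transmission.

    times[k]: infection time of node k (math.inf if not infected by T).
    alpha[k]: candidate rate for edge k -> i (alpha[i] is ignored).
    """
    t_i = times[i]
    infected = t_i <= T
    horizon = t_i if infected else T
    log_survival = 0.0  # sum of y = log S = -alpha_k * (horizon - t_k)
    hazard = 0.0        # h = sum of H = alpha_k over earlier-infected nodes
    for k, t_k in enumerate(times):
        if k == i or not t_k < horizon:
            continue
        log_survival -= alpha[k] * (horizon - t_k)
        hazard += alpha[k]
    if not infected:
        return log_survival
    if hazard <= 0.0:
        return -math.inf  # assumption: no possible parent, zero likelihood
    return math.log(hazard) + log_survival


# Toy cascade: node 0 infected at 0, node 1 at 1, node 2 never infected.
cascade = [0.0, 1.0, math.inf]
print(g_exponential(cascade, i=1, alpha=[2.0, 0.0, 3.0], T=5.0))
```

Averaging −g_i over the cascades gives the per-node objective $ℓ^{n} (α_{i})$ used in Eq. 3.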


Below, we show that, as n → ∞, the solution, ${\hat{α}}_{i}$, of the problem in Eq. 3 is a consistent estimator of the true parameter $α_{i}^{*}$. However, it is not clear whether it is possible to recover the true network structure with this approach given a finite number of cascades and, if so, how many cascades are needed. We will show that by adding an $ℓ_{1}$-regularizer to the objective function and solving instead the following optimization problem

\begin{matrix} {minimize}_{α_{i}} ℓ^{n} (α_{i}) + λ_{n} {‖ α_{i} ‖}_{1} \\ subject to α_{j i} \geq 0, j = 1, \dots, N, i \neq j, \end{matrix}

(4)

we can provide finite sample guarantees for recovering the network structure (and parameters). Our analysis also shows that by selecting an appropriate value for the regularization parameter λ_n, the solution of Eq. 4 successfully recovers the network structure with probability approaching 1 exponentially fast in n.
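To make Eq. 4 concrete, here is a minimal proximal gradient sketch with soft-thresholding for the exponential transmission model. It is an illustration under stated assumptions, not the authors' algorithm: the function name `fit_parents`, the input layout (`deltas`, `masks`), the all-ones starting point and the backtracking rule are all choices made for this sketch. For one cascade, `deltas[c][k]` is the time node i was exposed to node k (zero if k was not infected before i or before T), and `masks[c]` marks the nodes infected before i, or is `None` if i was not infected. Cascades in which i is the source are assumed to have been removed.

```python
import math


def fit_parents(deltas, masks, lam, iters=500, tol=1e-10):
    """Minimise l^n(alpha) + lam * ||alpha||_1 subject to alpha >= 0."""
    n, p = len(deltas), len(deltas[0])
    mean_delta = [sum(d[k] for d in deltas) / n for k in range(p)]
    infected = [m for m in masks if m is not None]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def loss(alpha):  # smooth part l^n(alpha); +inf outside its domain
        total = dot(alpha, mean_delta)
        for m in infected:
            h = dot(alpha, m)
            if h <= 0.0:
                return math.inf
            total -= math.log(h) / n
        return total

    def gradient(alpha):
        grad = list(mean_delta)
        for m in infected:
            h = dot(alpha, m)
            for k in range(p):
                grad[k] -= m[k] / (h * n)
        return grad

    alpha = [1.0] * p
    for _ in range(iters):
        f_old, grad, step = loss(alpha), gradient(alpha), 1.0
        for _ in range(60):  # backtracking on the step size
            # Gradient step followed by soft-thresholding at zero.
            new = [max(a - step * (g + lam), 0.0) for a, g in zip(alpha, grad)]
            diff = [b - a for a, b in zip(alpha, new)]
            bound = f_old + dot(grad, diff) + dot(diff, diff) / (2.0 * step)
            if loss(new) <= bound:
                break
            step *= 0.5
        alpha = new
        if max(abs(d) for d in diff) < tol:
            break
    return alpha


# Toy data: candidate 0 always precedes node i by one time unit,
# candidate 1 is never infected before i.
print(fit_parents([[1.0, 0.0]] * 4, [[1, 0]] * 4, lam=0.1))
```

In the toy run the second coordinate is driven exactly to zero by the thresholding step, which is how the regularizer produces a sparse parent set.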

In the remainder of the paper, we will focus on estimating the parent nodes of a particular node i. For simplicity, we will use α = α_i, α_j = α_ji, $N^{-} = N^{-} (i)$, $R = R_{i}$, $U = U_{i}$, d = d_i, p = p_i and $α_{\min}^{*} = α_{\min, i}^{*}$.

4. Consistency

Can we recover the hidden network structures from the observed cascades?

The answer is yes. We will show this by proving that the estimator provided by Eq. 3 is consistent, meaning that as the number of cascades goes to infinity, we can always recover the true network structure.

More specifically, Gomez-Rodriguez et al. (2011) showed that the network inference problem defined in Eq. 3 is convex in α if the survival functions are log-concave and the hazard functions are concave in α. Under these conditions, the Hessian matrix, $Q^{n} = \nabla^{2} ℓ^{n} (α)$ , can be expressed as the sum of a nonnegative diagonal matrix Dⁿ and the outer product of a matrix Xⁿ(α) with itself, i.e.,

Q^{n} = D^{n} (α) + \frac{1}{n} X^{n} (α) {[X^{n} (α)]}^{T} .

(5)

Here the diagonal matrix $D^{n} (α) = \frac{1}{n} \sum_{c} D (t^{c}; α)$ is a sum over a set of diagonal matrices D(t^c; α), one for each cascade c (see Table 1 for the definition of its entries); and Xⁿ(α) is the Hazard matrix

X^{n} (α) = [X (t^{1}; α) ∣ X (t^{2}; α) ∣ \dots ∣ X (t^{n}; α)],

(6)

with each column X(t^c; α) := h(t^c; α)⁻¹ ∇_α h(t^c; α). Intuitively, the Hessian matrix captures the co-occurrence information of nodes in cascades. Then, we can prove:

Theorem 1

If the source probability $P (s)$ is strictly positive for all $s \in R$, then the maximum likelihood estimator $\hat{α}$ given by the solution of Eq. 3 is consistent.

Proof

We check the three criteria for consistency: continuity, compactness and identification of the objective function (Newey & McFadden, 1994). Continuity is obvious. For compactness, since L → −∞ both as α_ij → 0 and as α_ij → ∞ for all i, j, we lose nothing by imposing upper and lower bounds, thus restricting the parameters to a compact subset. For the identification condition, $α \neq α^{*} \Rightarrow ℓ^{n} (α) \neq ℓ^{n} (α^{*})$, we use Lemmas 9 and 10 (refer to Appendices A and B), which establish that Xⁿ(α) has full row rank as n → ∞, and hence $Q^{n}$ is positive definite.
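The decomposition in Eq. 5 can be illustrated for the exponential model, where y is linear in α and H(t_i|t_k; α_k) = α_k, so every entry of D in Table 1 vanishes and the Hessian reduces to the outer-product term. The sketch below builds Qⁿ from the hazard vectors X(t^c; α) = h⁻¹ ∇_α h. The function names and the `masks` layout (a 0/1 list marking nodes infected before i, or `None` when i is uninfected) are assumptions of this example.

```python
def hazard_vector(mask, alpha):
    """X(t; alpha) = grad h / h for the exponential model, where h = sum of alpha_k
    over nodes infected before i and grad h is the 0/1 mask itself."""
    h = sum(a * m for a, m in zip(alpha, mask))
    return [m / h for m in mask]


def sample_hessian(masks, alpha):
    """Q^n = (1/n) * sum_c X_c X_c^T; D^n = 0 for exponential transmission.

    Cascades where i is uninfected contribute only linear terms, hence nothing.
    """
    n, p = len(masks), len(alpha)
    Q = [[0.0] * p for _ in range(p)]
    for mask in masks:
        if mask is None:
            continue
        x = hazard_vector(mask, alpha)
        for a in range(p):
            for b in range(p):
                Q[a][b] += x[a] * x[b] / n
    return Q


# Two toy cascades over three candidate parents.
for row in sample_hessian([[1, 1, 0], [1, 0, 0]], alpha=[1.0, 1.0, 1.0]):
    print(row)
```

The third candidate never precedes node i in the toy data, so its row and column of Qⁿ are zero. This is the kind of rank deficiency that the positivity assumption on $P (s)$ in Theorem 1 rules out as n → ∞.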

5. Recovery Conditions

In this section, we will find a set of sufficient conditions on the diffusion model and the cascade sampling process under which we can recover the network structure from finite samples. These results allow us to address two questions:

Are there some network structures which are more difficult than others to recover?
What kind of cascades are needed for the network structure recovery?

The answers to these questions are intertwined. The difficulty of finite-sample recovery depends crucially on an incoherence condition which is a function of both network structure, parameters of the diffusion model and the cascade sampling process. Intuitively, the sources of the cascades in a diffusion network have to be chosen in such a way that nodes without parent-child relation should co-occur less often compared to nodes with such relation. Many commonly used diffusion models and network structures can be naturally made to satisfy this condition.

More specifically, we first place two conditions on the Hessian of the population log-likelihood, $E_{c} [ℓ^{n} (α)] = E_{c} [\log g (t^{c}; α)]$ , where the expectation here is taken over the distribution $P (s)$ of the source nodes, and the density f(t^c|s) of the cascades t^c given a source node s. In this case, we will further denote the Hessian of $E_{c} [\log g (t^{c}; α)]$ evaluated at the true model parameter α* as $Q^{*}$ . Then, we place two conditions on the Lipschitz continuity of X(t^c; α), and the boundedness of X(t^c; α*) and ∇g(t^c; α*) at the true model parameter α*. For simplicity, we will denote the subset of indexes associated to node i's true parents as S, and its complement as S^c. Then, we use $Q_{S S}^{*}$ to denote the sub-matrix of $Q^{*}$ indexed by S and $α_{S}^{*}$ the set of parameters indexed by S.

Condition 1 (Dependency condition)

There exist constants C_min > 0 and C_max > 0 such that $Λ_{\min} (Q_{S S}^{*}) \geq C_{\min}$ and $Λ_{\max} (Q_{S S}^{*}) \leq C_{\max}$, where Λ_min(·) and Λ_max(·) return the smallest and the largest eigenvalue of their argument, respectively. This assumption ensures that two connected nodes co-occur reasonably frequently in the cascades but are not deterministically related.

Condition 2 (Incoherence condition)

There exists ε ∈ (0, 1] such that ${∣ ∣ ∣ Q_{S^{c} S}^{*} {(Q_{S S}^{*})}^{- 1} ∣ ∣ ∣}_{\infty} \leq 1 - ε$, where ${∣ ∣ ∣ A ∣ ∣ ∣}_{\infty} = \max_{j} \sum_{k} ∣ A_{j k} ∣$. This assumption captures the intuition that node i and any of its neighbors should get infected together in a cascade more often than node i and any of its non-neighbors.
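Given any symmetric matrix Q and a candidate parent set S, the quantity in Condition 2 can be evaluated directly. The sketch below is a plain-Python illustration (the names `solve` and `incoherence_norm` and the 3 × 3 matrix are made up for the example). It computes each row of Q_{S^c S} (Q_{SS})⁻¹ by solving a linear system with Q_{SS}, which relies on Q_{SS} being symmetric, and returns the largest absolute row sum.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("Q_SS is numerically singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def incoherence_norm(Q, S):
    """Max absolute row sum of Q_{S^c S} (Q_{SS})^{-1} for symmetric Q."""
    S = list(S)
    Sc = [j for j in range(len(Q)) if j not in S]
    Q_SS = [[Q[a][b] for b in S] for a in S]
    worst = 0.0
    for a in Sc:
        # Row a of the product equals Q_SS^{-1} applied to Q[a, S] (symmetry).
        row = solve(Q_SS, [Q[a][b] for b in S])
        worst = max(worst, sum(abs(v) for v in row))
    return worst


Q = [[2.0, 0.0, 1.0],
     [0.0, 2.0, 0.5],
     [1.0, 0.5, 3.0]]  # hypothetical Hessian; nodes 0 and 1 are the parents
norm = incoherence_norm(Q, S=[0, 1])
print(norm, "-> condition holds for any eps <= ", 1.0 - norm)
```

A value below 1 means the condition holds with ε equal to one minus that value; a value of 1 or more means the condition fails for this Q and S.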

Condition 3 (Lipschitz Continuity)

For any feasible cascade t^c, the Hazard vector X(t^c; α) is Lipschitz continuous in the domain $\{α : α_{S} \geq α_{\min}^{*} ∕ 2\}$,

{‖ X (t^{c}; β) - X (t^{c}; α) ‖}_{2} \leq k_{1} {‖ β - α ‖}_{2},

where k₁ is some positive constant. As a consequence, the spectral norm of the difference, n^{−1/2}(Xⁿ(β) − Xⁿ(α)), is also bounded (refer to Appendix C), i.e.,

{∣ ∣ ∣ n^{- 1 ∕ 2} (X^{n} (β) - X^{n} (α)) ∣ ∣ ∣}_{2} \leq k_{1} {‖ β - α ‖}_{2} .

(7)

Furthermore, for any feasible cascade t^c, D(t^c; α)_jj is Lipschitz continuous for all $j \in V$,

∣ D {(t^{c}; β)}_{j j} - D {(t^{c}; α)}_{j j} ∣ \leq k_{2} {‖ β - α ‖}_{2},

where k₂ is some positive constant.

Condition 4 (Boundedness)

For any feasible cascade t^c, the absolute value of each entry in the gradient of its log-likelihood and in the Hazard vector, as evaluated at the true model parameter α*, is bounded,

{‖ \nabla g (t^{c}; α^{*}) ‖}_{\infty} \leq k_{3}, {‖ X (t^{c}; α^{*}) ‖}_{\infty} \leq k_{4},

where k₃ and k₄ are positive constants. Then the absolute value of each entry in the Hessian matrix $Q^{*}$ , is also bounded ${∣ ∣ ∣ Q^{*} ∣ ∣ ∣}_{\infty} \leq k_{5}$ .

Remarks for condition 1

As stated in Theorem 1, as long as the source probability $P (s)$ is strictly positive for all $s \in R$, the maximum likelihood formulation is strictly convex and thus there exists C_min > 0 such that $Λ_{\min} (Q^{*}) \geq C_{\min}$. Moreover, condition 4 implies that there exists C_max > 0 such that $Λ_{\max} (Q^{*}) \leq C_{\max}$.

Remarks for condition 2

The incoherence condition depends, in a non-trivial way, on the network structure, diffusion parameters, observation window and source node distribution. Here, we give some intuition by studying three small canonical examples.

First, consider the chain graph in Fig. 2(a) and assume that we would like to find the incoming edges to node 3 when T → ∞. Then, it is easy to show that the incoherence condition is satisfied if (P₀ + P₁)/(P₀ + P₁ + P₂) < 1 − ε and P₀/(P₀ + P₁ + P₂) < 1 − ε, where P_i denotes the probability of a node i to be the source of a cascade. Thus, for example, if the source of each cascade is chosen uniformly at random, the inequality is satisfied. Here, the incoherence condition depends on the source node distribution.

Second, consider the directed tree in Fig. 2(b) and assume that we would like to find the incoming edges to node 0 when T → ∞. Then, it can be shown that the incoherence condition is satisfied as long as (1) P₁ > 0, (2) (P₂ > 0) or (P₅ > 0 and P₆ > 0), and (3) P₃ > 0. As in the chain, the condition depends on the source node distribution.

Finally, consider the star graph in Fig. 2(c), with exponential edge transmission functions, and assume that we would like to find the incoming edges to a leaf node i when T < ∞. Then, as long as the root node has a nonzero probability P₀ > 0 of being the source of a cascade, it can be shown that the incoherence condition reduces to the inequalities $(1 - \frac{α_{0 j}}{α_{0 i} + α_{0 j}}) e^{- (α_{0 i} + α_{0 j}) T} + \frac{α_{0 j}}{α_{0 i} + α_{0 j}} < 1 - ε (1 + e^{- α_{0 i} T}), j = 1, \dots, p : j \neq i$, which always hold for some ε > 0. If T → ∞, then the condition holds whenever ε < α_{0i}/(α_{0i} + max_{j: j≠i} α_{0j}). Here, the larger the ratio max_{j: j≠i} α_{0j}/α_{0i} is, the smaller the maximum value of ε for which the incoherence condition holds. To summarize, as long as P₀ > 0, there is always some ε > 0 for which the condition holds, and such ε value depends on the time window and the parameters α_{0j}.
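The star-graph inequalities can be rearranged to give, for each j ≠ i, the bound ε < [α_{0i}/(α_{0i} + α_{0j})] (1 − e^{−(α_{0i} + α_{0j}) T}) / (1 + e^{−α_{0i} T}), whose limit as T → ∞ is the expression quoted above. The short sketch below evaluates the smallest such bound; the function name `max_epsilon_star` and the rate values are hypothetical, and the rearrangement assumes the inequalities read exactly as stated in the text.

```python
import math


def max_epsilon_star(alpha_i, alpha_others, T):
    """Supremum of eps satisfying the star-graph incoherence inequalities.

    alpha_i: rate of the edge root -> i; alpha_others: rates root -> j, j != i.
    """
    bounds = []
    for alpha_j in alpha_others:
        share_i = alpha_i / (alpha_i + alpha_j)  # 1 - alpha_j / (alpha_i + alpha_j)
        numerator = share_i * (1.0 - math.exp(-(alpha_i + alpha_j) * T))
        bounds.append(numerator / (1.0 + math.exp(-alpha_i * T)))
    return min(bounds)


rates_to_others = [2.0, 0.5]  # made-up rates for two other leaves
for T in (1.0, 5.0, math.inf):
    print(T, max_epsilon_star(1.0, rates_to_others, T))
```

With T = ∞ the result equals α_{0i}/(α_{0i} + max_j α_{0j}) = 1/3 for these rates, and shorter observation windows give a smaller admissible ε, in line with the remark that ε depends on both the time window and the rates.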

Remarks for conditions 3 and 4

Well-known pairwise transmission likelihoods such as exponential, Rayleigh or Power-law, used in previous work (Gomez-Rodriguez et al., 2011), satisfy conditions 3 and 4.

6. Sample Complexity

How many cascades do we need to recover the network structure?

We will answer this question by providing a sample complexity analysis of the optimization in Eq. 4. Given the conditions spelled out in Section 5, we can show that the number of cascades needs to grow polynomially in the number of true parents of a node, and depends only logarithmically on the size of the network. This is a positive result, since the network size can be very large (millions or billions), but the number of parents of a node is usually small compared to the network size. More specifically, for each individual node, we have the following result:

Theorem 2

Consider an instance of the continuous-time diffusion model with parameters $α_{j i}^{*}$ and associated edges $E^{*}$ such that the model satisfies conditions 1-4, and let Cⁿ be a set of n cascades drawn from the model. Suppose that the regularization parameter λ_n is selected to satisfy

λ_{n} \geq 8 k_{3} \frac{2 - ε}{ε} \sqrt{\frac{\log p}{n}} .

(8)

Then, there exist positive constants L and K, independent of (n, p, d), such that if

n > L d^{3} \log p,

(9)

then the following properties hold with probability at least $1 - 2 \exp (- K λ_{n}^{2} n)$ :

For each node $i \in V$, the $ℓ_{1}$-regularized network inference problem defined in Eq. 4 has a unique solution, and so uniquely specifies a set of incoming edges of node i.
For each node $i \in V$, the estimated set of incoming edges does not include any false edges and includes all true edges.

Furthermore, suppose that the finite sample Hessian matrix $Q^{n}$ satisfies conditions 1 and 2. Then there exist positive constants L and K, independent of (n, p, d), such that the sample complexity can be improved to n > Ld² log p, with the other statements remaining the same.

Remarks

The above sample complexity is proved for each node separately for recovering its parents. Using a union bound, we can provide the sample complexity for recovering the entire network structure by joining these parent-child relations together. The resulting sample complexity and the choice of regularization parameters will remain largely the same, except that the dependency on d will change from d to d_max (the largest number of parents of a node), and the dependency on p will change from log p to 2 log N (N the number of nodes in the network).

6.1. Outline of Analysis

The proof of Theorem 2 uses a technique called the primal-dual witness method, previously used in the proof of sparsistency of Lasso (Wainwright, 2009) and high-dimensional Ising model selection (Ravikumar et al., 2010). To the best of our knowledge, the present work is the first that uses this technique in the context of diffusion network inference. First, we show that the optimal solutions to Eq. 4 share the same sparsity pattern, and under a further condition, the solution is unique (proven in Appendix D):

Lemma 3

Suppose that there exists an optimal primal-dual solution $(\hat{α}, \hat{μ})$ to Eq. 4 with an associated subgradient vector ẑ such that ${‖ {\hat{z}}_{S^{c}} ‖}_{\infty} < 1$ . Then, any optimal primal solution $\tilde{α}$ must have ${\tilde{α}}_{S^{c}} = 0$ . Moreover, if the Hessian sub-matrix $Q_{S S}^{n}$ is strictly positive definite, then $\hat{α}$ is the unique optimal solution.

Next, we will construct a primal-dual vector $(\hat{α}, \hat{μ})$ along with an associated subgradient vector ẑ. Furthermore, we will show that, under the assumptions on (n, p, d) stated in Theorem 2, our constructed solution satisfies the KKT optimality conditions to Eq. 4, and the primal vector has the same sparsity pattern as the true parameter α* , i.e.,

{\hat{α}}_{j} > 0, \forall j : α_{j}^{*} > 0,

(10)

{\hat{α}}_{j} = 0, \forall j : α_{j}^{*} = 0 .

(11)

Then, based on Lemma 3, we can deduce that the optimal solution to Eq. 4 correctly recovers the sparsity pattern of α*, and thus the incoming edges to node i.

More specifically, we start by realizing that a primal-dual optimal solution $(\tilde{α}, \tilde{μ})$ to Eq. 4 must satisfy the generalized Karush-Kuhn-Tucker (KKT) conditions (Boyd & Vandenberghe, 2004):

0 \in \nabla ℓ^{n} (\tilde{α}) + λ_{n} \tilde{z} - \tilde{μ},

(12)

{\tilde{μ}}_{j} {\tilde{α}}_{j} = 0,

(13)

{\tilde{μ}}_{j} \geq 0,

(14)

{\tilde{z}}_{j} = 1, \forall {\tilde{α}}_{j} > 0,

(15)

∣ {\tilde{z}}_{j} ∣ \leq 1, \forall {\tilde{α}}_{j} = 0,

(16)

where $ℓ^{n} (\tilde{α}) = - \frac{1}{n} \sum_{c \in C^{n}} \log g (t^{c}; \tilde{α})$ and z̃ denotes the subgradient of the $ℓ_{1}$ -norm.

Suppose the true set of parents of node i is S. We construct the primal-dual vector $(\hat{α}, \hat{μ})$ and the associated subgradient vector ẑ in the following way:

(a) We set ${\hat{α}}_{S}$ as the solution to the partial regularized maximum likelihood problem
${\hat{α}}_{S} = \underset{α = (α_{S}, 0), α_{S} \geq 0}{argmin} \{ℓ^{n} (α) + λ_{n} {‖ α_{S} ‖}_{1}\} .$ (17)
Then, we set ${\hat{μ}}_{S} \geq 0$ as the dual solution associated to the primal solution ${\hat{α}}_{S}$.
(b) We set ${\hat{α}}_{S^{c}} = 0$, so that condition (11) holds, and ${\hat{μ}}_{S^{c}} = μ_{S^{c}}^{*} \geq 0$, where μ* is the optimal dual solution to the following problem:
$\begin{matrix} {minimize}_{α} E_{c} [ℓ^{n} (α)] \\ subject to α_{j} \geq 0, j = 1, \dots, N, i \neq j . \end{matrix}$ (18)
Thus, our construction satisfies condition (14).
(c) We obtain ${\hat{z}}_{S^{c}}$ from (12) by substituting in the constructed $\hat{α}$, $\hat{μ}$ and ẑ_S.

Then, we only need to prove that, under the stated scalings of (n, p, d), with high probability, the remaining KKT conditions (10), (13), (15) and (16) hold.

For simplicity of exposition, we first assume that the dependency and incoherence conditions hold for the finite sample Hessian matrix $Q^{n}$. Later we will lift this restriction and only place these conditions on the population Hessian matrix $Q^{*}$. The following lemma shows that our constructed solution satisfies condition (10):

Lemma 4

Under condition 3, if the regularization parameter is selected to satisfy

\sqrt{d} λ_{n} \leq \frac{C_{\min}^{2}}{6 (k_{2} + 2 k_{1} \sqrt{C_{\max}})},

and ${‖ \nabla_{S} ℓ^{n} (α^{*}) ‖}_{\infty} \leq \frac{λ_{n}}{4}$, then,

{‖ {\hat{α}}_{S} - α_{S}^{*} ‖}_{2} \leq 3 \sqrt{d} λ_{n} ∕ C_{\min} \leq α_{\min}^{*} ∕ 2,

as long as $α_{\min}^{*} \geq 6 \sqrt{d} λ_{n} ∕ C_{\min}$. Based on this lemma, we can then further show that the KKT conditions (13) and (15) also hold for the constructed solution. This can be trivially deduced from conditions (10) and (11), and our construction steps (a) and (b). Note that it also implies that ${\hat{μ}}_{S} = μ_{S}^{*} = 0$, and hence $\hat{μ} = μ^{*}$.

Proving condition (16) is more challenging. We first provide more details on how to construct ${\hat{z}}_{S^{c}}$ mentioned in step (c). We start by using a Taylor expansion of Eq. 12,

Q^{n} (\hat{α} - α^{*}) = - \nabla ℓ^{n} (α^{*}) - λ_{n} \hat{z} + \hat{μ} - R^{n},

(19)

where Rⁿ is a remainder term with its j-th entry

R_{j}^{n} = {[\nabla^{2} ℓ^{n} ({\bar{α}}_{j}) - \nabla^{2} ℓ^{n} (α^{*})]}_{j}^{T} (\hat{α} - α^{*}),

and ${\bar{α}}_{j} = θ_{j} \hat{α} + (1 - θ_{j}) α^{*}$ with θ_j ∈ [0, 1] according to the mean value theorem. Rewriting Eq. 19 using block matrices

(\begin{matrix} Q_{S S}^{n} & Q_{S S^{c}}^{n} \\ Q_{S^{c} S}^{n} & Q_{S^{c} S^{c}}^{n} \end{matrix}) (\begin{matrix} {\hat{α}}_{S} - α_{S}^{*} \\ {\hat{α}}_{S^{c}} - α_{S^{c}}^{*} \end{matrix}) = - (\begin{matrix} \nabla_{S} ℓ^{n} (α^{*}) \\ \nabla_{S^{c}} ℓ^{n} (α^{*}) \end{matrix}) - λ_{n} (\begin{matrix} {\hat{z}}_{S} \\ {\hat{z}}_{S^{c}} \end{matrix}) + (\begin{matrix} {\hat{μ}}_{S} \\ {\hat{μ}}_{S^{c}} \end{matrix}) - (\begin{matrix} R_{S}^{n} \\ R_{S^{c}}^{n} \end{matrix})

and, after some algebraic manipulation, we have

\begin{aligned} λ_{n} {\hat{z}}_{S^{c}} ={} & - \nabla_{S^{c}} ℓ^{n} (α^{*}) + {\hat{μ}}_{S^{c}} - R_{S^{c}}^{n} \\ & - Q_{S^{c} S}^{n} {(Q_{S S}^{n})}^{- 1} (- \nabla_{S} ℓ^{n} (α^{*}) - λ_{n} {\hat{z}}_{S} + {\hat{μ}}_{S} - R_{S}^{n}) . \end{aligned}

Next, we upper bound ${‖ {\hat{z}}_{S^{c}} ‖}_{\infty}$ using the triangle inequality

\begin{matrix} {‖ {\hat{z}}_{S^{c}} ‖}_{\infty} & \leq λ_{n}^{- 1} {‖ μ_{S^{c}}^{*} - \nabla_{S^{c}} ℓ^{n} (α^{*}) ‖}_{\infty} + λ_{n}^{- 1} {‖ R_{S^{c}}^{n} ‖}_{\infty} \\ + {‖ Q_{S^{c} S}^{n} {(Q_{S S}^{n})}^{- 1} ‖}_{\infty} \times [1 + λ_{n}^{- 1} {‖ R_{S}^{n} ‖}_{\infty} \\ + λ_{n}^{- 1} {‖ μ_{S}^{*} - \nabla_{S} ℓ^{n} (α^{*}) ‖}_{\infty}], \end{matrix}

and we want to prove that this upper bound is smaller than 1. This can be done with the help of the following two lemmas (proven in Appendices F and G):

Lemma 5

Given ε ∈ (0, 1] from the incoherence condition, we have,

P (\frac{2 - ε}{λ_{n}} {‖ \nabla ℓ^{n} (α^{*}) - μ^{*} ‖}_{\infty} \geq 4^{- 1} ε) \leq 2 p \exp (- \frac{n λ_{n}^{2} ε^{2}}{32 k_{3}^{2} {(2 - ε)}^{2}}),

which converges to zero at rate $\exp (- c λ_{n}^{2} n)$ as long as $λ_{n} \geq 8 k_{3} \frac{2 - ε}{ε} \sqrt{\frac{\log p}{n}}$ .

Lemma 6

Given ε ∈ (0, 1] from the incoherence condition, if conditions 3 and 4 hold, λ_n is selected to satisfy

λ_{n} d \leq C_{\min}^{2} \frac{ε}{36 K (2 - ε)},

where $K = k_{1} + k_{4} k_{1} + k_{1}^{2} + k_{1} \sqrt{C_{\max}}$ , and ${‖ \nabla_{S} ℓ^{n} (α^{*}) ‖}_{\infty} \leq \frac{λ_{n}}{4}$ , then, $\frac{{‖ R^{n} ‖}_{\infty}}{λ_{n}} \leq \frac{ε}{4 (2 - ε)}$ , as long as $α_{\min}^{*} \geq 6 \sqrt{d} λ_{n} ∕ C_{\min}$ .

Now, applying both lemmas and the incoherence condition on the finite sample Hessian matrix $Q^{n}$ , we have

\begin{matrix} {‖ {\hat{z}}_{S^{c}} ‖}_{\infty} & \leq (1 - ε) + λ_{n}^{- 1} (2 - ε) {‖ R^{n} ‖}_{\infty} \\ & + λ_{n}^{- 1} (2 - ε) {‖ μ^{*} - \nabla ℓ^{n} (α^{*}) ‖}_{\infty} \\ & \leq (1 - ε) + 0.25 ε + 0.25 ε = 1 - 0.5 ε, \end{matrix}

and thus condition (16) holds.

A possible choice of the regularization parameter λ_n and cascade set size n such that the conditions of Lemmas 4-6 are satisfied is $λ_{n} = 8 k_{3} (2 - ε) ε^{- 1} \sqrt{n^{- 1} \log p}$ and $n > 288^{2} k_{3}^{2} {(2 - ε)}^{4} C_{\min}^{- 4} ε^{- 4} d^{2} \log p + {(48 k_{3} (2 - ε) C_{\min}^{- 1} {(α_{\min}^{*})}^{- 1} ε^{- 1})}^{2} d \log p$ .
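The choice above can be evaluated numerically. The following is a minimal sketch, assuming natural logarithms and purely hypothetical values for the constants k₃, ε, C_min and α*_min (the paper does not report numerical values for them); the function names are illustrative.

```python
import math


def lambda_n(n: int, p: int, k3: float, eps: float) -> float:
    """Regularization parameter 8 * k3 * (2 - eps) / eps * sqrt(log(p) / n)."""
    return 8.0 * k3 * (2.0 - eps) / eps * math.sqrt(math.log(p) / n)


def min_cascades(p: int, d: int, k3: float, eps: float,
                 c_min: float, alpha_min: float) -> float:
    """Right-hand side of the lower bound on the cascade set size n."""
    log_p = math.log(p)
    # First term: 288^2 k3^2 (2 - eps)^4 C_min^-4 eps^-4 d^2 log p
    term1 = (288.0 ** 2) * k3 ** 2 * (2.0 - eps) ** 4 / (c_min ** 4 * eps ** 4) * d ** 2 * log_p
    # Second term: (48 k3 (2 - eps) / (C_min alpha_min eps))^2 d log p
    term2 = (48.0 * k3 * (2.0 - eps) / (c_min * alpha_min * eps)) ** 2 * d * log_p
    return term1 + term2


# Hypothetical constants, for illustration only.
n_needed = math.floor(min_cascades(p=128, d=3, k3=1.0, eps=0.5, c_min=1.0, alpha_min=0.5)) + 1
lam = lambda_n(n_needed, p=128, k3=1.0, eps=0.5)
```

With these choices the second term of the bound is exactly what makes the requirement α*_min ≥ 6√d λ_n / C_min of Lemma 4 hold, which can be checked directly from `lam`.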

Last, we lift the dependency and incoherence conditions imposed on the finite sample Hessian matrix $Q^{n}$ . We show that if we only impose these conditions in the corresponding population matrix $Q^{*}$ , then they will also hold for $Q^{n}$ with high probability (proven in Appendices H and I).

Lemma 7

If condition 1 holds for $Q^{*}$ , then, for any δ > 0,

P (Λ_{\min} (Q_{S S}^{n}) \leq C_{\min} - δ) \leq 2 d^{B_{1}} \exp (- A_{1} \frac{δ^{2} n}{d^{2}}),

P (Λ_{\max} (Q_{S S}^{n}) \geq C_{\max} + δ) \leq 2 d^{B_{2}} \exp (- A_{2} \frac{δ^{2} n}{d^{2}}),

where A₁, A₂, B₁ and B₂ are constants independent of (n, p, d).

Lemma 8

If ${∣ ∣ ∣ Q_{S^{c} S}^{*} {(Q_{S S}^{*})}^{- 1} ∣ ∣ ∣}_{\infty} \leq 1 - ε$ , then,

P ({‖ Q_{S^{c} S}^{n} {(Q_{S S}^{n})}^{- 1} ‖}_{\infty} \geq 1 - ε ∕ 2) \leq p \exp (- K \frac{n}{d^{3}}),

where K is a constant independent of (n, p, d).

Note that in this case the cascade set size needs to increase to n > Ld³ log p, where L is a sufficiently large positive constant independent of (n, p, d), for the error probabilities in these last two lemmas to converge to zero.

7. Efficient soft-thresholding algorithm

Can we design efficient algorithms to solve Eq. (4) for network recovery?

Here, we design a proximal gradient algorithm. Proximal gradient methods are well suited for solving non-smooth, constrained, large-scale or high-dimensional convex optimization problems (Parikh & Boyd, 2013). Moreover, they are easy to understand, derive, and implement. We first rewrite Eq. 4 as an unconstrained optimization problem:

{minimize}_{α} ℓ^{n} (α) + g (α),

where the non-smooth convex function g(α) = λ_n ∥α∥₁ if α ≥ 0 and +∞ otherwise. Here, the general recipe from Parikh & Boyd (2013) for designing proximal gradient algorithms can be applied directly.

Algorithm 1 summarizes the resulting algorithm. In each iteration of the algorithm, we need to compute $\nabla ℓ^{n}$ (Table 1) and the proximal operator prox_L^k_g(v), where L^k is a step size that we can set to a constant value L or find using a simple line search (Beck & Teboulle, 2009). Using Moreau's decomposition and the conjugate function g* , it is easy to show that the proximal operator for our particular function g(·) is a soft-thresholding operator, (v − λ_nL^k)₊, which leads to a sparse optimal solution $\hat{α}$ , as desired.
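The iteration described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a constant step size, returns the final iterate, and replaces the gradient of the cascade log-likelihood (Table 1) with a hypothetical toy gradient `toy_grad` whose regularized minimizer is known in closed form.

```python
from typing import Callable, List, Sequence


def soft_threshold_nonneg(v: Sequence[float], tau: float) -> List[float]:
    """Elementwise (v - tau)_+, the soft-thresholding step for g with threshold tau."""
    return [max(x - tau, 0.0) for x in v]


def proximal_gradient(
    grad: Callable[[Sequence[float]], Sequence[float]],
    alpha0: Sequence[float],
    lam: float,
    step: float,
    num_iters: int,
) -> List[float]:
    """Iterate alpha <- (alpha - step * grad(alpha) - lam * step)_+ with a constant step."""
    alpha = list(alpha0)
    for _ in range(num_iters):
        g = grad(alpha)
        shifted = [a - step * gi for a, gi in zip(alpha, g)]
        alpha = soft_threshold_nonneg(shifted, lam * step)
    return alpha


# Hypothetical stand-in for the gradient of the log-likelihood:
# the gradient of 0.5 * ||alpha - b||^2, used only because its solution is known.
def toy_grad(alpha: Sequence[float], b: Sequence[float] = (1.0, 0.2, -0.5)) -> List[float]:
    return [a - bi for a, bi in zip(alpha, b)]


alpha_hat = proximal_gradient(toy_grad, [0.0, 0.0, 0.0], lam=0.3, step=0.5, num_iters=200)
```

For this toy objective the nonnegative ℓ₁-regularized minimizer is (b − λ)₊, so `alpha_hat` should approach (0.7, 0, 0); the two coordinates whose entries of b fall below λ are set exactly to zero, which is the sparsity effect of the soft-thresholding step.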

8. Experiments

In this section, we first illustrate some consequences of Th. 2 by applying our algorithm to several types of networks, parameters (n, p, d), and regularization parameter λ_n. Then, we compare our algorithm to two different state-of-the-art algorithms: NetRate (Gomez-Rodriguez et al., 2011) and First-Edge (Abrahao et al., 2013).

Experimental Setup

We focus on synthetic networks that mimic the structure of real-world diffusion networks, in particular, social networks. We consider two models of directed real-world social networks: the Forest Fire model (Barabási & Albert, 1999) and the Kronecker Graph model (Leskovec et al., 2010), and use simple pairwise transmission models such as exponential, power-law or Rayleigh. We use networks with 128 nodes and, for each edge, we draw its associated transmission rate from a uniform distribution U(0.5, 1.5). We proceed as follows: we generate a network $G^{*}$ and transmission rates A*, simulate a set of cascades and, for each cascade, record the node infection times. Then, given the infection times, we infer a network $\hat{G}$ . Finally, when we illustrate the consequences of Th. 2, we evaluate the accuracy of the inferred neighborhood of a node ${\hat{N}}^{-} (i)$ using the probability of success $P (\hat{E} = E^{*})$ , estimated by running our method on 100 independent cascade sets. When we compare our algorithm to NetRate and First-Edge, we use the F₁ score, which is defined as 2PR/(P + R), where precision (P) is the fraction of edges in the inferred network $\hat{G}$ present in the true network $G^{*}$ , and recall (R) is the fraction of edges of the true network $G^{*}$ present in the inferred network $\hat{G}$ .
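The two accuracy measures described above can be computed as follows. This is a minimal sketch, assuming edges are represented as (source, target) pairs of integer node ids; the function names and the convention of returning 0 when there are no correctly inferred edges are illustrative choices, not taken from the paper.

```python
from typing import Iterable, Tuple

Edge = Tuple[int, int]


def f1_score(inferred_edges: Iterable[Edge], true_edges: Iterable[Edge]) -> float:
    """F1 = 2PR / (P + R) for an inferred edge set against the true edge set."""
    inferred, truth = set(inferred_edges), set(true_edges)
    true_positives = len(inferred & truth)
    if true_positives == 0:
        return 0.0  # convention when precision and recall are both zero or undefined
    precision = true_positives / len(inferred)
    recall = true_positives / len(truth)
    return 2.0 * precision * recall / (precision + recall)


def success_probability(inferred_runs: Iterable[Iterable[Edge]],
                        true_edges: Iterable[Edge]) -> float:
    """Fraction of independent runs in which the inferred edge set equals the true one."""
    truth = set(true_edges)
    runs = [set(run) for run in inferred_runs]
    if not runs:
        raise ValueError("at least one run is required")
    return sum(1 for run in runs if run == truth) / len(runs)


true_edges = [(1, 2), (2, 3), (3, 4)]
score = f1_score([(1, 2), (2, 3), (4, 1)], true_edges)
p_success = success_probability([true_edges, [(1, 2)], true_edges, true_edges], true_edges)
```

In the example, two of the three inferred edges are correct and two of the three true edges are recovered, so precision and recall are both 2/3; three of the four runs recover the edge set exactly.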

Parameters

(n, p, d) According to Th. 2, the number of cascades that are necessary to successfully infer the incoming edges of a node will increase polynomially in the node's neighborhood size d_i and logarithmically in the super-neighborhood size p_i. Here, we infer the incoming links of nodes of a hierarchical Kronecker network with the same in-degree (d_i = 3) but different super-neighborhood set sizes p_i under different scalings β of the number of cascades n = 10βd log p, and choose the regularization parameter λ_n as a constant factor of $\sqrt{\log (p) ∕ n}$ as suggested by Th. 2. We used an exponential transmission model and T = 5. Fig. 3(a) summarizes the results, where, for each node, we used cascades which contained at least one node in the super-neighborhood of the node under study. As predicted by Th. 2, very different p values lead to curves that line up with each other quite well.

Regularization parameter

λ_n Our main result indicates that the regularization parameter λ_n should be a constant factor of $\sqrt{\log (p) ∕ n}$ . Fig. 3(b) shows the success probability of our algorithm against different scalings K of the regularization parameter $λ_{n} = K \sqrt{\log (p) ∕ n}$ for different types of networks using 150 cascades and T = 5. We find that for sufficiently large λ_n, the success probability flattens, as expected from Th. 2. It flattens at values smaller than one because we used a fixed number of cascades n, which may not satisfy the conditions of Th. 2.
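The two experimental scalings, n = 10βd log p for the number of cascades and λ_n = K√(log(p)/n) for the regularization parameter, can be tabulated with a short helper. This is a sketch under assumptions: natural logarithms are used and n is rounded up to an integer, neither of which is stated in the text, and the function names are illustrative.

```python
import math


def num_cascades(beta: float, d: int, p: int) -> int:
    """Number of cascades n = 10 * beta * d * log(p), rounded up (rounding is an assumption)."""
    return math.ceil(10.0 * beta * d * math.log(p))


def regularization(K: float, p: int, n: int) -> float:
    """Regularization parameter lambda_n = K * sqrt(log(p) / n)."""
    return K * math.sqrt(math.log(p) / n)


# Sweep the scaling beta for a node with in-degree 3 and a hypothetical p = 128.
schedule = [(beta, num_cascades(beta, d=3, p=128)) for beta in (0.5, 1.0, 2.0)]
lam_150 = regularization(K=1.0, p=128, n=150)
```

Because n grows only logarithmically in p under this scaling, plotting success probability against β rather than against n is what allows curves for very different p to be compared on one axis.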

Comparison with NetRate and First-Edge

Fig. 4 compares the accuracy of our algorithm, NetRate and First-Edge against the number of cascades for a hierarchical Kronecker network with power-law transmission model and a Forest Fire network with exponential transmission model, with an observation window T = 10. Our method outperforms both competing methods, and its advantage over First-Edge is especially striking.

9. Conclusions

Our work contributes towards establishing a theoretical foundation of the network inference problem. Specifically, we proposed an $ℓ_{1}$ -regularized maximum likelihood inference method for a well-known continuous-time diffusion model and an efficient proximal gradient implementation, and then showed that, for general networks satisfying a natural incoherence condition, our method achieves an exponentially decreasing error with respect to the number of cascades as long as O(d³ log N) cascades are recorded.

Our work also opens many interesting avenues for future work. For example, given a fixed number of cascades, it would be useful to provide confidence intervals on the inferred edges. Further, given a network with arbitrary pairwise likelihoods, it is an open question whether there always exists at least one source distribution and time window value such that the incoherence condition is satisfied, and, if so, whether there is an efficient way of finding this distribution. Finally, our work assumes all activations occur due to network diffusion and are recorded. It would be interesting to allow for missing observations, as well as activations due to exogenous factors.

A. Proof of Lemma 9

Lemma 9

Given log-concave survival functions and concave hazard functions in the parameter(s) of the pairwise transmission likelihoods, then, a sufficient condition for the Hessian matrix $Q^{n}$ to be positive definite is that the hazard matrix Xⁿ(α) is non-singular.

Proof

Using Eq. 5, the Hessian matrix can be expressed as a sum of two matrices, Dⁿ(α) and Xⁿ(α)Xⁿ(α)^Τ. The matrix Dⁿ(α) is trivially positive semidefinite by log-concavity of the survival functions and concavity of the hazard functions. The matrix Xⁿ(α)Xⁿ(α)^Τ is positive definite since Xⁿ(α) is full rank by assumption. Then, the Hessian matrix is positive definite since it is the sum of a positive semidefinite matrix and a positive definite matrix.

B. Proof of Lemma 10

Lemma 10

If the source probability $P (s)$ is strictly positive for all $s \in R$ , then, for an arbitrarily large number of cascades n → ∞, there exists an ordering of the nodes and cascades within the cascade set such that the hazard matrix Xⁿ(α) is non-singular.

Proof

In this proof, we find a labeling of the nodes (row indices in Xⁿ(α)) and an ordering of the cascades (column indices in Xⁿ(α)), such that, for an arbitrarily large number of cascades, we can express the matrix Xⁿ(α) as [T B], where $T \in R^{p \times p}$ is an upper triangular matrix with nonzero diagonal elements and $B \in R^{p \times (n - p)}$ . Therefore, Xⁿ(α) has full rank (rank p). We proceed first by sorting nodes in $R$ and then continue by sorting nodes in $U$ :

• Nodes in $R$

For each node $u \in R$ , consider the set of cascades C_u in which u was a source and i got infected. Then, rank each node u according to the earliest position in which node i got infected across all cascades in C_u in decreasing order, breaking ties at random. For example, if a node u was, at least once, the source of a cascade in which node i got infected just after the source, but in contrast, node v was never the source of a cascade in which node i got infected second, then node u will have a lower index than node v. Then, assign row k in the matrix Xⁿ(α) to the node in position k and assign the first d columns to the corresponding cascades in which node i got infected earlier. In such an ordering, Xⁿ(α)_mk = 0 for all m < k and Xⁿ(α)_kk ≠ 0.

• Nodes in $U$

Proceed similarly as in the first step for the nodes in $U$ , and assign them the rows d + 1 to p. Moreover, we assign the columns d + 1 to p to the corresponding cascades in which node i got infected earlier. Again, this ordering satisfies that Xⁿ(α)_mk = 0 for all m < k and Xⁿ(α)_kk ≠ 0. Finally, the remaining n − p columns can be assigned to the remaining cascades at random.

This ordering leads to the desired structure [T B], and thus it is non-singular.

C. Proof of Eq 7

Assume the hazard vector X(t^c; α) is Lipschitz continuous in the domain $\{α : α_{S} \geq \frac{α_{\min}^{*}}{2}\}$ , i.e.,

{‖ X (t^{c}; β) - X (t^{c}; α) ‖}_{2} \leq k_{1} {‖ β - α ‖}_{2},

where k₁ is some positive constant. Then, we can bound the spectral norm of the difference, $\frac{1}{\sqrt{n}} (X^{n} (β) - X^{n} (α))$ , in the domain $\{α : α_{S} \geq \frac{α_{\min}^{*}}{2}\}$ as follows:

\begin{matrix} {∣ ∣ ∣ \frac{1}{\sqrt{n}} (X^{n} (β) - X^{n} (α)) ∣ ∣ ∣}_{2} \\ = \max_{{‖ u ‖}_{2} = 1} \frac{1}{\sqrt{n}} {‖ u (X^{n} (β) - X^{n} (α)) ‖}_{2} \\ = \max_{{‖ u ‖}_{2} = 1} \frac{1}{\sqrt{n}} \sqrt{\sum_{c = 1}^{n} {〈 u, X (t^{c}; β) - X (t^{c}; α) 〉}^{2}} \\ \leq \frac{1}{\sqrt{n}} \sqrt{k_{1}^{2} n {‖ u ‖}_{2}^{2} {‖ β - α ‖}_{2}^{2}} \\ \leq k_{1} {‖ β - α ‖}_{2} . \end{matrix}

D. Proof of Lemma 3

By Lagrangian duality, the regularized network inference problem defined in Eq. 4 is equivalent to the following constrained optimization problem:

\begin{matrix} {minimize}_{α_{i}} & ℓ^{n} (α_{i}) \\ \text{subject to} & α_{j i} \geq 0, j = 1, \dots, N, i \neq j, \\ & {‖ α_{i} ‖}_{1} \leq C (λ_{n}) \end{matrix}

(20)

where C(λ_n) < ∞ is a positive constant. In this alternative formulation, λ_n is the Lagrange multiplier for the second constraint. Since λ_n is strictly positive, the constraint is active at any optimal solution, and thus ∥α∥₁ is constant across all optimal solutions.

Using that $ℓ^{n} (α_{i})$ is a differentiable convex function by assumption and {α : α_ji ≥ 0, ∥α_i∥₁ ≤ C(λ_n)} is a convex set, we have that $\nabla ℓ^{n} (α_{i})$ is constant across optimal primal solutions (Mangasarian, 1988). Moreover, any optimal primal-dual solution in the original problem must satisfy the KKT conditions in the alternative formulation defined by Eq. 20, in particular,

\nabla ℓ^{n} (α_{i}) = - λ_{n} z + μ,

where μ ≥ 0 are the Lagrange multipliers associated to the non negativity constraints and z denotes the subgradient of the $ℓ_{1}$ -norm.

Consider the solution $\hat{α}$ such that ${‖ {\hat{z}}_{S^{c}} ‖}_{\infty} < 1$ and thus $\nabla_{α_{S^{c}}} ℓ^{n} ({\hat{α}}_{i}) = - λ_{n} {\hat{z}}_{S^{c}} + {\hat{μ}}_{S^{c}}$ . Now, assume there is an optimal primal solution $\tilde{α}$ such that ${\tilde{α}}_{j i} > 0$ for some j ∈ S^c. Then, using that the gradient must be constant across optimal solutions, it should hold that $- λ_{n} {\hat{z}}_{j} + {\hat{μ}}_{j} = - λ_{n}$ , where ${\tilde{μ}}_{j i} = 0$ by complementary slackness, which implies ${\hat{μ}}_{j} = - λ_{n} (1 - {\hat{z}}_{j}) < 0$ . Since ${\hat{μ}}_{j} \geq 0$ by assumption, this leads to a contradiction. Then, any primal solution $\tilde{α}$ must satisfy ${\tilde{α}}_{S^{c}} = 0$ for the gradient to be constant across optimal solutions.

Finally, since $α_{S^{c}} = 0$ for all optimal solutions, we can consider the restricted optimization problem defined in Eq. 17. If the Hessian sub-matrix ${[\nabla^{2} L (\tilde{α})]}_{S S}$ is strictly positive definite, then this restricted optimization problem is strictly convex and the optimal solution must be unique.

E. Proof of Lemma 4

To prove this lemma, we will first construct a function

G (u_{S}) ≔ ℓ^{n} (α_{S}^{*} + u_{S}) - ℓ^{n} (α_{S}^{*}) + λ_{n} ({‖ α_{S}^{*} + u_{S} ‖}_{1} - {‖ α_{S}^{*} ‖}_{1}),

whose domain is restricted to the convex set $U = \{u_{S} ∣ α_{S}^{*} + u_{S} \geq 0\}$ . By construction, G(u_S) has the following properties:

1. It is convex with respect to u_S.
2. Its minimum is obtained at ${\hat{u}}_{S} ≔ {\hat{α}}_{S} - α_{S}^{*}$ . That is, G(û_S) ≤ G(u_S), ∀u_S ≠ û_S.
3. G(û_S) ≤ G(0) = 0.

Based on properties 1 and 3, we deduce that any point in the segment, $L ≔ \{{\tilde{u}}_{S} : {\tilde{u}}_{S} = t {\hat{u}}_{S} + (1 - t) 0, t \in [0, 1]\}$ , connecting û_S and 0 has G(ũ_S) ≤ 0. That is,

\begin{matrix} G ({\tilde{u}}_{S}) & = G (t {\hat{u}}_{S} + (1 - t) 0) \\ \leq t G ({\hat{u}}_{S}) + (1 - t) G (0) \leq 0 . \end{matrix}

Next, we will find a sphere centered at 0 with strictly positive radius B, $S (B) ≔ \{u_{S} : {‖ u_{S} ‖}_{2} = B\}$ , such that function G(u_S) > 0 (strictly positive) on $S (B)$ . We note that this sphere $S (B)$ cannot intersect with the segment $L$ since the two sets have strictly different function values. Furthermore, the only possible configuration is that the segment is contained inside the sphere entirely, leading us to conclude that the end point ${\hat{u}}_{S} ≔ {\hat{α}}_{S} - α_{S}^{*}$ is also within the sphere. That is, ${‖ {\hat{α}}_{S} - α_{S}^{*} ‖}_{2} \leq B$ .

In the following, we will provide details on finding such a suitable B which will be a function of the regularization parameter λ_n and the neighborhood size d. More specifically, we will start by applying a Taylor series expansion and the mean value theorem,

G (u_{S}) = \nabla_{S} ℓ^{n} {(α_{S}^{*})}^{T} u_{S} + u_{S}^{T} \nabla_{S S}^{2} ℓ^{n} (α_{S}^{*} + b u_{S}) u_{S} + λ_{n} ({‖ α_{S}^{*} + u_{S} ‖}_{1} - {‖ α_{S}^{*} ‖}_{1}),

(21)

where b ∈ [0, 1]. We will show that G(u_S) > 0 by bounding below each term of the above equation separately.

We bound the absolute value of the first term using the assumption on the gradient, $\nabla_{S} ℓ (\cdot)$ ,

∣ \nabla_{S} ℓ^{n} {(α_{S}^{*})}^{T} u_{S} ∣ \leq {‖ \nabla_{S} ℓ ‖}_{\infty} {‖ u_{S} ‖}_{1} \leq {‖ \nabla_{S} ℓ ‖}_{\infty} \sqrt{d} {‖ u_{S} ‖}_{2} \leq 4^{- 1} λ_{n} B \sqrt{d} .

(22)

We bound the absolute value of the last term using the reverse triangle inequality.

λ_{n} ∣ {‖ α_{S}^{*} + u_{S} ‖}_{1} - {‖ α_{S}^{*} ‖}_{1} ∣ \leq λ_{n} {‖ u_{S} ‖}_{1} \leq λ_{n} \sqrt{d} {‖ u_{S} ‖}_{2} .

(23)

Bounding the remaining middle term is more challenging. We start by rewriting the Hessian as a sum of two matrices, using Eq. 5,

\begin{matrix} q & = \min_{u_{S}} u_{S}^{T} D_{S S}^{n} (α_{S}^{*} + b u_{S}) u_{S} \\ & + n^{- 1} u_{S}^{T} X_{S}^{n} (α_{S}^{*} + b u_{S}) X_{S}^{n} {(α_{S}^{*} + b u_{S})}^{T} u_{S} \\ & = \min_{u_{S}} u_{S}^{T} D_{S S}^{n} (α_{S}^{*} + b u_{S}) u_{S} + n^{- 1} {‖ u_{S}^{T} X_{S}^{n} (α_{S}^{*} + b u_{S}) ‖}_{2}^{2} . \end{matrix}

Now, we introduce two additional quantities,

\begin{matrix} Δ D_{S S}^{n} = D_{S S}^{n} (α_{S}^{*} + b u_{S}) - D_{S S}^{n} (α_{S}^{*}) \\ Δ X_{S}^{n} = X_{S}^{n} (α_{S}^{*} + b u_{S}) - X_{S}^{n} (α_{S}^{*}), \end{matrix}

and rewrite q as

\begin{matrix} q & = \min_{u_{S}} [u_{S}^{T} D_{S S}^{n} (α_{S}^{*}) u_{S} + n^{- 1} {‖ u_{S}^{T} X_{S}^{n} (α_{S}^{*}) ‖}_{2}^{2} \\ & + n^{- 1} {‖ u_{S}^{T} Δ X_{S}^{n} ‖}_{2}^{2} + u_{S}^{T} Δ D_{S S}^{n} u_{S} \\ & + 2 n^{- 1} 〈 u_{S}^{T} X_{S}^{n} (α_{S}^{*}), u_{S}^{T} Δ X_{S}^{n} 〉] . \end{matrix}

Next, we use the dependency condition,

\begin{matrix} q \geq C_{\min} B^{2} & - \max_{u_{S}} ∣ \underbrace{u_{S}^{T} Δ D_{S S}^{n} u_{S}}_{T_{1}} ∣ \\ & - \max_{u_{S}} 2 ∣ \underbrace{n^{- 1} 〈 u_{S}^{T} X_{S}^{n} (α_{S}^{*}), u_{S}^{T} Δ X_{S}^{n} 〉}_{T_{2}} ∣, \end{matrix}

and proceed to bound T₁ and T₂ separately. First, we bound T₁ using the Lipschitz condition,

\begin{matrix} ∣ T_{1} ∣ & = ∣ \sum_{k \in S} u_{k}^{2} [D_{k}^{n} (α_{S}^{*} + b u_{S}) - D_{k}^{n} (α_{S}^{*})] ∣ \\ & \leq \sum_{k \in S} u_{k}^{2} k_{2} {‖ b u_{S} ‖}_{2} \\ & \leq k_{2} B^{3} . \end{matrix}

Then, we use the dependency condition, the Lipschitz condition and the Cauchy-Schwarz inequality to bound T₂,

\begin{matrix} T_{2} & \leq \frac{1}{\sqrt{n}} {‖ u_{S}^{T} X_{S}^{n} (α_{S}^{*}) ‖}_{2} \frac{1}{\sqrt{n}} {‖ u_{S}^{T} Δ X_{S}^{n} ‖}_{2} \\ \leq \sqrt{C_{\max}} B \frac{1}{\sqrt{n}} {‖ u_{S}^{T} Δ X_{S}^{n} ‖}_{2} \\ \leq \sqrt{C_{\max}} B {‖ u_{S} ‖}_{2} \frac{1}{\sqrt{n}} {∣ ‖ Δ X_{S}^{n} ‖ ∣}_{2} \\ \leq \sqrt{C_{\max}} B^{2} k_{1} {‖ b u_{S} ‖}_{2} \\ \leq k_{1} \sqrt{C_{\max}} B^{3}, \end{matrix}

where we note that applying the Lipschitz condition implies assuming $B < \frac{α_{\min}^{*}}{2}$ . Next, we incorporate the bounds of T₁ and T₂ to lower bound q,

q \geq C_{\min} B^{2} - (k_{2} + 2 k_{1} \sqrt{C_{\max}}) B^{3} .

(24)

Now, we set $B = K λ_{n} \sqrt{d}$ , where K is a constant that we will set later in the proof, and select the regularization parameter λ_n to satisfy $λ_{n} \sqrt{d} \leq 0.5 C_{\min} ∕ (K (k_{2} + 2 k_{1} \sqrt{C_{\max}}))$ . Then,

\begin{matrix} G (u_{S}) & \geq - 4^{- 1} λ_{n} \sqrt{d} B + 0.5 C_{\min} B^{2} - λ_{n} \sqrt{d} B \\ \geq B (0.5 C_{\min} B - 1.25 λ_{n} \sqrt{d}) \\ \geq B (0.5 C_{\min} K λ_{n} \sqrt{d} - 1.25 λ_{n} \sqrt{d}) . \end{matrix}

In the last step, we set the constant $K = 3 C_{\min}^{- 1}$ , and we have

G (u_{S}) \geq 0.25 λ_{n} \sqrt{d} B > 0,

as long as

\begin{matrix} \sqrt{d} λ_{n} \leq \frac{C_{\min}^{2}}{6 (k_{2} + 2 k_{1} \sqrt{C_{\max}})} \\ α_{\min}^{*} \geq \frac{6 λ_{n} \sqrt{d}}{C_{\min}} . \end{matrix}

Finally, convexity of G(u_S) yields

{‖ {\hat{α}}_{S} - α_{S}^{*} ‖}_{2} \leq 3 λ_{n} \sqrt{d} ∕ C_{\min} \leq \frac{α_{\min}^{*}}{2} .
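The arithmetic behind the choice K = 3/C_min can be checked numerically. The sketch below evaluates the lower bound on G(u_S) on the sphere of radius B = 3λ_n√d/C_min, combining Eqs. 22, 23 and 24; the numerical values of C_min, C_max, k₁ and k₂ are hypothetical and chosen only so that λ_n sits exactly at the largest value allowed by the condition on √d λ_n.

```python
import math


def lemma4_radius(lam: float, d: int, c_min: float) -> float:
    """Radius B = K * lam * sqrt(d) with K = 3 / C_min."""
    return 3.0 * lam * math.sqrt(d) / c_min


def lemma4_lower_bound(lam: float, d: int, c_min: float, c_max: float,
                       k1: float, k2: float) -> float:
    """Lower bound on G(u_S) for ||u_S||_2 = B, summing the three term-wise bounds."""
    B = lemma4_radius(lam, d, c_min)
    root_d_lam = lam * math.sqrt(d)
    gradient_term = -0.25 * root_d_lam * B                      # Eq. 22
    hessian_term = c_min * B ** 2 - (k2 + 2.0 * k1 * math.sqrt(c_max)) * B ** 3  # Eq. 24
    l1_term = -root_d_lam * B                                   # Eq. 23
    return gradient_term + hessian_term + l1_term


# Hypothetical constants; lam is the largest value allowed by
# sqrt(d) * lam <= C_min^2 / (6 * (k2 + 2 * k1 * sqrt(C_max))).
c_min, c_max, k1, k2, d = 1.0, 4.0, 1.0, 1.0, 4
lam = c_min ** 2 / (6.0 * (k2 + 2.0 * k1 * math.sqrt(c_max))) / math.sqrt(d)
bound = lemma4_lower_bound(lam, d, c_min, c_max, k1, k2)
```

At this boundary value of λ_n the cubic term equals exactly half of C_min B², so the bound reduces to B(0.5 C_min B − 1.25 λ_n√d) = 0.25 λ_n√d B, which is strictly positive; smaller values of λ_n only make the cubic term smaller relative to the quadratic one.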

F. Proof of Lemma 5

Define $z_{j}^{c} = {[\nabla g (t^{c}; α^{*})]}_{j}$ and $z_{j} = \frac{1}{n} \sum_{c} z_{j}^{c}$ . Now, using the KKT conditions and condition 4 (Boundedness), we have that $μ_{j}^{*} = E_{c} {z_{j}^{c}}$ and $∣ z_{j}^{c} ∣ \leq k_{3}$ , respectively.

Thus, Hoeffding's inequality yields

P (∣ z_{j} - μ_{j}^{*} ∣ > \frac{λ_{n} ε}{4 (2 - ε)}) \leq 2 \exp (- \frac{n λ_{n}^{2} ε^{2}}{32 k_{3}^{2} {(2 - ε)}^{2}}),

and then,

P ({‖ z - μ^{*} ‖}_{\infty} > \frac{λ_{n} ε}{4 (2 - ε)}) \leq 2 \exp (- \frac{n λ_{n}^{2} ε^{2}}{32 k_{3}^{2} {(2 - ε)}^{2}} + \log p) .
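To see why the threshold on λ_n in Lemma 5 drives this probability to zero, one can plug the smallest admissible value, λ_n = 8k₃(2 − ε)ε⁻¹√(log p / n), into the union bound 2p·exp(−nλ_n²ε²/(32k₃²(2 − ε)²)). The exponent then equals −2 log p, so the bound collapses to 2/p. The sketch below checks this numerically; the values of n, p, ε and k₃ are hypothetical.

```python
import math


def lemma5_bound(n: int, p: int, lam: float, eps: float, k3: float) -> float:
    """Union bound 2 * p * exp(-n lam^2 eps^2 / (32 k3^2 (2 - eps)^2))."""
    exponent = n * lam ** 2 * eps ** 2 / (32.0 * k3 ** 2 * (2.0 - eps) ** 2)
    return 2.0 * p * math.exp(-exponent)


def lambda_threshold(n: int, p: int, eps: float, k3: float) -> float:
    """Smallest lambda_n allowed by Lemma 5: 8 k3 (2 - eps) / eps * sqrt(log(p) / n)."""
    return 8.0 * k3 * (2.0 - eps) / eps * math.sqrt(math.log(p) / n)


# Hypothetical values, for illustration only.
n, p, eps, k3 = 500, 128, 0.5, 1.0
bound_at_threshold = lemma5_bound(n, p, lambda_threshold(n, p, eps, k3), eps, k3)
```

The resulting bound does not depend on n, ε or k₃ once λ_n is tied to them in this way, and any larger λ_n makes it smaller still.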

G. Proof of Lemma 6

We start by factorizing the Hessian matrix, using Eq. 5,

R_{j}^{n} = {[\nabla^{2} ℓ^{n} ({\overset{‒}{α}}_{j}) - \nabla^{2} ℓ^{n} (α^{*})]}_{j}^{T} (\hat{α} - α^{*}) = ω_{j}^{n} + δ_{j}^{n},

where,

\begin{matrix} ω_{j}^{n} & = {[D^{n} ({\overset{‒}{α}}_{j}) - D^{n} (α^{*})]}_{j}^{T} (\hat{α} - α^{*}) \\ δ_{j}^{n} & = \frac{1}{n} V_{j}^{n} (\hat{α} - α^{*}) \\ V_{j}^{n} & = {[X^{n} ({\overset{‒}{α}}_{j})]}_{j} X^{n} {({\overset{‒}{α}}_{j})}^{T} - {[X^{n} (α^{*})]}_{j} X^{n} {(α^{*})}^{T} . \end{matrix}

Next, we proceed to bound each term separately. Since ${[{\overset{‒}{α}}_{j}]}_{S} = θ_{j} {\hat{α}}_{S} + (1 - θ_{j}) α_{S}^{*}$ where θ_j ∈ [0, 1], and ${‖ {\hat{α}}_{S} - α_{S}^{*} ‖}_{\infty} \leq \frac{α_{\min}^{*}}{2}$ (Lemma 4), it holds that ${[{\overset{‒}{α}}_{j}]}_{S} \geq \frac{α_{\min}^{*}}{2}$ . Then, we can use condition 3 (Lipschitz Continuity) to bound $ω_{j}^{n}$ .

\begin{matrix} ∣ ω_{j}^{n} ∣ & \leq k_{1} {‖ {\overset{‒}{α}}_{j} - α^{*} ‖}_{2} {‖ \hat{α} - α^{*} ‖}_{2} \\ \leq k_{1} θ_{j} {‖ \hat{α} - α^{*} ‖}_{2}^{2} \\ \leq k_{1} {‖ \hat{α} - α^{*} ‖}_{2}^{2} . \end{matrix}

(25)

However, bounding term $δ_{j}^{n}$ is more difficult. Let us start by rewriting $δ_{j}^{n}$ as follows.

δ_{j}^{n} = \frac{1}{n} (Λ_{1} + Λ_{2} + Λ_{3}) (\hat{α} - α^{*}),

where,

\begin{matrix} Λ_{1} = & {[X^{n} (α^{*})]}_{j} (X^{n} {({\overset{‒}{α}}_{j})}^{T} - X^{n} {(α^{*})}^{T}) \\ Λ_{2} = & {{[X^{n} ({\overset{‒}{α}}_{j})]}_{j} - {[X^{n} (α^{*})]}_{j}} (X^{n} {({\overset{‒}{α}}_{j})}^{T} - X^{n} {(α^{*})}^{T}) \\ Λ_{3} = & ({[X^{n} ({\overset{‒}{α}}_{j})]}_{j} - {[X^{n} (α^{*})]}_{j}) X^{n} {(α^{*})}^{T} . \end{matrix}

Next, we bound each term separately. For the first term, we first apply Cauchy inequality,

∣ Λ_{1} (\hat{α} - α^{*}) ∣ \leq {‖ {[X^{n} (α^{*})]}_{j} ‖}_{2} \times {∣ ‖ X^{n} {({\overset{‒}{α}}_{j})}^{T} - X^{n} {(α^{*})}^{T} ‖ ∣}_{2} {‖ \hat{α} - α^{*} ‖}_{2},

and then use conditions 3 (Lipschitz Continuity) and 4 (Boundedness),

\begin{matrix} ∣ Λ_{1} (\hat{α} - α^{*}) ∣ & \leq n k_{4} k_{1} {‖ {\overset{‒}{α}}_{j} - α^{*} ‖}_{2} {‖ \hat{α} - α^{*} ‖}_{2} \\ \leq n k_{4} k_{1} {‖ \hat{α} - α^{*} ‖}_{2}^{2} . \end{matrix}

For the second term, we also start by applying Cauchy inequality,

∣ Λ_{2} (\hat{α} - α^{*}) ∣ \leq {‖ {[X^{n} ({\overset{‒}{α}}_{j})]}_{j} - {[X^{n} (α^{*})]}_{j} ‖}_{2} \times {∣ ‖ X^{n} {({\overset{‒}{α}}_{j})}^{T} - X^{n} {(α^{*})}^{T} ‖ ∣}_{2} {‖ \hat{α} - α^{*} ‖}_{2},

and then use condition 3 (Lipschitz Continuity),

∣ Λ_{2} (\hat{α} - α^{*}) ∣ \leq n k_{1}^{2} {‖ \hat{α} - α^{*} ‖}_{2}^{2} .

Last, for the third term, once more we start by applying Cauchy inequality,

∣ Λ_{3} (\hat{α} - α^{*}) ∣ \leq {‖ {[X^{n} ({\overset{‒}{α}}_{j})]}_{j} - {[X^{n} (α^{*})]}_{j} ‖}_{2} \times {∣ ‖ X^{n} {(α^{*})}^{T} ‖ ∣}_{2} {‖ \hat{α} - α^{*} ‖}_{2},

and then apply condition 1 (Dependency Condition) and condition 3 (Lipschitz Continuity),

∣ Λ_{3} (\hat{α} - α^{*}) ∣ \leq n k_{1} \sqrt{C_{\max}} {‖ \hat{α} - α^{*} ‖}_{2}^{2}

Now, we combine the bounds,

{‖ R^{n} ‖}_{\infty} \leq K {‖ \hat{α} - α^{*} ‖}_{2}^{2},

where

K = k_{1} + k_{4} k_{1} + k_{1}^{2} + k_{1} \sqrt{C_{\max}} .

Finally, using Lemma 4 and selecting the regularization parameter λ_n to satisfy $λ_{n} d \leq C_{\min}^{2} \frac{ε}{36 K (2 - ε)}$ yields:

\begin{matrix} {‖ R^{n} ‖}_{\infty} ∕ λ_{n} & \leq 9 K λ_{n} d ∕ C_{\min}^{2} \\ & \leq \frac{ε}{4 (2 - ε)} . \end{matrix}

H. Proof of Lemma 7

We will first bound the difference in terms of spectral norm between the population Fisher information matrix $Q_{S S}$ and the sample mean cascade log-likelihood $Q_{S S}^{n}$ . Define $z_{j k}^{c} = {[\nabla^{2} g (t^{c}; α^{*}) - \nabla^{2} ℓ^{n} (α^{*})]}_{j k}$ and $z_{j k} = \frac{1}{n} \sum_{c = 1}^{n} z_{j k}^{c}$ . Then, we can express the difference between the population Fisher information matrix $Q_{S S}$ and the sample mean cascade log-likelihood $Q_{S S}^{n}$ as:

{∣ ‖ Q_{S S}^{n} (α^{*}) - Q_{S S}^{*} (α^{*}) ‖ ∣}_{2} \leq {∣ ‖ Q_{S S}^{n} (α^{*}) - Q_{S S}^{*} (α^{*}) ‖ ∣}_{F} = \sqrt{\sum_{j = 1}^{d} \sum_{k = 1}^{d} {(z_{j k})}^{2}} .

Since $∣ z_{j k}^{c} ∣ \leq 2 k_{5}$ by condition 4, we can apply Hoeffding's inequality to each z_jk,

P (∣ z_{j k} ∣ \geq β) \leq 2 \exp (- \frac{β^{2} n}{8 k_{5}^{2}}),

(26)

and further,

P ({∣ ‖ Q_{S S}^{n} (α^{*}) - Q_{S S}^{*} (α^{*}) ‖ ∣}_{2} \geq δ) \leq 2 \exp (- K \frac{δ^{2} n}{d^{2}} + 2 \log d)

(27)

where $β^{2} = δ^{2} ∕ d^{2}$ . Now, we bound the maximum eigenvalue of $Q_{S S}^{n}$ as follows:

\begin{matrix} Λ_{\max} (Q_{S S}^{n}) & = \max_{{‖ x ‖}_{2} = 1} x^{T} Q_{S S}^{n} x \\ & = \max_{{‖ x ‖}_{2} = 1} \{x^{T} Q_{S S}^{*} x + x^{T} (Q_{S S}^{n} - Q_{S S}^{*}) x\} \\ & \leq y^{T} Q_{S S}^{*} y + y^{T} (Q_{S S}^{n} - Q_{S S}^{*}) y, \end{matrix}

where y is unit-norm maximal eigenvector of $Q_{S S}^{*}$ . Therefore,

Λ_{\max} (Q_{S S}^{n}) \leq Λ_{\max} (Q_{S S}^{*}) + {∣ ‖ Q_{S S}^{n} - Q_{S S}^{*} ‖ ∣}_{2},

and thus,

P (Λ_{\max} (Q_{S S}^{n}) \geq C_{\max} + δ) \leq \exp (- K \frac{δ^{2} n}{d^{2}} + 2 \log d) .

Reasoning in a similar way, we bound the minimum eigenvalue of $Q_{S S}^{n}$ :

P (Λ_{\min} (Q_{S S}^{n}) \leq C_{\min} - δ) \leq \exp (- K \frac{δ^{2} n}{d^{2}} + 2 \log d)

I. Proof of Lemma 8

We start by decomposing $Q_{S^{c} S}^{n} (α^{*}) {(Q_{S S}^{n} (α^{*}))}^{- 1}$ as follows:

Q_{S^{c} S}^{n} (α^{*}) {(Q_{S S}^{n} (α^{*}))}^{- 1} = A_{1} + A_{2} + A_{3} + A_{4},

where,

\begin{matrix} A_{1} = & Q_{S^{c} S}^{*} [{(Q_{S S}^{n})}^{- 1} - {(Q_{S S}^{*})}^{- 1}], \\ A_{2} = & [Q_{S^{c} S}^{n} - Q_{S^{c} S}^{*}] [{(Q_{S S}^{n})}^{- 1} - {(Q_{S S}^{*})}^{- 1}], \\ A_{3} = & [Q_{S^{c} S}^{n} - Q_{S^{c} S}^{*}] {(Q_{S S}^{*})}^{- 1}, \\ A_{4} = & Q_{S^{c} S}^{*} {(Q_{S S}^{*})}^{- 1}, \end{matrix}

$Q^{*} = Q^{*} (α^{*})$ and $Q^{n} = Q^{n} (α^{*})$ . Now, we bound each term separately. The fourth term, A₄, is the easiest to bound, using simply the incoherence condition:

{∣ ‖ A_{4} ‖ ∣}_{\infty} \leq 1 - ε .

To bound the other terms, we need the following lemma:

Lemma 11

For any δ ≥ 0 and constants K and K′, the following bounds hold:

P [{∣ ‖ Q_{S^{c} S}^{n} - Q_{S^{c} S}^{*} ‖ ∣}_{\infty} \geq δ] \leq 2 \exp (- K \frac{n δ^{2}}{d^{2}} + \log d + \log (p - d))

(28)

P [{∣ ‖ Q_{S S}^{n} - Q_{S S}^{*} ‖ ∣}_{\infty} \geq δ] \leq 2 \exp (- K \frac{n δ^{2}}{d^{2}} + 2 \log d)

(29)

P [{∣ ‖ {(Q_{S S}^{n})}^{- 1} - {(Q_{S S}^{*})}^{- 1} ‖ ∣}_{\infty} \geq δ] \leq 4 \exp (- K \frac{n δ^{2}}{d^{3}} + K^{'} \log d)

(30)

Proof

We start by proving the first confidence bound. By definition of the infinity norm of a matrix, we have:

P [{∣ ‖ Q_{S^{c} S}^{n} - Q_{S^{c} S}^{*} ‖ ∣}_{\infty} \geq δ] = P [\max_{j \in S^{c}} \sum_{k \in S} ∣ z_{j k} ∣ \geq δ] \leq (p - d) P [\sum_{k \in S} ∣ z_{j k} ∣ \geq δ],

where $z_{j k} = {[Q^{n} - Q^{*}]}_{j k}$ and, for the last inequality, we used the union bound and the fact that |S^c| ≤ p − d. Furthermore,

\begin{matrix} P [\sum_{k \in S} ∣ z_{j k} ∣ \geq δ] & \leq P [\exists k \in S : ∣ z_{j k} ∣ \geq δ ∕ d] \\ & \leq d P [∣ z_{j k} ∣ \geq δ ∕ d] . \end{matrix}

Thus,

P [{∣ ‖ Q_{S^{c} S}^{n} - Q_{S^{c} S}^{*} ‖ ∣}_{\infty} \geq δ] \leq (p - d) d P [∣ z_{j k} ∣ \geq δ ∕ d] .

At this point, we can obtain the first confidence bound by using Eq. 26 with β = δ/d in the above equation. The proof of the second confidence bound is very similar and we omit it for brevity. To prove the last confidence bound, we proceed as follows:

\begin{matrix} {∣ ‖ {(Q_{S S}^{n})}^{- 1} - {(Q_{S S}^{*})}^{- 1} ‖ ∣}_{\infty} \\ = {∣ ‖ {(Q_{S S}^{n})}^{- 1} [Q_{S S}^{n} - Q_{S S}^{*}] {(Q_{S S}^{*})}^{- 1} ‖ ∣}_{\infty} \\ \leq \sqrt{d} {∣ ‖ {(Q_{S S}^{n})}^{- 1} [Q_{S S}^{n} - Q_{S S}^{*}] {(Q_{S S}^{*})}^{- 1} ‖ ∣}_{2} \\ \leq \sqrt{d} {∣ ‖ {(Q_{S S}^{n})}^{- 1} ‖ ∣}_{2} {∣ ‖ Q_{S S}^{n} - Q_{S S}^{*} ‖ ∣}_{2} {∣ ‖ {(Q_{S S}^{*})}^{- 1} ‖ ∣}_{2} \\ \leq \frac{\sqrt{d}}{C_{\min}} {∣ ‖ Q_{S S}^{n} - Q_{S S}^{*} ‖ ∣}_{2} {∣ ‖ {(Q_{S S}^{n})}^{- 1} ‖ ∣}_{2} . \end{matrix}

Next, we bound each term of the final expression in the above equation separately. The first term can be bounded using Eq. 27:

\begin{matrix} P [{∣ ‖ Q_{S S}^{n} - Q_{S S}^{*} ‖ ∣}_{2} \geq C_{\min}^{2} δ ∕ 2 \sqrt{d}] \\ \leq 2 \exp (- K \frac{n δ^{2}}{d^{3}} + 2 \log d) . \end{matrix}

The second term can be bounded using Lemma 7:

P [{∣ ‖ {(Q_{S S}^{n})}^{- 1} ‖ ∣}_{2} \geq \frac{2}{C_{\min}}] = P [Λ_{\min} (Q_{S S}^{n}) \leq \frac{C_{\min}}{2}] \leq \exp (- K \frac{n}{d^{2}} + B \log d) .

Then, the third confidence bound follows.

Control of A₁. We start by rewriting the term A₁ as

A_{1} = Q_{S^{c} S}^{*} {(Q_{S S}^{*})}^{- 1} [(Q_{S S}^{*}) - (Q_{S S}^{n})] {(Q_{S S}^{n})}^{- 1},

and further,

{∣ ‖ A_{1} ‖ ∣}_{\infty} \leq {∣ ‖ Q_{S^{c} S}^{*} {(Q_{S S}^{*})}^{- 1} ‖ ∣}_{\infty} \times {∣ ‖ (Q_{S S}^{*}) - (Q_{S S}^{n}) ‖ ∣}_{\infty} {∣ ‖ {(Q_{S S}^{n})}^{- 1} ‖ ∣}_{\infty} .

Next, using the incoherence condition easily yields:

{∣ ‖ A_{1} ‖ ∣}_{\infty} \leq (1 - ε) {∣ ‖ (Q_{S S}^{*}) - (Q_{S S}^{n}) ‖ ∣}_{\infty} \times \sqrt{d} {∣ ‖ {(Q_{S S}^{n})}^{- 1} ‖ ∣}_{2}

Now, we apply Lemma 7 with δ = C_min/2 to have that ${∣ ∣ ∣ {(Q_{S S}^{n})}^{- 1} ∣ ∣ ∣}_{2} \leq \frac{2}{C_{\min}}$ with probability greater than $1 - \exp (- K n ∕ d^{2} + K^{'} \log d)$ , and then use Eq. 30 with $δ = \frac{ε C_{\min}}{12 \sqrt{d}}$ to conclude that

P [{∣ ‖ A_{1} ‖ ∣}_{\infty} \geq \frac{ε}{6}] \leq 2 \exp (- K \frac{n}{d^{3}} + K^{'} \log d) .

Control of A₂. We bound the term A₂ as ${∣ ∣ ∣ A_{2} ∣ ∣ ∣}_{\infty} \leq {∣ ∣ ∣ Q_{S^{c} S}^{n} - Q_{S^{c} S}^{*} ∣ ∣ ∣}_{\infty} {∣ ∣ ∣ {(Q_{S S}^{n})}^{- 1} - {(Q_{S S}^{*})}^{- 1} ∣ ∣ ∣}_{\infty}$ , and then use Eqs. 28 and 30 with $δ = \sqrt{ε ∕ 6}$ to conclude that

P [{∣ ‖ A_{2} ‖ ∣}_{\infty} \geq \frac{ε}{6}] \leq 4 \exp (- K \frac{n}{d^{3}} + \log (p - d) + K^{'} \log p) .

Control of A₃. We bound the term A₃ as

\begin{matrix} {∣ ‖ A_{3} ‖ ∣}_{\infty} & \leq \sqrt{d} {∣ ‖ {(Q_{S S}^{*})}^{- 1} ‖ ∣}_{2} {∣ ‖ Q_{S^{c} S}^{n} - Q_{S^{c} S}^{*} ‖ ∣}_{\infty} \\ & \leq \frac{\sqrt{d}}{C_{\min}} {∣ ‖ Q_{S^{c} S}^{n} - Q_{S^{c} S}^{*} ‖ ∣}_{\infty} . \end{matrix}

We then apply Eq. 28 with $δ = \frac{ε C_{\min}}{6 \sqrt{d}}$ to conclude that

P [{∣ ‖ A_{3} ‖ ∣}_{\infty} \geq \frac{ε}{6}] \leq \exp (- K \frac{n}{d^{3}} + \log (p - d)),

and thus,

P [{∣ ‖ Q_{S^{c} S}^{n} {(Q_{S S}^{n})}^{- 1} ‖ ∣}_{\infty} \geq 1 - \frac{ε}{2}] = O (\exp (- K \frac{n}{d^{3}} + \log p)) .

J. Additional experiments

Parameters

(n, p, d). Figure 5 shows the success probability at inferring the incoming links of nodes on the same type of canonical networks as depicted in Fig. 2. We choose nodes with the same in-degree but different super-neighborhood set sizes p_i and experiment with different scalings β of the number of cascades n = 10βd log p. We set the regularization parameter λ_n as a constant factor of $\sqrt{\log (p) ∕ n}$ as suggested by Theorem 2 and, for each node, we used cascades which contained at least one node in the super-neighborhood of the node under study. We used an exponential transmission model and time window T = 10. As predicted by Theorem 2, very different p values lead to curves that line up with each other quite well.

Figure 5. Success probability vs. number of cascades for different super-neighborhood sizes p_i.

Figure 6 shows the success probability at inferring the incoming links of nodes of a hierarchical Kronecker network with equal super-neighborhood size (p_i = 70) but different in-degrees (d_i), under different scalings β of the number of cascades n = 10 βd log p. We again set the regularization parameter λ_n to a constant multiple of $\sqrt{\log (p) / n}$, as suggested by Theorem 2. We used an exponential transmission model and time window T = 5. As predicted by Theorem 2, in this case, different d values lead to noticeably different curves.

Figure 6. Success probability vs. number of cascades for different in-degrees d_i.
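To make the experimental scaling concrete, the following is a minimal sketch of how the number of cascades and the regularization level could be computed from (β, d, p). The function names and the constant `c` are hypothetical; the text only states that λ_n is a constant multiple of $\sqrt{\log (p) / n}$ and does not report the constant, so `c = 1.0` is a placeholder.

```python
import math


def num_cascades(beta, d, p):
    """Number of cascades n = 10 * beta * d * log(p), rounded up to an integer."""
    return math.ceil(10 * beta * d * math.log(p))


def regularization(n, p, c=1.0):
    """Regularization lambda_n = c * sqrt(log(p) / n).

    c is a hypothetical tuning constant, not a value taken from the paper.
    """
    return c * math.sqrt(math.log(p) / n)


# Sweep the scaling beta for a fixed in-degree and several
# super-neighborhood sizes (illustrative values only).
d = 3
for p in (30, 70, 150):
    for beta in (0.5, 1.0, 2.0):
        n = num_cascades(beta, d, p)
        print(f"p={p:4d} beta={beta:3.1f} n={n:5d} lambda_n={regularization(n, p):.4f}")
```

Plotting success probability against β rather than against n is what allows curves for different p to be compared on a common axis.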

Comparison with NetRate and First-Edge

Figure 7 compares the accuracy of our algorithm, NETRATE and First-Edge against the number of cascades for different types of networks and transmission models. Our method typically outperforms both competing methods. The advantage over First-Edge is especially striking. It may be explained by comparing the sample complexity results for both methods: First-Edge needs O(Nd log N) cascades to achieve a probability of success that approaches 1 at a rate polynomial in the number of cascades, while our method needs O(d³ log N) cascades to achieve a probability of success that approaches 1 at a rate exponential in the number of cascades.
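As a rough illustration of the two sample complexity orders, the sketch below evaluates N d log N and d³ log N with all hidden constants set to 1, which is an assumption made purely for illustration (the O(·) notation does not fix them). The function names are hypothetical. Under this simplification the ratio reduces to N/d², so the ℓ1-regularized bound is the smaller one whenever d² < N.

```python
import math


def first_edge_order(N, d):
    """Order N * d * log(N) quoted for First-Edge (constants set to 1)."""
    return N * d * math.log(N)


def l1_order(N, d):
    """Order d^3 * log(N) quoted for the l1-regularized method (constants set to 1)."""
    return d ** 3 * math.log(N)


def advantage_ratio(N, d):
    """Ratio of the two orders; algebraically equal to N / d^2."""
    return first_edge_order(N, d) / l1_order(N, d)


for N, d in [(1000, 5), (1000, 40), (100000, 5)]:
    print(f"N={N:6d} d={d:2d} ratio={advantage_ratio(N, d):.3f}")
```

The comparison ignores the different convergence rates (polynomial versus exponential in the number of cascades) mentioned above, which further favor the ℓ1-regularized method.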

Algorithm 1. $ℓ_{1}$-regularized network inference

Require: Cⁿ, λ_n, K, L
for all $i \in V$ do
    k = 0
    while k < K do
        $α_{i}^{k + 1} = (α_{i}^{k} - L \nabla_{α_{i}} ℓ^{n} (α_{i}^{k}) - λ_{n} L)_{+}$
        k = k + 1
    end while
    $\hat{α}_{i} = α_{i}^{K - 1}$
end for
return $\{\hat{α}_{i}\}_{i \in V}$
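The inner loop of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The gradient is passed in as a callable `grad`, the parameter `step` plays the role of L in the update, the starting point is assumed to be the zero vector (Algorithm 1 does not specify it), and the function returns the last computed iterate. The demonstration uses a toy quadratic loss as a stand-in for the cascade log-likelihood $ℓ^{n}$, which is not reproduced here.

```python
import numpy as np


def soft_threshold_inference(grad, dim, lam, step, num_iters):
    """Iterate alpha <- (alpha - step * grad(alpha) - lam * step)_+ .

    grad      : callable returning the gradient of the smooth loss at alpha
    dim       : number of candidate parents of the node
    lam       : regularization parameter lambda_n
    step      : step size (the quantity written L in Algorithm 1)
    num_iters : number of iterations K
    """
    alpha = np.zeros(dim)  # assumed starting point
    for _ in range(num_iters):
        # Gradient step, shift by lam * step, then clip at zero:
        # a soft-thresholding step restricted to non-negative rates.
        alpha = np.maximum(alpha - step * grad(alpha) - lam * step, 0.0)
    return alpha


# Toy stand-in problem (NOT the diffusion likelihood): noiseless least
# squares with a sparse, non-negative ground truth.
rng = np.random.default_rng(0)
n, dim = 200, 10
X = rng.standard_normal((n, dim))
alpha_true = np.zeros(dim)
alpha_true[[1, 4]] = [1.0, 2.0]
y = X @ alpha_true


def quad_grad(alpha):
    """Gradient of (1 / 2n) * ||X alpha - y||^2."""
    return X.T @ (X @ alpha - y) / n


# Step size 1 / (Lipschitz constant of the gradient) for the toy loss.
step = n / np.linalg.norm(X, 2) ** 2
alpha_hat = soft_threshold_inference(quad_grad, dim, lam=0.05, step=step, num_iters=500)
support = np.flatnonzero(alpha_hat)
print("estimated support:", support.tolist())
```

Because the clipping sets coordinates exactly to zero, the support of the returned vector can be read off directly as the inferred parent set. In the network inference setting, one such problem would be solved independently for each node i.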


Acknowledgement

This research was supported in part by NSF/NIH BIG-DATA 1R01GM108341-01, NSF IIS1116886, and a Raytheon faculty fellowship to L. Song.

Footnotes

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32.

Contributor Information

Hadi Daneshmand, Email: HADI.DANESHMAND@TUE.MPG.DE.

Manuel Gomez-Rodriguez, Email: MANUELGR@TUE.MPG.DE.

Le Song, Email: LSONG@CC.GATECH.EDU.

Bernhard Schölkopf, Email: BS@TUE.MPG.DE.

References

Abrahao B, Chierichetti F, Kleinberg R, Panconesi A. Trace complexity of network inference. KDD. 2013.
Adar E, Adamic LA. Tracking Information Epidemics in Blogspace. Web Intelligence. 2005:207–214.
Barabási A-L, Albert R. Emergence of Scaling in Random Networks. Science. 1999;286:509–512. doi: 10.1126/science.286.5439.509.
Beck A, Teboulle M. Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications. 2009.
Boyd SP, Vandenberghe L. Convex optimization. Cambridge University Press; 2004.
Du N, Song L, Smola A, Yuan M. Learning Networks of Heterogeneous Influence. NIPS. 2012a.
Du N, Song L, Woo H, Zha H. Uncover Topic-Sensitive Information Diffusion Networks. AISTATS. 2012b.
Gomez-Rodriguez M, Leskovec J, Krause A. Inferring Networks of Diffusion and Influence. KDD. 2010.
Gomez-Rodriguez M, Balduzzi D, Schölkopf B. Uncovering the Temporal Dynamics of Diffusion Networks. ICML. 2011.
Gomez-Rodriguez M. Ph.D. Thesis. Stanford University & MPI for Intelligent Systems; 2013.
Gripon V, Rabbat M. Reconstructing a graph from path traces. arXiv. 2013;1301.6916.
Kempe D, Kleinberg JM, Tardos É. Maximizing the Spread of Influence Through a Social Network. KDD. 2003.
Leskovec J, Chakrabarti D, Kleinberg J, Faloutsos C, Ghahramani Z. Kronecker Graphs: An Approach to Modeling Networks. JMLR. 2010.
Mangasarian OL. A simple characterization of solution sets of convex programs. Operations Research Letters. 1988;7(1):21–26.
Netrapalli P, Sanghavi S. Finding the Graph of Epidemic Cascades. ACM SIGMETRICS. 2012.
Newey WK, McFadden DL. Large Sample Estimation and Hypothesis Testing. Handbook of Econometrics. 1994;4:2111–2245.
Parikh N, Boyd S. Proximal algorithms. Foundations and Trends in Optimization. 2013.
Ravikumar P, Wainwright MJ, Lafferty JD. High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics. 2010;38(3):1287–1319.
Rogers EM. Diffusion of Innovations. Fourth edition. Free Press; New York: 1995.
Saito K, Kimura M, Ohara K, Motoda H. Learning continuous-time information diffusion model for social behavioral data analysis. Advances in Machine Learning. 2009:322–337.
Snowsill T, Fyson N, De Bie T, Cristianini N. Refining Causality: Who Copied From Whom? KDD. 2011.
Wainwright MJ. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory. 2009;55(5):2183–2202.
Wang L, Ermon S, Hopcroft J. Feature-enhanced probabilistic models for diffusion network inference. ECML PKDD. 2012.

