
Robust Asynchronous Stochastic Gradient-Push: Asymptotically Optimal and Network-Independent Performance for Strongly Convex Functions

Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis

Abstract

We consider the standard model of distributed optimization of a sum of functions $F(z) = \sum_{i=1}^{n} f_i(z)$, where node i in a network holds the function $f_i(z)$. We allow for a harsh network model characterized by asynchronous updates, message delays, unpredictable message losses, and directed communication among nodes. In this setting, we analyze a modification of the Gradient-Push method for distributed optimization, assuming that (i) node i is capable of generating gradients of its function $f_i(z)$ corrupted by zero-mean bounded-support additive noise at each step, (ii) F(z) is strongly convex, and (iii) each $f_i(z)$ has Lipschitz gradients. We show that our proposed method asymptotically performs as well as the best bounds on centralized gradient descent that takes steps in the direction of the sum of the noisy gradients of all the functions $f_1(z), \dots, f_n(z)$ at each step.

Keywords: distributed optimization, stochastic gradient descent

1. Introduction

Distributed systems have attracted much attention in recent years due to their many applications such as large scale machine learning (e.g., in the healthcare domain, Brisimi et al., 2018), control (e.g., maneuvering of autonomous vehicles, Peng et al., 2017), sensor networks (e.g., coverage control, He et al., 2015) and advantages over centralized systems, such as scalability and robustness to faults. In a network comprised of multiple agents (e.g., data centers, sensors, vehicles, smart phones, or various IoT devices) engaged in data collection, it is sometimes impractical to collect all the information in one place. Consequently, distributed optimization techniques are currently being explored for potential use in a variety of estimation and learning problems over networks.

This paper considers the separable optimization problem

$$\min_{z\in\mathbb{R}^d} F(z) \triangleq \sum_{i=1}^{n} f_i(z), \tag{1}$$

where the function $f_i:\mathbb{R}^d\to\mathbb{R}$ is held only by agent i in the network. We assume the agents communicate through a directed communication network, with each agent able to send messages to its out-neighbors. The agents seek to collaboratively agree on a minimizer to the global function F(z).

This fairly simple problem formulation is capable of capturing a variety of scenarios in estimation and learning. Informally, z is often taken to parameterize a model, and $f_i(z)$ is a loss function measuring how well z matches the data held by agent i. Agreeing on a minimizer of F(z) means agreeing on a model that best explains all the data throughout the network; the challenge is to do this in a distributed manner, avoiding techniques such as flooding, which requires every node to learn and store all the data throughout the network. For more details, we refer the reader to the recent survey by Nedic et al. (2018).
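To make the formulation concrete, here is a minimal illustrative instance of (1), not taken from the paper: a distributed least-squares problem in which agent i privately holds data $(A_i, b_i)$ and a quadratic local loss.

import numpy as np

# Hypothetical instance of problem (1): distributed least squares.
# Agent i holds private data (A_i, b_i) and the local loss
#   f_i(z) = 0.5 * ||A_i z - b_i||^2,
# so F(z) = sum_i f_i(z); each f_i is known only to agent i.
rng = np.random.default_rng(0)
n, d = 5, 3
A = [rng.standard_normal((10, d)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]

def f(i, z):          # local objective held only by agent i
    r = A[i] @ z - b[i]
    return 0.5 * r @ r

def grad_f(i, z):     # local gradient, the quantity agent i can query
    return A[i].T @ (A[i] @ z - b[i])

# Centralized minimizer of F, shown for reference only; no single agent
# can form it, since the data are scattered across the network.
z_star = np.linalg.solve(sum(Ai.T @ Ai for Ai in A),
                         sum(Ai.T @ bi for Ai, bi in zip(A, b)))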

In this work, we will consider a fairly harsh network environment, including message losses, delays, asynchronous updates, and directed communication. The function F(z) will be assumed to be strongly convex with the individual functions fi(z) having a Lipschitz continuous gradient. We will also assume that, at every time step, node i can obtain a noisy gradient of its function fi(z). Our goal will be to investigate to what extent distributed methods can remain competitive with their centralized counterparts in spite of these obstacles.

1.1. Literature Review

Research on models of distributed optimization dates back to the 1980s, see Tsitsiklis et al. (1986). The separable model of (1) was first formally analyzed in Nedic and Ozdaglar (2009), where performance guarantees on a fixed-stepsize subgradient method were obtained. The literature on the subject has exploded since, and we review here only the papers closely related to our work. We begin by discussing works that have focused on the effect of harsh network conditions.

A number of recent papers have studied asynchronicity in the context of distributed optimization. It has been noted that asynchronous algorithms are often preferred to synchronous ones, due to the difficulty of perfectly coordinating all the agents in the network, e.g., due to clock drift. Papers by Recht et al. (2011); Li et al. (2014); Agarwal and Duchi (2011); Lian et al. (2015) and Feyzmahdavian et al. (2016) study asynchronous parallel optimization methods in which different processors have access to a shared memory or parameter server. Recht et al. (2011) present a scheme called HOGWILD!, in which processors have access to the same shared memory with the possibility of overwriting each other’s work. Li et al. (2014) propose a parameter server framework for distributed machine learning. Agarwal and Duchi (2011) analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information due to asynchrony. Lian et al. (2015) improve on the earlier work by Agarwal and Duchi (2011), and study two asynchronous parallel implementations of Stochastic Gradient (SG) for nonconvex optimization, establishing an $O_k(1/\sqrt{k})$ convergence rate for both algorithms. Feyzmahdavian et al. (2016) propose an asynchronous mini-batch algorithm that eliminates idle waiting and allows workers to run at their maximal update rates.

The works mentioned above consider a centralized network topology, i.e., there is a central node (parameter server or shared memory) connected to all the other nodes. On the other hand, in a decentralized setting, nodes communicate with each other over a connected network without depending on a central node (see Figure 1). This setting reduces the communication load on the central node, is not vulnerable to failures of that node, and is more easily scalable.

Figure 1: Different network topologies.

For analysis of how decentralized asynchronous methods perform we refer the reader to Mansoori and Wei (2017); Tsitsiklis et al. (1986); Srivastava and Nedic (2011); Assran and Rabbat (2018); Nedic (2011); Wu et al. (2018) and Tian et al. (2018). We note that of these works only Tian et al. (2018) is able to obtain an algorithm which agrees on a global minimizer of (1) with non-random asynchronicity, under the assumptions of strong convexity, noiseless gradients and possible delays. On the other hand, the papers Nedic (2011) and Wu et al. (2018) obtain convergence in this situation under assumptions of natural randomness in the algorithm: the former assumes randomly failing links while the latter assumes that nodes make updates in random order.

The study of distributed separable optimization over directed graphs was initiated in Tsianos et al. (2012b), where a distributed approach based on dual averaging with convex functions over a fixed graph was proposed and shown to converge at an $O_k(1/\sqrt{k})$ rate. Some numerical results for such methods were reported in Tsianos et al. (2012a). In Nedic and Olshevsky (2015), a method based on plain gradient descent converging at a rate of $O_k((\ln k)/\sqrt{k})$ was proposed over time-varying graphs. This was improved in Nedic and Olshevsky (2016) to $O_k((\ln k)/k)$ for strongly convex functions with noisy gradient samples. More recent works on optimization over directed graphs are Akbari et al. (2017), which considered online convex optimization in this setting, and Assran and Rabbat (2018), which considered combining directed graphs with delays and asynchronicity. The main tool for distributed optimization is the so-called “push-sum” method introduced in Kempe et al. (2003), which is widely used to design communication and optimization schemes over directed graphs. More recent references are Bénézit et al. (2010); Hadjicostis et al. (2016), which provide a more modern and general analysis of this method, and the most comprehensive reference on the subject is the recent monograph by Hadjicostis et al. (2018). We also mention Xi and Khan (2017a); Xi et al. (2018); Nedic et al. (2017), where an approach based on push-sum was explored. A parallel line of work in this setting based on the ADMM model, where updates are allowed to include a local minimization step, was explored in Brisimi et al. (2018); Chang et al. (2016a,b) and Hong (2017).

The reason directed graphs present a problem is because much of distributed optimization relies on the primitive of “multiplication by a doubly stochastic matrix:” given that each node i of a network holds a number $x_i$, the network needs to compute $y = Wx$, with node i ending up with $y_i$, where x = (x_1, …, x_n), y = (y_1, …, y_n) and W is some doubly stochastic matrix with positive spectral gap. This is pretty easy to accomplish over undirected graphs (see Nedic et al., 2018) but not immediate over directed graphs. A parallel line of research focuses on distributed methods for constructing such doubly stochastic matrices over directed graphs – we refer the reader to Dominguez-Garcia and Hadjicostis (2013); Gharesifard and Cortés (2012); Domínguez-García and Hadjicostis (2014). Unfortunately, to the authors’ best knowledge, no explicit and favorable convergence time guarantees are known for this procedure. Another line of work (Xi and Khan, 2017b) takes a similar approach, based on construction of a doubly stochastic matrix with positive spectral gap after the introduction of auxiliary states. Among works with undirected graphs, Scaman et al. (2017) derived the optimal convergence rates for smooth and strongly convex functions and introduced the multi-step dual accelerated (MSDA) algorithm with optimal linear convergence rate in the deterministic case.

Dealing with message losses has always been a challenging problem for multi-agent optimization protocols. Recently, Hadjicostis et al. (2016) resolved this issue rather elegantly for the problem of distributed average computation by having nodes exchange certain running sums. It was shown in Hadjicostis et al. (2016) that the introduction of these running sums is equivalent to a lossless algorithm on a slightly modified graph. We also refer the reader to the follow-up papers by Su and Vaidya (2016b,a, 2017). We will use the same approach in this work to deal with message losses.

In many applications, calculating the exact gradients can be computationally very expensive or impossible (Lan et al., 2018). In one possible scenario, nodes are sensors that collect measurements at every step, which naturally corrupts all the data with noise. Alternatively, communication between agents may insert noise into information transmitted between them. Finally, when $f_i(z)$ measures the fit of a model parameterized by the vector z to the data of agent i, it may be efficient for agent i to randomly select a subset of its data and compute an estimate of the gradient based on only those data points (Alpcan and Bauckhage, 2009). Motivated by these considerations, a literature has arisen studying the effects of stochasticity in the gradients. For example, Srivastava and Nedic (2011) showed convergence of an asynchronous algorithm for constrained distributed stochastic optimization, under the presence of local noisy communication in a random communication network. In Pu and Nedic (2018), two distributed stochastic gradient methods were introduced, and their convergence to a neighborhood of the global minimum (under constant step-size) and to the global minimum (under diminishing step-size) was analyzed. In work by Sirb and Ye (2016), convergence of asynchronous decentralized optimization using delayed stochastic gradients has been shown.

The algorithms we will study here for stochastic gradient descent are based on the standard “consensus + gradient descent” framework: nodes will take steps in the direction of their gradients and then “reconcile” these steps by moving toward an average of their neighbors in the graph. We refer the reader to Nedic et al. (2018) and Yuan et al. (2016) for a more recent and simplified analysis of such methods. It is also possible to take a more modern approach, pioneered in Shi et al. (2015), of using the past history to make updates; such schemes have been shown to achieve superior performance in recent years (see Shi et al., 2015; Sun et al., 2016; Oreshkin et al., 2010; Nedic et al., 2017; Xi and Khan, 2017a; Xi et al., 2018; Qu and Li, 2017; Xu et al., 2015; Qu and Li, 2019; Di Lorenzo and Scutari, 2016); we refer the reader to Pu and Nedic (2018), which took this approach.

One of our main concerns in this paper is to develop decentralized optimization methods which perform as well as their centralized counterparts. Specifically, we will compare the performance of a distributed method for (1) on a network of n nodes with the performance of a centralized method which, at every step, can query all n gradients of the functions $f_1(z), \dots, f_n(z)$. Since the distributed algorithm gets noise-corrupted gradients, so should the centralized method. Thus, the natural approach is to compare the distributed method to centralized gradient descent which moves in the direction of the sum of the gradients of $f_1(z), \dots, f_n(z)$. This method of comparison keeps the “computational power” of the two methods identical.

Traditionally, the bounds derived on distributed methods were considerably worse than those derived for centralized methods. For example, the papers by Nedic and Olshevsky (2015, 2016) had bounds for distributed optimization over directed graphs that were worse than the comparable centralized method (in terms of rate of error decay) by a multiplicative factor that, in the worst case, could be as large as $n^{O(n)}$. This is typical over directed graphs, though better results are possible over undirected graphs. For example, in Olshevsky (2017), in the model of noiseless, undelayed, synchronous communication over an undirected graph, a distributed subgradient method was proposed whose performance, relative to a centralized method with the same computational power, was worse by a multiplicative factor of n.

The breakthrough papers by Chen and Sayed (2015); Pu and Garcia (2017); Morral et al. (2017) were the first to address this gap. These papers studied the model where gradients are corrupted by noise, which we also consider in this paper. Chen and Sayed (2015) examined the mean-squared stability and convergence of distributed strategies with fixed step-size over graphs and showed the same performance level as that of a centralized strategy, in the small step-size regime. In Pu and Garcia (2017) it was shown that, for a certain stochastic differential equation paralleling network gradient descent, the performance of centralized and distributed methods were comparable. In Morral et al. (2017), it was proved, for the first time, that distributed gradient descent with an appropriately chosen step-size asymptotically performs similarly to a centralized method that takes steps in the direction of the sum of the noisy gradients, assuming iterates will remain bounded almost surely. This was the first analysis of a decentralized method for computing the optimal solution with performance bounds matching its centralized counterpart.

Both Pu and Garcia (2017) and Morral et al. (2017) were over fixed, undirected graphs with no message loss or delays or asynchronicity. As shown in the paper by Morral et al. (2012), this turns out to be a natural consequence of the analysis of those methods. Indeed, on a technical level, the advantage of working over undirected graphs is that they allow for easy distributed multiplication by doubly-stochastic matrices; it was shown in Morral et al. (2012) that if this property holds only in expectation – that is, if the network nodes can multiply by random stochastic matrices that are only doubly stochastic in expectation – distributed gradient descent will not perform comparably to its centralized counterpart.

In parallel to this work, and in order to reduce communication bottlenecks, Koloskova et al. (2019) propose a decentralized SGD with communication compression that can achieve the centralized baseline convergence rate, up to a constant factor. When the objective functions are smooth but not necessarily convex, Lian et al. (2017) show that Decentralized Parallel Stochastic Gradient Descent (D-PSGD) can asymptotically perform comparably to Centralized PSGD in total computational complexity. However, they argue that D-PSGD requires much less communication cost on the busiest node and hence can outperform C-PSGD in certain communication regimes. Again, both Koloskova et al. (2019) and Lian et al. (2017) are over fixed undirected graphs, without delays, link failures or asynchronicity. The follow-up work by Lian et al. (2018) extends D-PSGD to the asynchronous case.

1.2. Our Contribution

We propose an algorithm which we call Robust Asynchronous Stochastic Gradient Push (RASGP) for distributed optimization from noisy gradient samples over directed graphs with message losses, delays, and asynchronous updates. We will assume gradients are corrupted with additive noise represented by independent random variables, with bounded support, and with finite variance at node i denoted by $\sigma_i^2$. Our main result is that the RASGP performs as well as the best bounds on centralized gradient descent that moves in the direction of the sum of noisy gradients of $f_1(z), \dots, f_n(z)$. Our results also hold if the underlying graphs are time-varying as long as there are no message losses. We give a brief technical overview of this result next.

We will assume that each function $f_i(z)$ is $\mu_i$-strongly convex with $L_i$-Lipschitz gradient, where $\sum_i \mu_i > 0$ and $L_i > 0$, i = 1, …, n. The RASGP will have every node maintain an estimate of the optimal solution which will be updated from iteration to iteration; we will use $z_i(k)$ to denote the value of this estimate held by node i at iteration k. We will show that, for each node i = 1, …, n,

$$\mathbb{E}\left[\|z_i(k) - z^*\|_2^2\right] = \frac{\Gamma_u\sum_{i=1}^{n}\sigma_i^2}{k\left(\sum_{i=1}^{n}\mu_i\right)^2} + O_k\!\left(\frac{1}{k^{1.5}}\right), \tag{2}$$

where z* ≔ arg min F(z) and $\Gamma_u$ is the degree of asynchronicity, defined as the maximum number of iterations between two consecutive updates of any agent. The leading term matches the best bounds for (centralized) gradient descent that takes steps in the direction of the sum of the noisy gradients of $f_1(z), \dots, f_n(z)$ every $\Gamma_u$ iterations (see Nemirovski et al., 2009; Rakhlin et al., 2012). Asymptotically, the performance of the RASGP is network independent: indeed, the only effect of the network or the number of nodes is on the constant factor within the $O_k(1/k^{1.5})$ term above. The asymptotic scaling as $O_k(1/k)$ is optimal in this setting (Rakhlin et al., 2012).

Consider the case when all the functions are identical, i.e., f_1(z) = ⋯ = f_n(z), and $\Gamma_u$ = 1. In this case, letting $\mu = \mu_i$ and $\sigma = \sigma_i$, we have that for each i = 1, …, n, (2) reduces to

$$\mathbb{E}\left[\|z_i(k) - z^*\|_2^2\right] = \frac{\sigma^2/n}{k\mu^2} + O_k\!\left(\frac{1}{k^{1.5}}\right).$$

In other words, asymptotically we get the variance reduction of a centralized method that simply averages the n noisy gradients at each step.
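This variance-reduction factor is just the familiar effect of averaging n independent noisy quantities; a quick numeric sanity check of that arithmetic (illustrative, not from the paper) follows.

import numpy as np

# Averaging n independent zero-mean noise vectors of variance sigma^2
# reduces the variance to sigma^2 / n -- the factor appearing above.
rng = np.random.default_rng(1)
n, sigma, trials = 10, 2.0, 200_000
noise = rng.normal(0.0, sigma, size=(trials, n))
avg = noise.mean(axis=1)
print(avg.var())          # ~ sigma^2 / n = 0.4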

The implication of this result is that one can get the benefit of having n independent processors computing noisy gradients in spite of all the usual problems associated with communications over a network (i.e., message losses, latency, asynchronous updates, one-way communication). Of course, the caveat is that one must wait sufficiently long for the asymptotic decay to “kick in,” i.e., for the second term on the right-hand side of (2) to become negligible compared to the first. We leave the analysis of the size of this transient period to future work and note here that it will depend on the network and the number of nodes.

The RASGP is a variation on the usual distributed gradient descent where nodes mix consensus steps with steps in the direction of their own gradient, combined with a new step-size trick to deal with asynchrony. It is presented as Algorithm 3 in Section 3. For a formal statement of the results presented above, we refer the reader to Theorem 15 in the body of the paper.

We briefly mention two caveats. The first is that implementation of the RASGP requires each node to use the quantity $\frac{1}{n}\sum_{i=1}^{n}\mu_i$ in setting its local stepsize. This is not a problem in the setting when all functions are the same but, otherwise, $\frac{1}{n}\sum_{i=1}^{n}\mu_i$ is a global quantity not immediately available to each node. Assuming that node i knows $\mu_i$, one possibility is to use average consensus to compute this quantity in a distributed manner before running the RASGP (for example, using the algorithm described in Section 2 of this paper). The second caveat is that, like all algorithms based on the push-sum method, the RASGP requires each node to know its out-degree in the communication graph.

1.3. Organization of This Paper

We conclude this Introduction with Section 1.4, which describes the basic notation we will use throughout the remainder of the paper. Section 2 does not deal directly with the distributed optimization problem we have discussed, but rather introduces the problem of computing the average in the fairly harsh network setting we will consider in this paper. This is an intermediate problem we need to analyze on the way to our main result. Section 3 provides the RASGP algorithm for distributed optimization, and then states and proves our main result, namely the asymptotically network-independent and optimal convergence rate. Results from numerical simulations of our algorithm to illustrate its performance are provided in Section 4, followed by conclusions in Section 5.

1.4. Notations and Definitions

We assume there are n agents $\mathcal{V} = \{1, \dots, n\}$, communicating through a fixed directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{E}$ is the set of directed arcs. We assume $\mathcal{G}$ does not have self-loops and is strongly connected.

For a matrix A, we will use $A_{ij}$ to denote its (i,j)th entry. Similarly, $v_i$ and $[v]_i$ will denote the ith entry of a vector v. A matrix is called stochastic if it is non-negative and the elements of each row sum to one. A matrix is column stochastic if its transpose is stochastic. To a non-negative matrix $A\in\mathbb{R}^{n\times n}$ we associate a directed graph $\mathcal{G}_A$ with vertex set $\mathcal{V}_A = \{1, 2, \dots, n\}$ and edge set $\mathcal{E}_A = \{(i,j) \mid A_{ji} > 0\}$. In general, such a graph might contain self-loops. Intuitively, this graph corresponds to the information flow in the update x(k + 1) = Ax(k); indeed, $(i,j)\in\mathcal{E}_A$ if the jth coordinate of x(k + 1) depends on the ith coordinate of x(k) in this update.

Given a sequence of matrices A(0), A(1), A(2), …, we denote by $A^{k_2:k_1}$, $k_2 \ge k_1$, the product of elements $k_1$ through $k_2$ of the sequence, inclusive, in the following order:

$$A^{k_2:k_1} = A(k_2)A(k_2-1)\cdots A(k_1).$$

Moreover, $A^{k:k} = A(k)$.

Node i is an in-neighbor of node j if there is a directed link from i to j; j is then an out-neighbor of node i. We denote the sets of in-neighbors and out-neighbors of node i by $N_i^-$ and $N_i^+$, respectively. Moreover, we denote the numbers of in-neighbors and out-neighbors of node i, its in-degree and out-degree, by $d_i^-$ and $d_i^+$, respectively.

By $x_{\min}$ and $x_{\max}$ we denote $\min_i x_i$ and $\max_i x_i$, respectively, over all possible indices unless mentioned otherwise. We denote an n × 1 column vector of all ones or all zeros by $\mathbf{1}_n$ and $\mathbf{0}_n$, respectively. We will remove the subscript when the size is clear from the context.

Let $v\in\mathbb{R}^d$ be a vector. We denote by $v^\dagger\in\mathbb{R}^d$ a vector of the same length such that

$$v_i^\dagger = \begin{cases} 1/v_i, & \text{if } v_i \neq 0,\\ 0, & \text{if } v_i = 0.\end{cases}$$

For all the algorithms we describe, we sometimes use the notion of mass to denote the value an agent holds, sends or receives. With that in mind, we can think of a value being sent from one node as a mass being transferred.

We use $\|\cdot\|_p$ to denote the $\ell_p$-norm of a vector. We sometimes drop the subscript when referring to the Euclidean $\ell_2$-norm.

2. Push-Sum with Delays and Link Failures

In this section we introduce the Robust Asynchronous Push-Sum algorithm (RAPS) for distributed average computation and prove its exponential convergence. Convergence results proved for this algorithm will be used later when we turn to distributed optimization. The algorithm relies heavily on ideas from Hadjicostis et al. (2016) to deal with message losses, delays, and asynchrony. The conference version of this paper, Olshevsky et al. (2018), developed RAPS for the delay-free case, and this section may be viewed as an extension of that work.

Pseudocode for the algorithm is given in the box for Algorithm 1. We begin by outlining the operation of the algorithm. Our goal in this section is to compute the average of vectors, one held by each node in the network, in a distributed manner. However, since the RAPS algorithm acts separately in each component, we may, without loss of generality, assume that we want to average scalars rather than vectors. The scalar held by node i will be denoted by xi(0).

Without loss of generality, we define an iteration by discretizing time into time slots indexed by k = 0, 1, 2, …. We assume that during each time slot every agent makes at most one update and processes messages sent in previous time slots.

In the setting of no message losses, no delays, no asynchrony, and a fixed, regular, undirected communication graph, the RAPS can be shown to be equivalent to the much simpler iteration

x(t+1)=Wx(t),

where W is an irreducible, doubly stochastic matrix with positive diagonal; standard Markov chain theory implies that $x_i(t) \to \frac{1}{n}\sum_{i=1}^{n} x_i(0)$ in this setting. RAPS does essentially the same linear update, but with a considerable amount of modifications. In particular, we use the central idea of the classic push-sum method (Kempe et al., 2003) to deal with directed communication, which suggests having a separate update equation for the y-variables that informs us how we should rescale the x-variables; as well as the central idea of Hadjicostis et al. (2018), which is to repeatedly broadcast sums of previous messages to provide robustness against message loss. While the algorithm in Hadjicostis et al. (2018) handles message losses in a synchronous setting, RAPS can handle delays as well as asynchronicity.
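As a point of reference, a small numeric illustration of this baseline iteration follows; the Metropolis-weight construction of W is one standard choice, not prescribed by the paper.

import numpy as np

# x(t+1) = W x(t) on a 4-cycle. Metropolis weights w_ij = 1/(1 + max(d_i, d_j))
# yield a doubly stochastic, irreducible W with positive diagonal.
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
deg = [2] * n
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))       # fill diagonal so every row sums to 1

x = np.array([4.0, 0.0, 1.0, 3.0])      # initial values; average = 2.0
for _ in range(100):
    x = W @ x
print(x)                                 # every entry approaches 2.0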

Before getting into details, let us provide a simple intuition behind the RAPS algorithm. Each agent i holds a value (mass) $x_i$ and $y_i$. At the beginning of every iteration, i wants to split its mass between itself and its out-neighbors $j\in N_i^+$. However, to handle message losses, it sends the accumulated x and y mass from the start of the algorithm (running sums, which we denote by $\phi_i^x$ and $\phi_i^y$) that i wants to transfer to each of its neighbors. Therefore, when a neighbor j receives a new accumulated mass from i, it stores it in $\rho_{ji}^*$ and, by subtracting the previous accumulated mass $\rho_{ji}$ it had received from i, j obtains all the mass that i has been trying to send since its last successful communication. Then, j updates its x and y mass by adding the newly received masses, and finally updates its estimate of the average to x/y. To handle delays and asynchronicity, timestamps $\kappa_i$ are attached to messages outgoing from i.

The pseudocode for the algorithm may appear complicated at first glance; this is because of the considerable complexity required to deal with directed communications, message losses, delays, and asynchrony.

We next describe the algorithm in more detail. First, in the course of executing the algorithm, every agent i maintains scalar variables $x_i$, $y_i$, $z_i$, $\phi_i^x$, $\phi_i^y$, $\kappa_i$, and $\rho_{ij}^x$, $\rho_{ij}^y$, $\rho_{ij}^{x*}$, $\rho_{ij}^{y*}$, $\kappa_{ij}$ for $(j,i)\in\mathcal{E}$. The variables $x_i$ and $y_i$ have the same evolution; however, $y_i$ is initialized to 1. Therefore, to save space in describing and analyzing the algorithm, we will use the symbol θ when a statement holds for both x and y. Similarly, when a statement is the same for both variables x and y, we will remove the superscripts x or y. For example, the initialization $\rho_{ji}(0) = 0$ in the beginning of the algorithm means both $\rho_{ji}^x(0) = 0$ and $\rho_{ji}^y(0) = 0$.

We briefly mention the intuitive meaning of the various variables. The number $z_i$ represents node i’s estimate of the initial average. The counter $\phi_i^\theta(k)$ is the total θ-value sent by i to each of its neighbors from time 0 to k − 1. Similarly, $\rho_{ij}^\theta(k)$ is the total θ-value that i has received from j up to time k − 1. The integer $\kappa_i$ is a timestamp that i attaches to its messages, and the number $\kappa_{ij}$ tracks the latest timestamp i has received from j.

To obtain an intuition for how the algorithm uses the counters $\phi_i^\theta(k)$ and $\rho_{ij}^\theta(k)$, note that, in line 15 of the algorithm, node i effectively figures out the last θ-value sent to it by each of its in-neighbors j by looking at the increment to $\rho_{ij}^\theta$. This might seem needlessly involved, but the underlying reason is that this approach introduces robustness to message losses.

Algorithm 1.

Robust Asynchronous Push-Sum (RAPS)

  1: Initialize the algorithm with y(0) = 1, $\phi_i(0) = 0$, ∀i ∈ {1, …, n} and $\rho_{ij}(0) = 0$, $\rho_{ij}^*(0) = 0$, $\kappa_{ij}(0) = 0$, $(j,i)\in\mathcal{E}$.
  2: At every iteration k = 0, 1, 2, …, for every node i:
  3: if node i wakes up then
  4:   $\kappa_i \leftarrow k$;
  5:   $\phi_i^x \leftarrow \phi_i^x + \frac{x_i}{d_i^+ + 1}$, $\phi_i^y \leftarrow \phi_i^y + \frac{y_i}{d_i^+ + 1}$;
  6:   $x_i \leftarrow \frac{x_i}{d_i^+ + 1}$, $y_i \leftarrow \frac{y_i}{d_i^+ + 1}$;
  7:   Node i broadcasts $(\phi_i^x, \phi_i^y, \kappa_i)$ to its out-neighbors in $N_i^+$.
  8:   Processing the received messages
  9:   for $(\phi_j^x, \phi_j^y, \kappa_j)$ in the inbox do
 10:     if $\kappa_j > \kappa_{ij}$ then
 11:       $\rho_{ij}^{x*} \leftarrow \phi_j^x$, $\rho_{ij}^{y*} \leftarrow \phi_j^y$;
 12:       $\kappa_{ij} \leftarrow \kappa_j$;
 13:     end if
 14:   end for
 15:   $x_i \leftarrow x_i + \sum_{j\in N_i^-}\left(\rho_{ij}^{x*} - \rho_{ij}^x\right)$, $y_i \leftarrow y_i + \sum_{j\in N_i^-}\left(\rho_{ij}^{y*} - \rho_{ij}^y\right)$;
 16:   $\rho_{ij}^x \leftarrow \rho_{ij}^{x*}$, $\rho_{ij}^y \leftarrow \rho_{ij}^{y*}$, $\forall j\in N_i^-$;
 17:   $z_i \leftarrow \frac{x_i}{y_i}$;
 18: end if
 19: Other variables remain unchanged.

We next describe in words what the pseudocode above does. At every iteration k, if agent i wakes up, it performs the following actions. First, it divides its values $x_i$, $y_i$ into $d_i^+ + 1$ parts and broadcasts these to its out-neighbors; actually, what it broadcasts are the accumulated running sums $\phi_i^x$ and $\phi_i^y$. Following Kempe et al. (2003), this is sometimes called the “push step.”

Then, node i moves on to process the messages in its inbox in the following way. If agent i has received a message from node j that is newer than the last one it received before, it will store that message in $\rho_{ij}^*$ and discard the older messages. Next, i updates its x and y variables by adding the difference of $\rho_{ij}^*$ with the older value $\rho_{ij}$, for all in-neighbors j. As mentioned above, this difference is equal to the newly received mass. Next, $\rho_{ij}^*$ overwrites $\rho_{ij}$ in the penultimate step. The last step of the algorithm sets $z_i$ to be the rescaled version of $x_i$: $z_i = x_i/y_i$.
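To make the mechanics concrete, the following is a minimal Python simulation sketch of Algorithm 1 under an assumed probabilistic model of wake-ups, message losses, and delays; all numeric parameters are our illustrative choices, not part of the algorithm.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
n = 4
out_nb = {i: [(i + 1) % n] for i in range(n)}             # directed cycle: strongly connected
in_nb = {j: [i for i in range(n) if j in out_nb[i]] for j in range(n)}

x = np.array([4.0, -1.0, 2.5, 0.5]); target = x.mean()
y = np.ones(n)
z = x.copy()
phi_x, phi_y = np.zeros(n), np.zeros(n)
rho_x = defaultdict(float);  rho_y = defaultdict(float)   # rho_{ij}, keyed by (i, src)
rho_sx = defaultdict(float); rho_sy = defaultdict(float)  # rho*_{ij}
kappa_ij = defaultdict(lambda: -1)
pending = []                      # in-flight messages: (arrival, dest, src, phx, phy, kap)

for k in range(4000):
    ready, still = defaultdict(list), []
    for m in pending:             # messages whose effective delay has elapsed
        (ready[m[1]] if m[0] <= k else still).append(m)
    pending = still
    for i in range(n):
        if rng.random() < 0.4:                            # node i sleeps this slot;
            pending += ready[i]                           # unread mail stays in flight
            continue
        s = len(out_nb[i]) + 1                            # d_i^+ + 1
        phi_x[i] += x[i] / s; phi_y[i] += y[i] / s        # line 5: running sums
        x[i] /= s;            y[i] /= s                   # line 6: keep own share
        for j in out_nb[i]:                               # line 7: broadcast (may be lost)
            if rng.random() > 0.25:
                pending.append((k + int(rng.integers(1, 4)), j, i, phi_x[i], phi_y[i], k))
        for (_, _, src, phx, phy, kap) in ready[i]:       # lines 9-14: newest message wins
            if kap > kappa_ij[(i, src)]:
                rho_sx[(i, src)], rho_sy[(i, src)] = phx, phy
                kappa_ij[(i, src)] = kap
        for src in in_nb[i]:                              # lines 15-16: absorb new mass
            x[i] += rho_sx[(i, src)] - rho_x[(i, src)]
            y[i] += rho_sy[(i, src)] - rho_y[(i, src)]
            rho_x[(i, src)], rho_y[(i, src)] = rho_sx[(i, src)], rho_sy[(i, src)]
        z[i] = x[i] / y[i]                                # line 17: rescaled estimate

print(np.abs(z - target).max())   # -> ~0: every node recovers the initial average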

In the remainder of this section, we provide an analysis of the RAPS algorithm, ultimately showing that it converges geometrically to the average in the presence of message losses, asynchronous updates, delays, and directed communication. Our first step is to formulate the RAPS algorithm in terms of a linear update (i.e., a matrix multiplication), which we do in the next subsection.

2.1. Linear Formulation

Next we show that, after introducing some new auxiliary variables, Algorithm 1 can be written in terms of a classical push-sum algorithm (Kempe et al., 2003) on an augmented graph. Since the y-variables have the same evolution as the x-variables, here we only analyze the x-variables.

In our analysis, we will associate with each message an effective delay. If a message is sent at time $k_1$ and is ready to be processed at time $k_2$, then $k_2 - k_1 \ge 1$ is the effective delay experienced by that message. Those messages that are discarded will not have an effective delay associated with them and are considered as lost.

Next, we will state our assumptions on connectivity, asynchronicity, and message loss.

Assumption 1 Suppose:

  (a) Graph $\mathcal{G}$ is strongly connected and does not have self-loops.

  (b) The delays on each link are bounded above by some $\Gamma_{del} \ge 1$.

  (c) Every agent wakes up and performs updates at least once every $\Gamma_u \ge 1$ iterations.

  (d) Each link fails at most $\Gamma_f \ge 0$ consecutive times.

  (e) Messages arrive in the order of the times they were sent. In other words, if messages are sent from node i to j at times $k_1$ and $k_2$ with (effective) delays $d_1$ and $d_2$, respectively, and $k_1 < k_2$, then we have $k_1 + d_1 < k_2 + d_2$.

One consequence of Assumption 1 is that the effective delays associated with each message that gets through are bounded above by $\Gamma_d \triangleq \Gamma_{del} + \Gamma_u - 1$. Another consequence is that, for each $(i,j)\in\mathcal{E}$, j receives a message from i successfully at least once every $\Gamma_s$ iterations, where

$$\Gamma_s \triangleq \Gamma_u(\Gamma_f + 1) + \Gamma_d - 2. \tag{3}$$
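As a concrete illustration of these constants (the numbers are ours): if $\Gamma_{del} = 3$, $\Gamma_u = 2$ and $\Gamma_f = 1$, then

$$\Gamma_d = \Gamma_{del} + \Gamma_u - 1 = 4, \qquad \Gamma_s = \Gamma_u(\Gamma_f + 1) + \Gamma_d - 2 = 6,$$

i.e., every message that is eventually processed experiences an effective delay of at most 4 slots, and each link delivers a message successfully at least once every 6 iterations.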

Part (e) of Assumption 1 can be assumed without loss of generality. Indeed, observe that outdated messages automatically get discarded in Line 10 of our algorithm. For simplicity, it is convenient to think of those messages as lost. Thus, if this assumption fails in practice, the algorithm will perform exactly as if it had held, due to Line 10. Making this an assumption, rather than a proposition, lets us slightly simplify some of the arguments and avoid some redundancy throughout this paper.

Let us introduce the following indicator variables: $\tau_i(k)$ for i ∈ {1, …, n}, which equals 1 if node i wakes up at time k, and equals 0 otherwise. Similarly, $\tau_{ij}^l(k)$, for $(i,j)\in\mathcal{E}$, 1 ≤ l ≤ $\Gamma_d$, which is 1 if $\tau_i(k) = 1$ and the message sent from node i to j at time k will arrive after experiencing an effective delay of l. Note that if node i wakes up at time k but the message it sends to j is lost, then $\tau_{ij}^l(k)$ will be zero for all l.

We can rewrite the RAPS algorithm with the help of these indicator variables. Let us adopt the notation that xi(k) refers to xi at the beginning of round k of the algorithm (i.e., before node i has a chance to go through the list of steps outlined in the algorithm box). We will use the same convention with all of the other variables, e.g., yi(k), zi(k), etc. If node i does not wake up at round k, then of course xi(k + 1) = xi(k).

Now observe that we can write

$$\phi_i^x(k+1) - \phi_i^x(k) = \tau_i(k)\,\frac{x_i(k)}{d_i^+ + 1}. \tag{4}$$

Likewise, we have

$$x_i(k+1) = x_i(k)\left(1 - \tau_i(k) + \frac{\tau_i(k)}{d_i^+ + 1}\right) + \sum_{j\in N_i^-}\left(\rho_{ij}^x(k+1) - \rho_{ij}^x(k)\right), \tag{5}$$

which can be shown by considering each case ($\tau_i(k)$ = 1 or 0); note that we have used the fact that, in the event that node i wakes up at time k, the variable $\rho_{ij}^x(k+1)$ equals the variable $\rho_{ij}^{x*}$ during the execution of Line 16 of the algorithm at time k.

Finally, we have that, for all $(i,j)\in\mathcal{E}$, the flows $\rho_{ji}^x$ are updated as follows:

$$\rho_{ji}^x(k+1) = \rho_{ji}^x(k) + \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(k-l)\left(\phi_i^x(k+1-l) - \rho_{ji}^x(k)\right), \tag{6}$$

where we make use of the fact that the sum contains only a single nonzero term, since the messages arrive monotonically. To parse the indices in this equation, note that node i actually broadcasts $\phi_i^x(k+1-l)$ in our notation at iteration k − l; by our definitions, $\phi_i^x(k-l)$ is the value of $\phi_i^x$ at the beginning of that iteration. To simplify these relations, we introduce the auxiliary variables $u_{ij}^x$ for all $(i,j)\in\mathcal{E}$, defined through the following recurrence relation:

$$u_{ij}^x(k+1) \triangleq \left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(k)\right)\left(u_{ij}^x(k) + \phi_i^x(k+1) - \phi_i^x(k)\right), \tag{7}$$

and initialized as $u_{ij}^x(0) \triangleq 0$. Intuitively, the variables $u_{ij}^x$ represent the “excess mass” of $x_i$ that is yet to reach node j. Indeed, this quantity resets to zero whenever a message is sent that arrives at some point in the future, and otherwise is incremented by adding the broadcasted mass that is lost. Note that node i never knows $u_{ij}^x(k)$, since it has no idea which messages are lost and which are not; nevertheless, for purposes of analysis, nothing prevents us from considering these variables.

Let us also define the related quantity

$$v_{ij}^x(k) \triangleq u_{ij}^x(k) + \phi_i^x(k+1) - \phi_i^x(k), \quad \text{for } k \ge 0,$$

and $v_{ij}^x(k) \triangleq 0$ for k < 0. Intuitively, this quantity may be thought of as a forward-looking estimate of the mass that will arrive at node j if the message sent from node i at time k gets through; correspondingly, it includes not only the previously unsent mass, but also the extra mass that will be added at the current iteration.

The key variables for the analysis of our method are the variables we will denote by $x_{ij}^l(k)$. Intuitively, every time a message is sent but gets lost, we imagine that it has instead arrived at a “virtual node” which holds that mass; once the next message gets through, we imagine that the virtual node has forwarded that mass to its intended destination. This idea originates from Hadjicostis et al. (2016). Because of the delays, however, we need to introduce $\Gamma_d$ virtual nodes for each link. If a message is sent from i and arrives at j with effective delay l, we will instead imagine it is received by the virtual node $b_{ij}^l$, then sent to $b_{ij}^{l-1}$ at the next time step, and so forth until it reaches $b_{ij}^1$, and is then forwarded to its destination. These virtual nodes are defined formally later.

Putting that intuition aside, we formally define the variables $x_{ij}^l(k)$ via the following set of recurrence relations:

$$x_{ij}^l(k+1) \triangleq \tau_{ij}^l(k)\, v_{ij}^x(k), \qquad l = \Gamma_d, \tag{8}$$
$$x_{ij}^l(k+1) \triangleq \tau_{ij}^l(k)\, v_{ij}^x(k) + x_{ij}^{l+1}(k), \qquad 1 \le l < \Gamma_d, \tag{9}$$

and $x_{ij}^l(k) \triangleq 0$ when k ≤ 0 and l = 1, …, $\Gamma_d$. To parse these equations, imagine what happens when a message is sent from i to j with an effective delay of $\Gamma_d$ at time k. The content of this message becomes the value of $x_{ij}^{\Gamma_d}$ according to (8); and, in each subsequent step, influences $x_{ij}^{\Gamma_d-1}, x_{ij}^{\Gamma_d-2}$, and so forth according to (9). Putting (8) and (9) together, we obtain

$$x_{ij}^l(k) = \sum_{t=1}^{\Gamma_d - l + 1}\tau_{ij}^{t+l-1}(k-t)\, v_{ij}^x(k-t), \tag{10}$$

and particularly,

$$x_{ij}^1(k) = \sum_{t=1}^{\Gamma_d}\tau_{ij}^t(k-t)\, v_{ij}^x(k-t). \tag{11}$$

Note that, as is common in many of the equations we will write, only a single term in these sums can be nonzero (this is not obvious at this point; it is a consequence of Lemma 1).
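To make the bookkeeping concrete, here is a small worked trace (ours, for illustration). Suppose $\Gamma_d = 2$ and a message departs from i to j at time k with effective delay 2, i.e., $\tau_{ij}^2(k) = 1$. Then (8) and (9) give

$$x_{ij}^2(k+1) = v_{ij}^x(k), \qquad x_{ij}^1(k+2) = \tau_{ij}^1(k+1)\,v_{ij}^x(k+1) + x_{ij}^2(k+1) = v_{ij}^x(k),$$

since $\tau_{ij}^1(k+1) = 0$ by Lemma 1(b) below. The mass thus sits in the virtual node $b_{ij}^2$ for one step, moves to $b_{ij}^1$, and is finally absorbed into $x_j(k+3)$ through the $x_{ij}^1(k+2)$ term of the node update (14) below.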

Before proceeding to the main result of this section, we state the following lemma, whose proof is immediate.

Lemma 1 If $\tau_{ij}^l(k) = 1$, the following statements are satisfied:

  (a) $\tau_{ij}^{l'}(k) = 0$ for $l' \neq l$.

  (b) If l > 0, then $\tau_{ij}^{s}(k+t) = 0$ for t = 1, …, l and s = 0, …, l − t.

  (c) If l < $\Gamma_d$, then $\tau_{ij}^{s}(k-t) = 0$ for t = 1, …, $\Gamma_d$ − l and s = l + t, …, $\Gamma_d$.

Lemma 2 If $\tau_{ij}^l(k) = 1$ then $x_{ij}^{l'}(k) = 0$ for l′ > l.

Proof By Lemma 1(c), $\tau_{ij}^{t+l'-1}(k-t) = 0$ for t ∈ {1, …, $\Gamma_d$ − l′ + 1}. Hence, by (10) we have

$$x_{ij}^{l'}(k) = \sum_{t=1}^{\Gamma_d - l' + 1}\tau_{ij}^{t+l'-1}(k-t)\, v_{ij}^x(k-t) = 0. \;\blacksquare$$

The next lemma is essentially a restatement of the observation that the content of every $x_{ij}^l$ eventually “passes through” $x_{ij}^1$.

Lemma 3 If $\tau_{ij}^l(k-l) = 1$, l ≥ 1, we have

$$\sum_{l'=1}^{l} x_{ij}^{l'}(k-l) = \sum_{t=1}^{l} x_{ij}^{1}(k-t).$$

Proof We will show $x_{ij}^1(k-t) = x_{ij}^{l-t+1}(k-l)$ for t = 1, …, l. For t = l the equality is trivial. Now suppose t < l. By Lemma 1(a) we have $\tau_{ij}^{l-t}(k-l) = 0$. Moreover, by part (b) of the same lemma we have $\tau_{ij}^{s'}(k-l+t') = 0$ for t′ = 1, …, l − t − 1 and s′ = l − t − t′. Hence, $x_{ij}^{l-t-t'+1}(k-l+t') = x_{ij}^{l-t-t'}(k-l+t'+1)$. Combining these equations for t′ = 0, …, l − t − 1, we get $x_{ij}^1(k-t) = x_{ij}^{l-t+1}(k-l)$. ■

The following lemma is the key step of a linear formulation of RAPS.

Lemma 4 For k = 0, 1, … and $(i,j)\in\mathcal{E}$ we have:

$$\rho_{ji}^x(k+1) - \rho_{ji}^x(k) = x_{ij}^1(k), \tag{12}$$
$$u_{ij}^x(k+1) + \rho_{ji}^x(k+1) + \sum_{l=1}^{\Gamma_d} x_{ij}^l(k+1) = \phi_i^x(k+1). \tag{13}$$

Parsing these equations, (12) simply states that the value of $x_{ij}^1(k)$ can be thought of as impacting $\rho_{ji}^x$ at time k; recall that the content of $x_{ij}^1(k)$ is a message that was sent from node i to j at time k − l with an effective delay of l, for some 1 ≤ l ≤ $\Gamma_d$ (cf. Equation 11). On the other hand, (13) may be thought of as a “conservation of mass” equation. All the mass that has been sent out by node i has either: (i) been lost (in which case it is in $u_{ij}^x$), (ii) affected node j (in which case it is in $\rho_{ji}^x$), or (iii) is in the process of reaching node j but delayed (in which case it is in some $x_{ij}^l$).

Although this lemma is arguably obvious, a formal proof is surprisingly lengthy. For this reason, we relegate it to the Appendix.

We next write down a matrix form of our updates. As a first step, define the (n + m′) × 1 column vector $\chi(k) \triangleq [x(k)^T, x^1(k)^T, \dots, x^{\Gamma_d}(k)^T, u^x(k)^T]^T$, where $m' \triangleq (\Gamma_d + 1)m$, $m \triangleq |\mathcal{E}|$, x(k) collects all $x_i(k)$, $x^l(k)$ collects all $x_{ij}^l(k)$ and $u^x(k)$ collects all $u_{ij}^x(k)$. Define ψ(k) by collecting the y-values similarly.

Now, we have all the tools to show the linear evolution of χ(k). By Equations (4), (5) and (12) we have,

$$x_j(k+1) = x_j(k)\left(1 - \tau_j(k) + \frac{\tau_j(k)}{d_j^+ + 1}\right) + \sum_{i\in N_j^-} x_{ij}^1(k). \tag{14}$$

Moreover, by the definitions of $x_{ij}^l$, $v_{ij}^x$ and (4), it follows that

$$x_{ij}^{\Gamma_d}(k+1) = \tau_{ij}^{\Gamma_d}(k)\left[u_{ij}^x(k) + \frac{x_i(k)}{d_i^+ + 1}\right], \qquad x_{ij}^l(k+1) = \tau_{ij}^l(k)\left[u_{ij}^x(k) + \frac{x_i(k)}{d_i^+ + 1}\right] + x_{ij}^{l+1}(k), \quad 1 \le l < \Gamma_d. \tag{15}$$

Finally, by (4) and (7) we obtain

$$u_{ij}^x(k+1) = \left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(k)\right)\left(u_{ij}^x(k) + \tau_i(k)\frac{x_i(k)}{d_i^+ + 1}\right). \tag{16}$$

Using (14) to (16) we can write the evolution of χ(k) and ψ(k) in the following linear form:

$$\chi(k+1) = M(k)\chi(k), \qquad \psi(k+1) = M(k)\psi(k), \tag{17}$$

where $M(k)\in\mathbb{R}^{(n+m')\times(n+m')}$ is an appropriately defined matrix.

We have thus completed half of our goal: we have shown how to write RAPS as a linear update. Next, we show that the corresponding matrices are column-stochastic.

Lemma 5 M(k) is column stochastic and its positive elements are at least $1/(\max_i\{d_i^+\} + 1)$. Moreover, for i = 1, …, n, the diagonal entries $M_{ii}(k)$ are positive.

This lemma can be proved “by inspection.” Indeed, M(k) is column stochastic if and only if, for every χ(k), we have $\mathbf{1}^T\chi(k+1) = \mathbf{1}^T\chi(k)$. Thus one just needs to demonstrate that no mass is ever “lost,” i.e., that a decrease/increase in the value of one node is always accompanied by an increase/decrease in the value of another node, which can be done just by inspecting the equations. A formal proof is nonetheless given next.

Proof To show that M(k) is column stochastic, we study how each element of χ(k) influences χ(k + 1).

For i = 1, …, n, the ith column of M(k) represents how $x_i(k)$ influences χ(k + 1). We will use (14) to (16) to find these coefficients.

First, $x_i(k)$ influences $x_i(k+1)$ with the coefficient $1 - \tau_i(k) + \tau_i(k)/(d_i^+ + 1) > 0$. For $j\in N_i^+$, $x_i(k)$ influences $x_{ij}^l(k+1)$ by $\tau_{ij}^l(k)/(d_i^+ + 1)$ and $u_{ij}^x(k+1)$ with coefficient $\left(\tau_i(k) - \sum_{l=1}^{\Gamma_d}\tau_i(k)\tau_{ij}^l(k)\right)/(d_i^+ + 1)$. Summing these coefficients up results in 1.

For l = 2, …, $\Gamma_d$ and $(i,j)\in\mathcal{E}$, $x_{ij}^l(k)$ influences $x_{ij}^{l-1}(k+1)$ with coefficient 1, and $x_{ij}^1(k)$ influences $x_j(k+1)$ with coefficient 1.

Finally, $u_{ij}^x(k)$ influences $x_{ij}^l(k+1)$ with coefficient $\tau_{ij}^l(k)$ and $u_{ij}^x(k+1)$ with $\left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(k)\right)$, which sum up to 1.

Note that all the coefficients above are at least $1/(\max_i\{d_i^+\} + 1)$. ■

An important consequence of this lemma is the sum preservation property, i.e.,

$$\sum_{i=1}^{n+m'}\chi_i(k) = \sum_{i=1}^{n} x_i(0), \qquad \sum_{i=1}^{n+m'}\psi_i(k) = n. \tag{18}$$
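As a sanity check on Lemma 5 and (18), one can build M(k) explicitly from (14)-(16) for a tiny instance and verify column stochasticity numerically; the instance below (n = 2, the single edge (1, 2), $\Gamma_d$ = 2) and its indicator values are our own assumptions for illustration.

import numpy as np

# State order: chi = [x_1, x_2, x_12^1, x_12^2, u_12]  (n = 2, m = 1, m' = 3).
# Assumed indicators this iteration: node 1 wakes (tau_1 = 1), node 2 sleeps
# (tau_2 = 0), and node 1's message departs with effective delay 2
# (tau_12^2 = 1), so the sent mass enters the virtual node b_12^2.
d1 = 1                                    # out-degree of node 1
M = np.zeros((5, 5))
M[0, 0] = 1 - 1 + 1 / (d1 + 1)            # (14): node 1 keeps a 1/(d_1^+ + 1) share
M[1, 1] = 1.0                             # (14): sleeping node 2 keeps its mass...
M[1, 2] = 1.0                             # ...and absorbs whatever b_12^1 forwards
M[3, 0] = 1 / (d1 + 1)                    # (15): sent share enters b_12^2,
M[3, 4] = 1.0                             #       together with any excess mass u_12
M[2, 3] = 1.0                             # (15): b_12^2 hands its content to b_12^1
# (16): u_12(k+1) = 0 because the message got through, so the last row stays zero.

assert np.allclose(M.sum(axis=0), 1.0)            # column stochastic (Lemma 5)
assert (M[np.nonzero(M)] >= 1 / (d1 + 1)).all()   # positive entries >= 1/(max_i d_i^+ + 1)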

For further analysis, we augment the graph $\mathcal{G}$ to $\mathcal{H}(k) \triangleq \mathcal{G}_{M(k)} = (\mathcal{V}_A, \mathcal{E}_A(k))$ by adding the following virtual nodes: $b_{ij}^l$ for l = 1, …, $\Gamma_d$ and $(i,j)\in\mathcal{E}$, which hold the values $x_{ij}^l$ and $y_{ij}^l$; we also add the nodes $c_{ij}$ for $(i,j)\in\mathcal{E}$, which hold the values $u_{ij}^x$ and $u_{ij}^y$.

In $\mathcal{H}(k)$, there is a link from $b_{ij}^l$ to $b_{ij}^{l-1}$ for 1 < l ≤ $\Gamma_d$, and from $b_{ij}^1$ to j, as these nodes forward their values to the next node. Moreover, if $\tau_{ij}^l(k) = 1$ for some 1 ≤ l ≤ $\Gamma_d$, then there is a link from both $c_{ij}$ and i to $b_{ij}^l$.

If $\tau_{ij}^l(k) = 0$ for all 1 ≤ l ≤ $\Gamma_d$, then $c_{ij}$ has a self-loop, and if additionally $\tau_i(k) = 1$, there is a link from i to $c_{ij}$. All non-virtual agents $i\in\mathcal{V}$ have self-loops at all times (see Figure 2).

Figure 2: Augmented graph $\mathcal{H}(k)$ for different scenarios.

Recursions (17) and Lemma 5 may thus be interpreted as showing that the RAPS algorithm can be thought of as a push-sum algorithm over the augmented graph sequence {H(k)}, where each agent (virtual and non-virtual) holds an x-value and a y-value which evolve similarly and in parallel.

2.2. Exponential Convergence

The main result of this section is the exponential convergence of RAPS to the initial average, stated next.

Theorem 6 Suppose Assumption 1 holds. Then RAPS converges exponentially to the initial mean of the agent values, i.e.,

$$\left|z_i(k) - \frac{1}{n}\sum_{i=1}^{n} x_i(0)\right| \le \delta\lambda^k\|x(0)\|_1,$$

where $\delta \triangleq \frac{1}{1 - n\alpha^6}$, $\lambda \triangleq (1 - n\alpha^6)^{1/(2n\Gamma_s)}$ and $\alpha \triangleq (1/n)^{n\Gamma_s}$.

It is worth mentioning that, even though $1/(1-\lambda) = O(n^{p(n)})$ where $p(n) = O(n)$, this is a bound for a worst-case scenario; on average, as can be seen in numerical simulations, RAPS performs better. Moreover, when the graph $\mathcal{G}$ satisfies certain properties, such as regularity, and there are no link delays and failures, we have $1/(1-\lambda) = O(n^3)$ (see Theorem 1 in Nedic and Olshevsky, 2016). More broadly, that paper establishes that 1/(1 − λ) will scale with the mixing rate of the underlying Markov process.
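To give a feel for these magnitudes, one can evaluate the worst-case constants of Theorem 6 numerically (illustrative numbers only; we use log1p/expm1 since 1 − λ underflows plain double-precision arithmetic):

import numpy as np

# Worst-case constants of Theorem 6 for a tiny network: n = 3, Gamma_s = 4.
n, G_s = 3, 4
alpha = (1.0 / n) ** (n * G_s)                        # (1/3)^12 ~ 1.9e-6
one_minus_lam = -np.expm1(np.log1p(-n * alpha**6) / (2 * n * G_s))
print(1.0 / one_minus_lam)                            # ~1.7e35: the worst-case
# rate is astronomically more pessimistic than behavior seen in simulations.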

Unfortunately, this theorem does not follow immediately from standard results on exponential convergence of push-sum. The reason is that the connectivity conditions assumed for such theorems are not satisfied here: there will not always be paths leading to virtual nodes from non-virtual nodes. Nevertheless, with some suitable modifications, the existence of paths from virtual nodes to other virtual nodes is sufficient, as we will show next.

Before proving the theorem, we need the following lemmas and definitions. Given a sequence of graphs $\mathcal{G}_0, \mathcal{G}_1, \mathcal{G}_2, \dots$, we will say node b is reachable from node a in the time period $k_1$ to $k_2$ ($k_1 < k_2$) if there exists a sequence of directed edges $e_{k_1}, e_{k_1+1}, \dots, e_{k_2}$ such that $e_k$ is in $\mathcal{G}_k$, the destination of $e_k$ is the origin of $e_{k+1}$ for $k_1 \le k < k_2$, the origin of $e_{k_1}$ is a, and the destination of $e_{k_2}$ is b.

Our first lemma provides a standard lower bound on the entries of the column-stochastic matrices from (17).

Lemma 7 $M^{k+n\Gamma_s-1:k}$ has positive first n rows, for any k ≥ 0. The positive elements of this matrix are at least

$$\alpha = (1/n)^{n\Gamma_s}.$$

Proof By Lemma 5, each node $j\in\mathcal{V}$ has a self-loop at every iteration in the augmented graph $\mathcal{H}$. Since $\mathcal{G}$ is strongly connected, the set of non-virtual nodes reachable from any node $a_h\in\mathcal{V}_A$ strictly increases every $\Gamma_s$ iterations. Hence, $M^{k+n\Gamma_s-1:k}$ has positive first n rows. Moreover, since all positive elements of M are at least 1/n, the positive elements of $M^{k+n\Gamma_s-1:k}$ are at least $(1/n)^{n\Gamma_s}$. ■

Next, we give a reformulation of the push-sum update that will be key to showing the exponential convergence of the algorithm. The proof is a minor variation of Lemma 4 in Nedic and Olshevsky (2016).

Lemma 8 Consider the vectors $u(k)\in\mathbb{R}^d$, $v(k)\in\mathbb{R}_+^d$, and square matrix $A(k)\in\mathbb{R}_+^{d\times d}$, for k ≥ 0, such that

$$u(k+1) = A(k)u(k), \qquad v(k+1) = A(k)v(k). \tag{19}$$

Also suppose $u_i(k) = 0$ if $v_i(k) = 0$, ∀k, i. Define $v^\dagger(k)\in\mathbb{R}^d$ as:

$$v_i^\dagger(k) \triangleq \begin{cases} 1/v_i(k), & \text{if } v_i(k) \neq 0,\\ 0, & \text{if } v_i(k) = 0.\end{cases}$$

Define $r(k) \triangleq u(k)\circ v^\dagger(k)$, where ∘ denotes the element-wise product of two vectors. Then we have

$$r(k+1) = B(k)\,r(k),$$

where $B(k)\in\mathbb{R}_+^{d\times d}$ is defined as

$$B(k) \triangleq \operatorname{diag}\left(v^\dagger(k+1)\right)A(k)\operatorname{diag}\left(v(k)\right).$$

Proof Since $u_i(k) = 0$ if $v_i(k) = 0$, the identity $u_i(k) = r_i(k)v_i(k)$ holds for all i, k. Substituting in (19) we obtain

$$r_i(k+1)\,v_i(k+1) = \sum_{j=1}^{d} A_{ij}(k)\,r_j(k)\,v_j(k).$$

Since, by definition, $r_i(k) = 0$ if $v_i(k) = 0$, ∀k, i, we get

$$r_i(k+1) = v_i^\dagger(k+1)\sum_{j=1}^{d} A_{ij}(k)\,r_j(k)\,v_j(k).$$

Therefore,

$$r(k+1) = \operatorname{diag}\left(v^\dagger(k+1)\right)A(k)\operatorname{diag}\left(v(k)\right)r(k). \;\blacksquare$$
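A quick numeric check of this lemma (ours, for illustration; we take v strictly positive so that $v^\dagger = 1/v$ entrywise):

import numpy as np

# With r(k) = u(k) o v^dagger(k), the ratio vector evolves as r(k+1) = B(k) r(k).
rng = np.random.default_rng(8)
d = 5
A = rng.random((d, d))                     # nonnegative A(k)
v = rng.random(d) + 0.1                    # strictly positive, so v^dagger = 1/v
r = rng.standard_normal(d)
u = r * v                                  # enforces u_i = 0 wherever v_i = 0

u_next, v_next = A @ u, A @ v
B = np.diag(1.0 / v_next) @ A @ np.diag(v)
assert np.allclose(u_next / v_next, B @ r)  # r(k+1) = B(k) r(k)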

Our next corollary, which follows immediately from the previous lemma, characterizes the dichotomy inherent in push-sum with virtual nodes: every row either adds up to one or zero.

Corollary 9 Consider the matrix B(k) defined in Lemma 8. Let us define the index set $J_k \triangleq \{i \mid v_i(k) \neq 0\}$. If $i\notin J_k$, the ith column of B(k) and the ith row of B(k − 1) contain only zero entries. Moreover,

$$B(k)\mathbf{1}_d = \operatorname{diag}\left(v^\dagger(k+1)\right)A(k)\,v(k) = \operatorname{diag}\left(v^\dagger(k+1)\right)v(k+1) = \begin{bmatrix}1 \text{ or } 0\\ \vdots\\ 1 \text{ or } 0\end{bmatrix}.$$

Hence, the ith row of B(k) sums to 1 if and only if $v_i(k+1) \neq 0$, i.e., $i\in J_{k+1}$.

Our next lemma characterizes the relationship between zero entries in the vectors χ(k) and ψ(k).

Lemma 10 $\chi_h(k) = 0$ whenever $\psi_h(k) = 0$, for h = 1, …, n + m′, k ≥ 0.

Proof First we note that $\psi(0) = [\mathbf{1}_n^T, \mathbf{0}_{m'}^T]^T$ and each node $i\in\mathcal{V}$ has a self-loop in the graph $\mathcal{H}(k)$ for all k ≥ 0; hence, $\psi_h(k) \ge 0$ for all h and, in particular, $\psi_i(k) > 0$ for i = 1, …, n. Now suppose h > n corresponds to a virtual agent $a_h\in\mathcal{V}_A$. If $\psi_h(k) = 0$, it means $a_h$ has already sent all its y-value to another node or has not received any y-value yet. In either case, that node has no remaining x-value as well, and $\chi_h(k) = 0$. ■

Let us define $\psi^\dagger(k)\in\mathbb{R}^{n+m'}$, k ≥ 0, by

$$\psi_i^\dagger(k) \triangleq \begin{cases} 1/\psi_i(k), & \text{if } \psi_i(k) \neq 0,\\ 0, & \text{if } \psi_i(k) = 0.\end{cases} \tag{20}$$

Moreover, we define the vector z(k) by setting $z(k) \triangleq \chi(k)\circ\psi^\dagger(k)$. By (17) and Lemma 10, we can use Lemma 8 to obtain

$$z(k+1) = P(k)\,z(k),$$

where $P(k) \triangleq \operatorname{diag}(\psi^\dagger(k+1))\,M(k)\,\operatorname{diag}(\psi(k))$. Let us define

$$I_k \triangleq \{i \mid \psi_i(k) > 0\}.$$

Then, by Corollary 9, each $z_i(k+1)$, $i\in I_{k+1}$, is a convex combination of the $z_j(k)$, $j\in I_k$. Therefore,

$$\max_{i\in I_{k+1}} z_i(k+1) \le \max_{i\in I_k} z_i(k), \qquad \min_{i\in I_{k+1}} z_i(k+1) \ge \min_{i\in I_k} z_i(k). \tag{21}$$

These equations will be key to the analysis of the algorithm. We stress that we have not shown that the quantity $\min_i z_i(k)$ is non-decreasing; rather, we have shown that the related quantity, where the minimum is taken over $I_k$, the set of nonzero entries of ψ(k), is non-decreasing.

Our next lemma provides lower and upper bounds on the entries of the vector ψ(k).

Lemma 11 For k ≥ 0 and 1 ≤ i ≤ n we have:

$$n\alpha \le \psi_i(k) \le n.$$

Moreover, for n + 1 ≤ h ≤ n + m′ and k ≥ 1, we have either $\psi_h(k) = 0$ or

$$n\alpha^2 \le \psi_h(k) \le n.$$

Proof We have

$$\psi(k) = M^{k-1:0}\begin{bmatrix}\mathbf{1}_n\\ \mathbf{0}_{m'}\end{bmatrix}.$$

If k < $n\Gamma_s$, the positive entries of $M^{k-1:0}$ are at least $(1/n)^k$. Hence, the positive entries of ψ(k) are at least

$$\left(\frac{1}{n}\right)^k \ge \left(\frac{1}{n}\right)^{n\Gamma_s - 1} = n\alpha.$$

Now suppose k ≥ $n\Gamma_s$. $M^{k-1:0}$ is the product of $M^{k-1:k-n\Gamma_s}$ and another column stochastic matrix. By Lemma 7, $M^{k-1:k-n\Gamma_s}$ has positive first n rows, with positive entries of at least α. Thus, $M^{k-1:0}$ has positive first n rows, with positive entries of at least α as well. We obtain, for 1 ≤ i ≤ n,

$$\psi_i(k) \ge n\alpha, \quad \text{for } k \ge 1.$$

For n + 1 ≤ h ≤ n + m′, suppose $\psi_h$ corresponds to a virtual node $a_h$ associated with some link $(i,j)\in\mathcal{E}$. If $\psi_h(k)$ is positive, it is carrying a value sent from i at time k − $n\Gamma_s$ or later, which has experienced link failures or delays. This is because each value gets to its destination after at most $\Gamma_s$ iterations. Since i has a self-loop at all times, $a_h$ is reachable from i in the period k − $n\Gamma_s$ to k − 1; hence, $M_{hi}^{k-1:k-n\Gamma_s} \ge \alpha$, and it follows that

$$\psi_h(k) \ge \alpha\,\psi_i(k - n\Gamma_s) \ge n\alpha^2.$$

Also, due to the sum preservation property (18), we have $\psi_h(k) \le n$ for all h and k ≥ 0. ■

Using Lemma 8 again, it follows that

$$z(k + n\Gamma_s) = \hat{P}(k)\,z(k),$$

where

$$\hat{P}(k) \triangleq \operatorname{diag}\left(\psi^\dagger(k+n\Gamma_s)\right)M^{k+n\Gamma_s-1:k}\operatorname{diag}\left(\psi(k)\right). \tag{22}$$

Next, we are able to find a lower bound on the positive elements of $\hat{P}(k)$. The proof of the following corollary is immediate.

Corollary 12 By (22) and Lemma 11 we have:

  1. $\hat{P}_{ij}(k) > 0$ for 1 ≤ i, j ≤ n.

  2. The positive entries of the first n columns of $\hat{P}(k)$ are at least $(1/n)\cdot\alpha\cdot(n\alpha) = \alpha^2$. Similarly, the last m′ columns have positive entries of at least $\alpha^3$.

  3. For h > n, if $h\in I_{k+n\Gamma_s}$ then $\hat{P}_{hi}(k) > 0$ for some 1 ≤ i ≤ n.

Our next lemma, which is the final result we need before proving the exponential convergence rate of RAPS, provides a quantitative bound for how multiplication by the matrix $\hat{P}$ shrinks the range of a vector.

Lemma 13 Let t ≥ 0 and let $\{u(k)\}_{k\ge 0}\subset\mathbb{R}^{n+m'}$ be a sequence of vectors such that

$$u(k+1) = \hat{P}(kn\Gamma_s + t)\,u(k).$$

Define

$$s_t(k) \triangleq \max_{i\in I_{kn\Gamma_s + t}} u_i(k) - \min_{i\in I_{kn\Gamma_s + t}} u_i(k).$$

Then,

$$s_t(k+2) \le (1 - n\alpha^6)\,s_t(k).$$

Proof Let us define

$$r_t(k) \triangleq \max_{1\le i\le n} u_i(k) - \min_{1\le i\le n} u_i(k).$$

By Corollary 12, for $j\in I_{(k+1)n\Gamma_s + t}$ the jth row of $\hat{P}(kn\Gamma_s + t)$ has at least one positive entry in its first n columns. Thus, because $u_j(k+1)$ is maximized/minimized when all of the weight is put on the largest/smallest possible entries of u(k), we have:

$$u_j(k+1) \le \alpha^3\max_{1\le i\le n} u_i(k) + (1 - \alpha^3)\max_{i\in I_{kn\Gamma_s + t}} u_i(k),$$
$$u_j(k+1) \ge \alpha^3\min_{1\le i\le n} u_i(k) + (1 - \alpha^3)\min_{i\in I_{kn\Gamma_s + t}} u_i(k).$$

Therefore,

$$s_t(k+1) \le \alpha^3 r_t(k) + (1 - \alpha^3)s_t(k). \tag{23}$$

Moreover, by a similar argument, for j ≤ n,

$$u_j(k+1) \le \alpha^3\sum_{i=1}^{n} u_i(k) + (1 - n\alpha^3)\max_{i\in I_{kn\Gamma_s + t}} u_i(k),$$
$$u_j(k+1) \ge \alpha^3\sum_{i=1}^{n} u_i(k) + (1 - n\alpha^3)\min_{i\in I_{kn\Gamma_s + t}} u_i(k).$$

Thus,

$$r_t(k+1) \le (1 - n\alpha^3)\,s_t(k).$$

Combining with (23), and noting that $r_t(k) \le s_t(k)$ and $s_t(k+1) \le s_t(k)$, we obtain

$$s_t(k+2) \le \alpha^3 r_t(k+1) + (1 - \alpha^3)s_t(k+1) \le \alpha^3(1 - n\alpha^3)s_t(k) + (1 - \alpha^3)s_t(k) = (1 - n\alpha^6)\,s_t(k). \;\blacksquare$$

Proof of Theorem 6 Using Lemma 13 with t = 0 and $u(k) = z(kn\Gamma_s)$ we get $s_0(k) \le (1 - n\alpha^6)^{\lfloor k/2\rfloor}s_0(0)$ and $\lim_{k\to\infty} s_0(k) = 0$. Moreover, by (21), $z_{\max}(k) \triangleq \max_{i\in I_k} z_i(k)$ is a non-increasing sequence and $z_{\min}(k) \triangleq \min_{i\in I_k} z_i(k)$ is non-decreasing. Thus,

$$\lim_{k\to\infty,\, h\in I_k} z_h(k) = L. \tag{24}$$

We have:

$$L = L\lim_{k\to\infty}\frac{\sum_{i=1}^{n+m'}\psi_i(k)}{n} = \lim_{k\to\infty}\left(\sum_{i=1}^{n+m'}\frac{z_i(k)\psi_i(k)}{n} + \sum_{i=1}^{n+m'}\frac{(L - z_i(k))\psi_i(k)}{n}\right) = \lim_{k\to\infty}\left(\sum_{i=1}^{n+m'}\frac{\chi_i(k)}{n} + \sum_{i=1}^{n+m'}\frac{(L - z_i(k))\psi_i(k)}{n}\right) = \frac{\sum_{i=1}^{n} x_i(0)}{n}.$$

In the above, we used (18) and (24), the boundedness of $\psi_i(k)$, and the fact that $\psi_i(k) = 0$ for $i\notin I_k$.

Finally, to show the exponential convergence rate, we go back to $s_0(k)$. We have, for k ≥ 1,

$$s_0(k) \le (1 - n\alpha^6)^{\lfloor k/2\rfloor}s_0(0) \le (1 - n\alpha^6)^{(k-1)/2}s_0(0), \qquad s_0(0) \le \sum_{i=1}^{n+m'}|z_i(0)| = \sum_{i=1}^{n}|x_i(0)| = \|x(0)\|_1,$$

where the first equality holds because $I_0 = \{1, \dots, n\}$ and $y_i(0) = 1$. Therefore, we have, for $i\in I_k$,

$$\left|z_i(k) - \frac{\mathbf{1}^T x(0)}{n}\right| \le z_{\max}(k) - z_{\min}(k) \le s_0\!\left(\left\lfloor\frac{k}{n\Gamma_s}\right\rfloor\right) \le (1 - n\alpha^6)^{\frac{\lfloor k/(n\Gamma_s)\rfloor - 1}{2}}\|x(0)\|_1 \le (1 - n\alpha^6)^{\frac{k/(n\Gamma_s) - 2}{2}}\|x(0)\|_1 = \frac{1}{1 - n\alpha^6}\left((1 - n\alpha^6)^{\frac{1}{2n\Gamma_s}}\right)^k\|x(0)\|_1 = \delta\lambda^k\|x(0)\|_1,$$

where $\delta = \frac{1}{1 - n\alpha^6}$ and $\lambda = (1 - n\alpha^6)^{1/(2n\Gamma_s)}$. Note that $\{1, \dots, n\}\subseteq I_k$, ∀k. ■

Remark: Observe that our proof did not really use the initialization ψ(0) = 1, except to observe that the elements of ψ(0) are positive and add up to n, and the implication that ψ(k) satisfies the bounds of Lemma 11. In particular, the same result would hold if we viewed time 1 as the initial point of the algorithm (so that ψ(1) is the initialization), or similarly any time k. We will use this observation in the next subsection.

2.3. Perturbed Push-Sum

In this subsection, we begin by introducing the Perturbed Robust Asynchronous Push-Sum algorithm, obtained by adding a perturbation to the x-values of the (non-virtual) agents at the beginning of every iteration at which they wake up.

We show that, if the perturbations are bounded, the resulting z(k) nevertheless tracks the average of χ(k) pretty well. Such a result is a key step towards analyzing distributed optimization protocols. In this general approach to the analyses of distributed optimization methods, we follow Ram et al. (2010) where it was first adopted; see also Nedic and Olshevsky (2016) and Nedic and Olshevsky (2015) where it was used.

Adopting the notation introduced earlier, and using the linear formulation (17), we have

$$\chi(k+1) = M(k)\left(\chi(k) + \Delta(k)\right), \quad \text{for } k \ge 0,$$

Algorithm 2.

Perturbed Robust Asynchronous Push-Sum

1: Initialize the algorithm with y(0) = 1, $\phi_i(0) = 0$, ∀i ∈ {1, …, n} and $\rho_{ij}(0) = 0$, $\kappa_{ij}(0) = 0$, $(j,i)\in\mathcal{E}$, and Δ(0) = 0.
2: At every iteration k = 0, 1, 2, …, for every node i:
3: if node i wakes up then
4:   $x_i \leftarrow x_i + \Delta_i(k)$;
5:   Lines 4 to 17 of Algorithm 1
6: end if
7: Other variables remain unchanged.

where $\Delta(k)\in\mathbb{R}^{n+m'}$ collects all perturbations $\Delta_i(k)$ in a column vector, with $\Delta_h(k) \triangleq 0$ for n < h ≤ n + m′. We may write this in a convenient form as follows:

$$\chi(k+1) = M(k)\left(\chi(k) + \Delta(k)\right) = \sum_{t=1}^{k} M^{k:t}\Delta(t) + M^{k:0}\chi(0).$$

Define, for k ≥ 1,

$$\chi^t(k) \triangleq M^{k-1:t}\Delta(t), \quad 1 \le t \le k, \qquad \chi^0(k) \triangleq M^{k-1:0}\chi(0), \quad t = 0. \tag{25}$$

We obtain

$$\chi(k) = \sum_{t=0}^{k-1}\chi^t(k), \qquad k \ge 1. \tag{26}$$

Define $z^t(k) \triangleq \chi^t(k)\circ\psi^\dagger(k)$ for 0 ≤ t ≤ k (cf. Equation 20). We have

$$z(k) = \sum_{t=0}^{k-1} z^t(k). \tag{27}$$

We may view each zt(k) as the outcome of a push-sum algorithm, initialized at time t, and apply Theorem 6. This immediately yields the following result, with part (b) an immediate consequence of part (a).

Theorem 14 Suppose Assumption 1 holds. Consider the sequence $\{z_i(k)\}$, 1 ≤ i ≤ n, generated by Algorithm 2. Then:

  (a) For k = 1, 2, …,
    $$\left|z_i(k) - \frac{\mathbf{1}^T\chi(k)}{n}\right| \le \delta\lambda^k\|x(0)\|_1 + \sum_{t=1}^{k-1}\delta\lambda^{k-t}\|\Delta(t)\|_1.$$
  (b) If $\lim_{t\to\infty}\|\Delta(t)\|_1 = 0$, then
    $$\lim_{k\to\infty}\left|z_i(k) - \frac{\mathbf{1}^T\chi(k)}{n}\right| = 0.$$

3. Robust Asynchronous Stochastic Gradient-Push (RASGP)

In this section we present the main contribution of this paper, a distributed stochastic gradient method with asymptotically network-independent and optimal performance over directed graphs which is robust to asynchrony, delays, and link failures.

Recall that we are considering a network $\mathcal{G}$ of n agents whose goal is to cooperatively solve the following minimization problem:

$$\operatorname*{minimize}_{z\in\mathbb{R}^d} \quad F(z) \triangleq \sum_{i=1}^{n} f_i(z),$$

where each $f_i:\mathbb{R}^d\to\mathbb{R}$ is a strongly convex function known only to agent i. We assume agent i has the ability to obtain noisy gradients of the function $f_i$.

The RASGP algorithm is given as Algorithm 3. Note that we use the notation $\hat{g}_i(k)$ for a noisy gradient of the function $f_i(z)$ at $z_i(k)$, i.e.,

$$\hat{g}_i(k) = g_i(k) + \varepsilon_i,$$

where $g_i(k) \triangleq \nabla f_i(z_i(k))$ and $\varepsilon_i$ is a random vector.

The RASGP is based on a standard idea of mixing consensus and gradient steps, first analyzed in Nedic and Ozdaglar (2009). The push-sum scheme of Section 2, inspired by Hadjicostis et al. (2016), is used instead of the consensus scheme, which allows us to handle delays, asynchronicity, and message losses; this is similar to the approach taken in Nedic and Olshevsky (2015). We note that a new step-size strategy is used to handle asynchronicity: when a node wakes up, it takes steps with a step-size proportional to the sum of all the step-sizes during the period it slept. As far as we are aware, this idea is new.
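A sketch of this step-size bookkeeping in isolation (the function names and the placeholder values of n and μ below are ours): each node remembers the last iteration $\kappa_i$ at which it woke and, on waking at iteration k, applies the accumulated step-size $\beta_i(k) = \sum_{t=\kappa_i+1}^{k}\alpha(t)$.

def alpha(t, n=4, mu=2.0):
    """Step-size schedule of Theorem 15: alpha(k) = n/(mu*k), alpha(0) = 0."""
    return 0.0 if t == 0 else n / (mu * t)

def accumulated_stepsize(kappa_i, k):
    """beta_i(k): sum of the step-sizes over the interval node i slept through."""
    return sum(alpha(t) for t in range(kappa_i + 1, k + 1))

# A node that last woke at iteration 4 and wakes again at k = 10 takes one
# gradient step scaled by alpha(5) + ... + alpha(10):
print(accumulated_stepsize(4, 10))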

We will be making the following assumption on the noise vectors.

Assumption 2 $\varepsilon_i$ is an independent random vector with bounded support, i.e., $\|\varepsilon_i\| \le b_i$, i = 1, …, n. Moreover, $\mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}\left[\|\varepsilon_i\|^2\right] \le \sigma_i^2$.
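For concreteness, one noise model satisfying Assumption 2 (our illustrative choice; the paper does not prescribe one) is uniform noise on a coordinate box, scaled so that $\|\varepsilon_i\| \le b_i$ always:

import numpy as np

def noisy_grad(grad, z, b, rng):
    """Noisy gradient oracle satisfying Assumption 2: eps is zero-mean, has
    bounded support ||eps|| <= b, and E||eps||^2 = b^2/3 (so sigma^2 = b^2/3)."""
    d = z.shape[0]
    eps = rng.uniform(-b / np.sqrt(d), b / np.sqrt(d), size=d)
    return grad(z) + eps

rng = np.random.default_rng(15)
g_hat = noisy_grad(lambda z: 2 * z, np.zeros(3), b=0.5, rng=rng)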

Next, we state and prove the main result of this paper, which establishes the convergence rate of Algorithm 3.

Theorem 15 Suppose that:

  1. Assumptions 1 and 2 hold.

  2. Each objective function $f_i(z)$ is $\mu_i$-strongly convex over $\mathbb{R}^d$.

  3. The gradients of each $f_i(z)$ are $L_i$-Lipschitz continuous, i.e., for all $z_1, z_2\in\mathbb{R}^d$,
    $$\|g_i(z_1) - g_i(z_2)\| \le L_i\|z_1 - z_2\|.$$

Then, the RASGP algorithm with the step-size α(k) = n/(μk) for k ≥ 1 and α(0) = 0 will converge to the unique optimum z* with the following asymptotic rate: for all i = 1, …, n, we have

$$\mathbb{E}\left[\|z_i(k) - z^*\|^2\right] \le \frac{\Gamma_u\sigma^2}{k\mu^2} + O_k\!\left(\frac{1}{k^{1.5}}\right),$$

where $\sigma^2 \triangleq \sum_i \sigma_i^2$ and $\mu = \sum_i \mu_i$.

Algorithm 3.

Robust Asynchronous Stochastic Gradient-Push (RASGP)

  1: Initialize the algorithm with $y_i(0) = 1$, $\phi_i^x(0) = 0$, $\phi_i^y(0) = 0$, $\kappa_i(0) = -1$, $\forall i \in \{1, \ldots, n\}$, and $\rho_{ij}^x(0) = 0$, $\rho_{ij}^y(0) = 0$, $\kappa_{ij}(0) = -1$, $\forall (j,i) \in \mathcal{E}$.
  2: At every iteration k = 0, 1, 2, …, for every node i:
  3: if node i wakes up then
  4:   $\beta_i(k) = \sum_{t=\kappa_i+1}^{k} \alpha(t)$;
  5:   $x_i \leftarrow x_i - \beta_i(k)\,\hat{g}_i(k)$;
  6:   $\kappa_i \leftarrow k$;
  7:   $\phi_i^x \leftarrow \phi_i^x + \frac{x_i}{d_i^+ + 1}$,  $\phi_i^y \leftarrow \phi_i^y + \frac{y_i}{d_i^+ + 1}$;
  8:   $x_i \leftarrow \frac{x_i}{d_i^+ + 1}$,  $y_i \leftarrow \frac{y_i}{d_i^+ + 1}$;
  9:   Node i broadcasts $(\phi_i^x, \phi_i^y, \kappa_i)$ to its out-neighbors $N_i^+$.
 10:   Processing the received messages:
 11:   for $(\phi_j^x, \phi_j^y, \kappa_j)$ in the inbox do
 12:     if $\kappa_j > \kappa_{ij}$ then
 13:       $\rho_{ij}^{x\ast} \leftarrow \phi_j^x$,  $\rho_{ij}^{y\ast} \leftarrow \phi_j^y$;
 14:       $\kappa_{ij} \leftarrow \kappa_j$;
 15:     end if
 16:   end for
 17:   $x_i \leftarrow x_i + \sum_{j \in N_i^-}\left(\rho_{ij}^{x\ast} - \rho_{ij}^x\right)$,  $y_i \leftarrow y_i + \sum_{j \in N_i^-}\left(\rho_{ij}^{y\ast} - \rho_{ij}^y\right)$;
 18:   $\rho_{ij}^x \leftarrow \rho_{ij}^{x\ast}$,  $\rho_{ij}^y \leftarrow \rho_{ij}^{y\ast}$, $\forall j \in N_i^-$;
 19:   $z_i \leftarrow x_i / y_i$;
 20: end if
 21: Other variables remain unchanged.
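To fix ideas, here is a self-contained sketch of Algorithm 3 in its simplest regime (synchronous updates, no delays or message losses, a directed cycle), where it reduces to stochastic gradient-push and $\beta_i(k) = \alpha(k)$; the quadratic local losses $f_i(z) = \frac{\mu_i}{2}(z - c_i)^2$ and all constants are our own illustrative choices, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 20000
c = rng.uniform(-1.0, 1.0, size=n)       # local minimizers
mu_i = np.ones(n)                         # strong-convexity moduli
mu = mu_i.sum()
z_star = (mu_i * c).sum() / mu            # minimizer of F = sum_i f_i

x = np.ones(n); y = np.ones(n)
phi_x = np.zeros(n); phi_y = np.zeros(n)  # broadcast running sums
rho_x = np.zeros(n); rho_y = np.zeros(n)  # last sums received (one in-neighbor)
z = x / y
for k in range(1, T + 1):
    alpha = n / (mu * k)                  # step-size from Theorem 15
    g_hat = mu_i * (z - c) + rng.uniform(-0.5, 0.5, size=n)  # noisy gradients
    x = x - alpha * g_hat                 # beta_i(k) = alpha(k): no one sleeps
    # out-degree 1, so each node keeps 1/(d+1) = 1/2 and ships 1/2 as a sum
    phi_x += x / 2.0; phi_y += y / 2.0
    x = x / 2.0; y = y / 2.0
    # node i receives from in-neighbor i-1; with no losses, the difference of
    # running sums equals this round's share
    recv_x = np.roll(phi_x, 1); recv_y = np.roll(phi_y, 1)
    x += recv_x - rho_x; y += recv_y - rho_y
    rho_x, rho_y = recv_x, recv_y
    z = x / y
print(np.max((z - z_star) ** 2))          # small: all agents near z_star
```

The running-sum bookkeeping ($\phi$, $\rho$) is what gives the method its robustness: a lost or delayed message is automatically compensated for by the next message that does arrive.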

Remark 16 We note that each agent stores the variables $x_i$, $y_i$, $\kappa_i$, $z_i$, $\phi_i^x$, $\phi_i^y$, and $\rho_{ij}^x$, $\rho_{ij}^y$, $\kappa_{ij}$ for all in-neighbors $j \in N_i^-$. Hence, the memory requirement of the RASGP algorithm is $O(d_i)$ for each agent i.

We next turn to the proof of Theorem 15. First, we observe that Algorithm 3 is a specific case of multi-dimensional Perturbed Robust Asynchronous Push-Sum. In other words, each coordinate of the vectors $x_i$, $z_i$, $\phi_i^x$ and $\rho_{ij}^x$ will experience an instance of Algorithm 2. Hence, there exists an augmented graph sequence {H(k)} on which Algorithm 3 is equivalent to perturbed push-sum consensus, where each agent $a_h \in V^A$ holds the vectors $x_h$ and $y_h$. In other words, we will be able to apply Theorem 14 to analyze Algorithm 3.

Our first step is to show how to decouple the action of Algorithm 3 coordinate by coordinate. For each coordinate $1 \le \ell \le d$, let $\chi_\ell \in \mathbb{R}^{n+m'}$ stack up the $\ell$th entries of the x-values of all agents (virtual and non-virtual) in $V^A$. Additionally, define $\Delta_\ell(k) \in \mathbb{R}^{n+m'}$ to be the vector stacking up the $\ell$th entries of the perturbations, i.e.,

$$[\Delta_\ell(k)]_i := \begin{cases} -\beta_i(k)\,[\hat{g}_i(k)]_\ell, & \text{if } i \in V,\ \tau_i(k) = 1, \\ 0, & \text{otherwise.} \end{cases}$$

Then, by the definition of the algorithm, we have for all $\ell = 1, \ldots, d$,

$$\chi_\ell(k+1) = M(k)\left(\chi_\ell(k) + \Delta_\ell(k)\right), \qquad \psi(k+1) = M(k)\,\psi(k). \tag{28}$$

These equations write out the action of Algorithm 3 on a coordinate-by-coordinate basis.

In order to prove Theorem 15, we need a few tools and lemmas. As already mentioned, our first step will be to argue that Algorithm 3 converges by application of Theorem 14. This requires showing the boundedness of the perturbations $\Delta_\ell(k)$, which, as we will show, reduces to showing the vectors $z_i(k)$ are bounded. The following lemma will be useful to establish this boundedness.

Lemma 17 (Nedic and Olshevsky, 2016, Lemma 3) Let $q : \mathbb{R}^d \to \mathbb{R}$ be a ν-strongly convex function with ν > 0 which has Lipschitz gradients with constant L. Let $v \in \mathbb{R}^d$ and let $u \in \mathbb{R}^d$ be defined by

$$u = v - \alpha\left(\nabla q(v) + p(v)\right),$$

where $\alpha \in (0, \nu/(8L^2)]$ and $p : \mathbb{R}^d \to \mathbb{R}^d$ is a mapping satisfying

$$\|p(v)\| \le c, \quad \text{for all } v \in \mathbb{R}^d.$$

Then, there exist a compact set $S \subset \mathbb{R}^d$ and a scalar R such that

$$\|u\| \le \begin{cases} \|v\|, & \text{for all } v \notin S, \\ R, & \text{for all } v \in S, \end{cases}$$

where,

$$S := \left\{ z \,\middle|\, q(z) \le q(0) + \frac{2\nu}{8L^2}\left(\|\nabla q(0)\|^2 + c^2\right) \right\} \cup B\!\left(0, \frac{4c}{\nu}\right), \qquad R := \max_{z \in S}\left\{ \|z\| + \frac{\nu}{8L^2}\|\nabla q(z)\| \right\} + \frac{\nu c}{8L^2}.$$

We now argue that the iterates generated by Algorithm 3 are bounded.

Lemma 18 The iterates zi(k) generated by Algorithm 3 will remain bounded.

Proof Let us adopt the notation ψ from previous sections and define $z_\ell(k) := \chi_\ell(k) \oslash \psi(k) \in \mathbb{R}^{n+m'}$. Moreover, adopt the notation $z_h$ for virtual agent $a_h$, $h = n+1, \ldots, n+m'$, as $z_h(k) := x_h(k)/\psi_h(k)$. Also define $u_\ell \in \mathbb{R}^{n+m'}$ by

$$u_\ell(k) := \chi_\ell(k) + \Delta_\ell(k).$$

Since the perturbations are only added to the non-virtual agents, which have strictly positive y-values, we conclude that $[u_\ell(k)]_h = 0$ if $\psi_h(k) = 0$. Hence, the assumptions of Lemma 8 and Corollary 9 are satisfied. Adopting the definitions of $I_k$ and P(k) from previous sections, we get for $i \in I_{k+1}$,

$$[z_\ell(k+1)]_i = \sum_{j \in I_k} P_{ij}(k)\,\frac{[u_\ell(k)]_j}{\psi_j(k)}.$$

Combining the equation above for $\ell = 1, \ldots, d$ we obtain:

$$z_i(k+1) = \sum_{j \in I_k} P_{ij}(k)\,\frac{u_j(k)}{\psi_j(k)}, \tag{29}$$

where $u_j(k) \in \mathbb{R}^d$ is created by collecting the jth entries of the vectors $u_\ell(k)$, $\ell = 1, \ldots, d$, i.e.,

$$u_i(k) = \begin{cases} x_i(k) - \beta_i(k)\,\hat{g}_i(k), & \text{if } i \in V \text{ and } \tau_i(k) = 1, \\ x_i(k), & \text{otherwise.} \end{cases}$$

Now consider each term on the right-hand side of (29) for $j \in I_k$. Suppose $j \le n$ and $\tau_j(k) = 1$; then we have:

$$\frac{u_j(k)}{y_j(k)} = z_j(k) - \frac{\beta_j(k)}{y_j(k)}\left(\nabla f_j(z_j(k)) + \varepsilon_j(k)\right).$$

Since $\lim_{k\to\infty}\alpha(k) = 0$ and $k - \kappa_j(k) \le \Gamma_u$, we have $\lim_{k\to\infty}\beta_j(k) = 0$. Moreover, by Lemma 11, $y_j(k)$ is bounded below; thus, $\lim_{k\to\infty}\beta_j(k)/y_j(k) = 0$ and there exists $k_j$ such that for $k \ge k_j$, $\beta_j(k)/y_j(k) \in (0, \mu_j/(8L_j^2)]$. Applying Lemma 17, it follows that for each j there exist a compact set $S_j$ and a scalar $R_j$ such that for $k \ge k_j$, if $\tau_j(k) = 1$,

$$\left\|\frac{u_j(k)}{y_j(k)}\right\| \le \begin{cases} \|z_j(k)\|, & \text{if } z_j(k) \notin S_j, \\ R_j, & \text{if } z_j(k) \in S_j. \end{cases} \tag{30}$$

Moreover, if $\tau_j(k) = 0$ or $j > n$, we have,

$$\frac{u_j(k)}{y_j(k)} = z_j(k). \tag{31}$$

Let $k_z := \max_i k_i$. Using mathematical induction, we will show that for all $k \ge k_z$:

$$\max_{i \in I_k}\|z_i(k)\| \le \bar{R}, \tag{32}$$

where $\bar{R} := \max\{\max_i R_i,\ \max_{j \in I_{k_z}}\|z_j(k_z)\|\}$. Equation (32) holds for $k = k_z$. Suppose it is true for some $k \ge k_z$. Then by (30) and (31) we have,

$$\left\|\frac{u_i(k)}{y_i(k)}\right\| \le \max\{R_i,\ \|z_i(k)\|\} \le \bar{R}. \tag{33}$$

Also by (29), for $i \in I_{k+1}$, $z_i(k+1)$ is a convex combination of the $u_j(k)/y_j(k)$'s, where $j \in I_k$. Hence,

$$\|z_i(k+1)\| \le \sum_{j \in I_k} P_{ij}(k)\left\|\frac{u_j(k)}{\psi_j(k)}\right\| \le \bar{R}.$$

Define $B_z := \max\{\bar{R},\ \max_{i \in I_k,\, k < k_z}\|z_i(k)\|\}$; then $\|z_i(k)\| \le B_z$, $\forall k \ge 0$. ■

We next explore a convenient way to rewrite Algorithm 3. Let us introduce the quantity $w_i(k)$, which can be interpreted as the x-value of agent i if it performed a gradient step at every iteration, even when asleep:

$$w_i(k) := \begin{cases} x_i(k) - \left(\sum_{t=\kappa_i(k)+1}^{k-1}\alpha(t)\right)g_i(k), & \text{if } i \in V, \\ x_i(k), & \text{otherwise.} \end{cases} \tag{34}$$

Also, define $w_\ell \in \mathbb{R}^{n+m'}$ by collecting the $\ell$th dimension of all the $w_i$'s, and let $\bar{w}(k) := \left(\sum_{i=1}^{n+m'} w_i(k)\right)/n$. Moreover, define $g_\ell \in \mathbb{R}^{n+m'}$ by collecting the $\ell$th value of the gradients of all agents (0 for virtual agents), i.e.,

$$[g_\ell(k)]_i = \begin{cases} [g_i(k)]_\ell, & \text{if } i \in V, \\ 0, & \text{otherwise.} \end{cases}$$

Additionally, define $\hat{\varepsilon}_i(k) \in \mathbb{R}^d$ as the noise injected into the system at time k by agent i, i.e.,

$$\hat{\varepsilon}_i(k) = \begin{cases} \beta_i(k)\,\varepsilon_i(k), & \text{if } i \in V \text{ and } \tau_i(k) = 1, \\ 0, & \text{otherwise,} \end{cases}$$

and $\hat{\varepsilon}_\ell(k) \in \mathbb{R}^{n+m'}$ as the vector collecting the $\ell$th values of all the $\hat{\varepsilon}_i(k)$'s.

We then have the following lemma.

Lemma 19

$$w_\ell(k+1) = M(k)\left(w_\ell(k) - \alpha(k)\,g_\ell(k) - \hat{\varepsilon}_\ell(k)\right). \tag{35}$$

Proof We consider two cases:

  • If $\tau_i(k) = 0$, then (35) reduces to $w_i(k+1) = w_i(k) - \alpha(k)g_i(k)$; because node i did not update at time k, we have $g_i(k) = g_i(k+1)$, and this is the correct update.

  • For all other nodes (i.e., both virtual nodes and nodes with $\tau_i(k) = 1$), we have $[w_\ell(k) - \alpha(k)g_\ell(k) - \hat{\varepsilon}_\ell(k)]_i = [\chi_\ell(k) + \Delta_\ell(k)]_i$ in (28). Since $\chi_\ell(k+1) = M(k)(\chi_\ell(k) + \Delta_\ell(k))$ and, using the definition of $w_i(k)$, we have that for these nodes,
    $$w_i(k+1) = x_i(k+1),$$

(28) implies the conclusion. ■

This lemma allows us to straightforwardly analyze how the average of w(k) evolves. Indeed, summing all the elements of (35) (recall that M(k) is column-stochastic, so the sum of the entries is preserved) and dividing by n for each $\ell = 1, \ldots, d$, we obtain,

$$\bar{w}(k+1) = \bar{w}(k) - \frac{\alpha(k)}{n}\sum_{i=1}^{n}g_i(k) - \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i(k) = \bar{w}(k) - \frac{\alpha(k)}{n}\sum_{i=1}^{n}\nabla f_i(\bar{w}(k)) - \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i(k) - \frac{\alpha(k)}{n}\sum_{i=1}^{n}\left(g_i(k) - \nabla f_i(\bar{w}(k))\right). \tag{36}$$

We next give a sequence of lemmas to the effect that all the quantities generated by the algorithm are close to each other over time. Define,

$$\bar{x}(k) := \frac{1}{n}\sum_{a_h \in V^A} x_h(k),$$

where, recall, $V^A$ is our notation for all the nodes in the augmented graph (i.e., including virtual nodes). Moreover, we will extend the definition of $\beta_i(k)$ from Line 4 of Algorithm 3 to all k via the same formula $\beta_i(k) := \sum_{t=\kappa_i(k)+1}^{k}\alpha(t)$. Our first lemma will show that each $z_i(k)$ closely tracks $\bar{x}(k)$.

Lemma 20 Using Algorithm 3 with $\alpha(k) = n/(\mu k)$, under the assumptions of Theorem 15, we have for each i, $\|z_i(k+1) - \bar{x}(k+1)\| = O_k(1/k)$.

Proof By Theorem 14(a) we have for each ℓ,

$$\left|[z_\ell(k+1)]_i - \frac{\mathbf{1}^\top\chi_\ell(k+1)}{n}\right| \le \delta\lambda^k\,\|\chi_\ell(0)\|_1 + \sum_{t=1}^{k}\delta\lambda^{k-t}\,\|\Delta_\ell(t)\|_1.$$

Summing the above inequality over $\ell = 1, \ldots, d$ we obtain,

$$\|z_i(k+1) - \bar{x}(k+1)\|_1 \le \sum_{j=1}^{n}\left(\delta\lambda^k\,\|x_j(0)\|_1 + \sum_{t=1}^{k}\delta\lambda^{k-t}\,\beta_j(t)\,\tau_j(t)\,\|\hat{g}_j(t)\|_1\right).$$

Moreover,

$$\beta_i(k) = \sum_{t=\kappa_i(k)+1}^{k}\frac{n}{\mu t} \le \frac{n}{\mu}\cdot\frac{k - \kappa_i(k)}{\kappa_i(k)+1}. \tag{37}$$

But,

$$\kappa_i(k) < k \le \kappa_i(k) + \Gamma_u.$$

Since $\Gamma_u \ge 1$, we obtain

$$k \le \left(\kappa_i(k)+1\right)\Gamma_u,$$

or,

$$\frac{1}{\kappa_i(k)+1} \le \frac{\Gamma_u}{k}.$$

Thus, from (37) we have,

$$\beta_i(k) \le \frac{n\,\Gamma_u^2}{\mu k}. \tag{38}$$

Define,

$$M_j := \max_{\|z\| \le B_z}\|g_j(z)\|_1, \tag{39}$$

and observe that $M_j$ is finite by Lemma 18. Also $\tau_j(k) \le 1$. We obtain,

$$\|z_i(k+1) - \bar{x}(k+1)\|_1 \le \sum_{j=1}^{n}\left(\delta\lambda^k\,\|x_j(0)\|_1 + \sum_{t=1}^{k}\delta\lambda^{k-t}\,\frac{n\Gamma_u^2}{\mu t}\left(M_j + b_j\right)\right).$$

Let RHS denote the right-hand side of the relation above. We have,

$$\text{RHS} = \sum_{j=1}^{n}\left(\delta\lambda^k\,\|x_j(0)\|_1 + \frac{\delta n\Gamma_u^2}{\mu}\left(M_j+b_j\right)\left(\sum_{t=1}^{\lfloor k/2\rfloor}\frac{\lambda^{k-t}}{t} + \sum_{t=\lfloor k/2\rfloor+1}^{k}\frac{\lambda^{k-t}}{t}\right)\right) \le \sum_{j=1}^{n}\left(\delta\lambda^k\,\|x_j(0)\|_1 + \frac{\delta n\Gamma_u^2}{\mu}\left(M_j+b_j\right)\left(\frac{k}{2}\,\lambda^{k/2} + \frac{2}{(1-\lambda)k}\right)\right) = O_k\!\left(\frac{1}{k}\right),$$

where we used the following relations,

$$\sum_{t=1}^{\lfloor k/2\rfloor}\frac{\lambda^{k-t}}{t} \le \frac{k}{2}\,\lambda^{k-\lfloor k/2\rfloor} \le \frac{k}{2}\,\lambda^{k/2}, \qquad \sum_{t=\lfloor k/2\rfloor+1}^{k}\frac{\lambda^{k-t}}{t} \le \sum_{t=0}^{\lceil k/2\rceil-1}\frac{\lambda^{t}}{\lfloor k/2\rfloor+1} \le \frac{2}{(1-\lambda)k}.$$

Finally, since $\|v\|_2 \le \|v\|_1$ for all vectors v, the proof is complete. ■
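The split-sum estimate above is easy to check numerically; the following quick sketch (our own, with an arbitrary choice of λ) confirms that $S(k) = \sum_{t=1}^{k}\lambda^{k-t}/t$ indeed decays like $1/k$:

```python
# k * S(k) should stay bounded (about 1/(1 - lam) for large k),
# consistent with the O_k(1/k) claim in the proof of Lemma 20.
lam = 0.9
for k in [10, 100, 1000, 10000]:
    S = sum(lam ** (k - t) / t for t in range(1, k + 1))
    print(k, k * S)   # the products approach 1/(1 - lam) = 10
```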

An immediate consequence of this lemma is that the quantities x¯(k) and w¯(k) are close to each other.

Lemma 21 Using Algorithm 3 with $\alpha(k) = n/(\mu k)$, under the assumptions of Theorem 15, we have $\|\bar{x}(k) - \bar{w}(k)\| = O_k(1/k)$.

Proof By the definition of w we have,

$$\bar{x}(k) - \bar{w}(k) = \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{t=\kappa_i(k)+1}^{k-1}\alpha(t)\right)g_i(k).$$

Using (38) we have,

$$\|\bar{x}(k) - \bar{w}(k)\| \le \frac{1}{n}\sum_{i=1}^{n}\beta_i(k)\,M_i \le \sum_{i=1}^{n}\frac{\Gamma_u^2\,M_i}{\mu k} = O_k\!\left(\frac{1}{k}\right),$$

where $M_i$ was defined in (39). ■

We next remark on a couple of implications of the past series of lemmas.

Corollary 22 We have $\|z_i(k) - \bar{w}(k)\| = O_k\!\left(\frac{1}{k}\right)$.

Lemma 23 $\|g_i(k) - \nabla f_i(\bar{w}(k))\| = O_k\!\left(\frac{1}{k}\right)$.

Proof Since $\nabla f_i$ is $L_i$-Lipschitz, we have,

$$\|g_i(k) - \nabla f_i(\bar{w}(k))\| \le L_i\,\|z_i(k) - \bar{w}(k)\|.$$

Using Corollary 22, the lemma is proved. ■

We are now in a position to rewrite Algorithm 3 as a sort of perturbed gradient descent. Let us define,

$$\eta(k) := \frac{1}{\mu k}\sum_{i=1}^{n}\left(g_i(k) - \nabla f_i(\bar{w}(k))\right).$$

By Lemma 23, $\|\eta(k)\| = O_k(1/k^2)$. Therefore, there exists $B_\eta$ such that $\|\eta(k)\| \le B_\eta/k^2$ for all $k \ge 1$.

By (36) we have,

$$\bar{w}(k+1) = \bar{w}(k) - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k) - \eta(k), \tag{40}$$

where

  • The function $F := \sum_{i=1}^{n}f_i : \mathbb{R}^d \to \mathbb{R}$ is μ-strongly-convex with L-Lipschitz gradient, where $L := \sum_{i=1}^{n}L_i$.

  • The noise $\bar{\varepsilon}(k) := \left(\sum_{i=1}^{n}\hat{\varepsilon}_i(k)\right)/n$ is bounded (i.e., $\bar{\varepsilon}(k) \in B(0, r_e)$) with probability one, where $r_e := (\Gamma_u/\mu)\sum_j b_j$, and $\mathbb{E}[\bar{\varepsilon}(k)] = 0$.

In other words, with the exception of the η(k) term, what we have is exactly a stochastic gradient descent method on the function F(⋅).
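To illustrate this reduction, here is a scalar simulation of recursion (40) under toy assumptions of our own (a one-dimensional $F(w) = \frac{\mu}{2}(w - z^*)^2$, uniform noise, and a deterministic $\eta(k) = 1/k^2$ bias); the mean-square error decays like $1/k$, as Lemma 28 and the refined bound (49) below predict:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, z_star, trials, T = 1.0, 3.0, 1000, 5000
w = np.zeros(trials)                          # independent runs, vectorized
for k in range(1, T + 1):
    grad = mu * (w - z_star)                  # exact gradient of F at w
    eps = rng.uniform(-0.5, 0.5, size=trials) / (mu * k)  # zero-mean, O(1/k)
    w = w - grad / (mu * k) - eps - 1.0 / k**2  # step 1/(mu*k), noise, eta(k)
print(T * np.mean((w - z_star) ** 2))         # roughly constant in T => O(1/T)
```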

The following lemmas bound $\bar{\varepsilon}(k)$. Let us define $\nu_i(k) := k - \kappa_i(k)$ as the number of iterations agent i has skipped since its last update. By Assumption 1, $\nu_i(k) \le \Gamma_u$.

Lemma 24 We have $\beta_i(k) = O_k(1/k)$, ∀i. Moreover,

$$\beta_i(k) \le \frac{n\,\nu_i(k)}{\mu k} + O_k(k^{-2}).$$

Proof Since $\nu_i(k) \le \Gamma_u$, ∀i, we have for $\kappa_i(k) \ge 1$,

$$\beta_i(k) = \sum_{t=\kappa_i(k)+1}^{k}\frac{n}{\mu t} \le \frac{n}{\mu}\ln\!\left(\frac{k}{\kappa_i(k)}\right) = \frac{n}{\mu}\ln\!\left(\frac{k}{k - \nu_i(k)}\right) = \frac{n}{\mu}\ln\!\left(1 + \frac{\nu_i(k)}{k - \nu_i(k)}\right) \le \frac{n\,\nu_i(k)}{\mu\left(k - \nu_i(k)\right)} = \frac{n\,\nu_i(k)}{\mu k} + O_k(k^{-2}). \quad \blacksquare$$

Corollary 25 $\mu k\,\bar{\varepsilon}(k)$ is bounded.

Lemma 26 There exists $B_\epsilon > 0$ such that

$$\mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right] \le \frac{\Gamma_u^2}{\mu^2 k^2}\,\sigma^2 + \frac{B_\epsilon}{k^4}.$$

Proof Using Lemma 24, we have for $k > \Gamma_u$,

$$\mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right] = \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\beta_i(k)\,\varepsilon_i(k)\,\tau_i(k)\right\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\beta_i^2(k)\,\mathbb{E}\left[\|\varepsilon_i(k)\|^2\right] \le \frac{1}{n^2}\sum_{i=1}^{n}\beta_i^2(k)\,\sigma_i^2 \le \frac{\Gamma_u^2}{\mu^2 k^2}\,\sigma^2 + O_k(k^{-4}),$$

where the second equality is the result of the noise terms being independent and zero-mean. ■

Our next observation is a technical lemma which is essentially a rephrasing of Lemma 17 above.

Lemma 27 There exist a constant $B_w$ and a time $k_w$ such that $\|\bar{w}(k)\| \le B_w$ with probability one, for $k \ge k_w$.

Proof We have

$$\bar{w}(k+1) = \bar{w}(k) - \frac{1}{\mu k}\left[\nabla F(\bar{w}(k)) + \mu k\left(\bar{\varepsilon}(k) + \eta(k)\right)\right],$$

where $\mu k\left(\bar{\varepsilon}(k) + \eta(k)\right)$ is bounded. Moreover, there exists $k_w$ such that for $k \ge k_w$, $\frac{1}{\mu k} \in (0, \mu/(8L^2)]$. Therefore, by Lemma 17 there exist a compact set $S_w$ and a scalar $R_w > 0$ such that for $k \ge k_w$,

$$\|\bar{w}(k+1)\| \le \begin{cases}\|\bar{w}(k)\|, & \text{for } \bar{w}(k) \notin S_w, \\ R_w, & \text{for } \bar{w}(k) \in S_w.\end{cases}$$

Therefore, setting $B_w := \max\{R_w,\ \|\bar{w}(k_w)\|\}$ completes the proof. ■

As a consequence of this lemma, and because $\|\eta(k)\| \le B_\eta$, there is a constant $B_1$ such that for $k \ge k_w$,

$$\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\| \le B_1, \tag{41}$$

with probability one. This now puts us in a position to show that $\bar{w}(k)$ converges in mean square to the optimal solution.

Lemma 28 $\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] \to 0$.

Proof Using the definition of $k_w$ from Lemma 27, we have that for $k \ge k_w$,

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \mathbb{E}\left[\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\|^2 + 2\|\eta(k)\|\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\| + \|\eta(k)\|^2\right].$$

We will bound each of the terms on the right. We begin with the easiest one, which is the last one:

$$\|\eta(k)\|^2 \le \frac{B_\eta^2}{k^4}. \tag{42}$$

The middle term is bounded as

$$2\|\eta(k)\|\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\| \le \frac{2B_\eta B_1}{k^2}, \tag{43}$$

where we used (41).

Finally, we turn to the first term, which we denote by $T_1$:

$$T_1 \le \mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] - \frac{2}{\mu k}\,\mathbb{E}\left[\nabla F(\bar{w}(k))^\top\left(\bar{w}(k) - z^*\right)\right] + \frac{L^2}{\mu^2 k^2}\,\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right],$$

where we used the usual inequality $\|\nabla F(\bar{w}(k))\|^2 \le L^2\|\bar{w}(k) - z^*\|^2$, which follows from $\nabla F(\cdot)$ being L-Lipschitz and $\nabla F(z^*) = 0$. Now, using the standard inequality

$$\nabla F(\bar{w}(k))^\top\left(\bar{w}(k) - z^*\right) \ge F(\bar{w}(k)) - F(z^*) + \frac{\mu}{2}\|\bar{w}(k) - z^*\|^2 \ge \mu\,\|\bar{w}(k) - z^*\|^2,$$

and Lemma 26, we obtain,

$$T_1 \le \left(1 - \frac{2}{k} + \frac{L^2}{\mu^2 k^2}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \frac{\Gamma_u^2}{\mu^2 k^2}\,\sigma^2 + \frac{B_\epsilon}{k^4}. \tag{44}$$

Now putting together (42), (43), and (44), we get,

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \left(1 - \frac{2}{k} + \frac{L^2}{\mu^2 k^2}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \frac{\Gamma_u^2\sigma^2}{\mu^2 k^2} + \frac{2B_\eta B_1}{k^2} + \frac{B_\eta^2 + B_\epsilon}{k^4}.$$

For large enough k, we can bound the inequality above as,

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \left(1 - \frac{1.5}{k}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \frac{B_2}{k^2}, \tag{45}$$

where $B_2 := \Gamma_u^2\sigma^2/\mu^2 + 2B_\eta B_1 + B_\eta^2 + B_\epsilon$. Using Lemma 29, stated next, we conclude $\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \to 0$. ■

Lemma 29 Let $a > 1$, $b \ge 0$, and let $\{x_t\}$ be a non-negative sequence which satisfies,

$$x_{t+1} \le \left(1 - \frac{a}{t}\right)x_t + \frac{b}{t^2}, \quad \text{for } t \ge t' > 0.$$

Then for all $t \ge t'$ we have,

$$x_t \le \frac{m}{t},$$

where $m := \max\{t'x_{t'},\ b/(a-1)\}$.

This lemma is stated and proved for t′ = 1 in (Rakhlin et al., 2012, Lemma 3), and the case of general t′ follows immediately.
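The lemma is also easy to verify numerically; the following sketch (with arbitrary illustrative constants of our own choosing) iterates the recursion and checks the bound $x_t \le m/t$ at every step:

```python
# Lemma 29 check: a = 1.5, b = 2, t' = 2, x_{t'} = 5 are illustrative values.
a, b, t0, x = 1.5, 2.0, 2, 5.0
m = max(t0 * x, b / (a - 1))     # m = max(t'*x_{t'}, b/(a-1)) = 10
for t in range(t0, 100000):
    assert x <= m / t + 1e-12, (t, x, m / t)
    x = (1 - a / t) * x + b / t**2
print("bound m/t holds up to t = 100000; m =", m)
```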

We are almost ready to complete the proof of Theorem 15; all that is needed is to refine the convergence rate of $\bar{w}(k)$ to $z^*$. Now, as a consequence of (45) and Lemma 29, we may use the inequality $\mathbb{E}[|X|] \le \sqrt{\mathbb{E}[X^2]}$ to obtain that

$$\mathbb{E}\left[\|\bar{w}(k) - z^*\|\right] = O_k\!\left(\frac{1}{\sqrt{k}}\right). \tag{46}$$

Furthermore, since $\mu k\,\bar{\varepsilon}(k)$ has bounded support by Corollary 25, we also have that

$$\mathbb{E}\left[\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\|\right] = O_k\!\left(\frac{1}{\sqrt{k}}\right). \tag{47}$$

We now use these observations to provide a proof of our main result.

Proof of Theorem 15 Essentially, we rewrite the proof of Lemma 28, but now using the fact that $\mathbb{E}[\|\bar{w}(k) - z^*\|] = O_k(1/\sqrt{k})$ from (46). This allows us to make two modifications to the arguments of that lemma. First, we can now replace (43) by

$$\mathbb{E}\left[2\|\eta(k)\|\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\|\right] \le \frac{2B_\eta}{k^2}\,O_k\!\left(\frac{1}{\sqrt{k}}\right), \tag{48}$$

where we used (47). Second, putting together (42), (48), and (44), we obtain:

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \left(1 - \frac{2}{k} + \frac{L^2}{\mu^2 k^2}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right] + \frac{B_\eta^2}{k^4} + \frac{2B_\eta}{k^2}\,O_k\!\left(\frac{1}{\sqrt{k}}\right),$$

which, again using the fact that $\mathbb{E}[\|\bar{w}(k) - z^*\|^2] = O_k(1/k)$, we may rewrite as,

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \left(1 - \frac{2}{k}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right] + O_k\!\left(\frac{1}{k^{2.5}}\right).$$

To save space, let us define $a_k := \mathbb{E}[\|\bar{w}(k) - z^*\|^2]$. Multiplying both sides of the relation above by $k^2$ we obtain,

$$a_{k+1}\,k^2 \le a_k\left(1 - \frac{2}{k}\right)k^2 + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right]k^2 + O_k(k^{-0.5}).$$

Note that,

$$\left(1 - \frac{2}{k}\right)k^2 = k^2 - 2k < (k-1)^2.$$

Thus,

$$a_{k+1}\,k^2 \le a_k\,(k-1)^2 + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right]k^2 + O_k(k^{-0.5}).$$

Summing the relation above for $k = 0, \ldots, T$ implies,

$$a_{T+1}\,T^2 \le \sum_{k=0}^{T}\mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right]k^2 + O_T(T^{0.5}).$$

Now, let us estimate the first term on the right-hand side of the relation above,

$$\sum_{k=0}^{T}\mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right]k^2 \le \sum_{k=0}^{T}\sum_{i=1}^{n}\frac{\beta_i^2(k)}{n^2}\,\sigma_i^2\,\tau_i(k)\,k^2 = \sum_{i=1}^{n}\frac{\sigma_i^2}{\mu^2}\sum_{k=0}^{T}\nu_i(k)^2\,\tau_i(k) + O_T(\ln T),$$

where we used Lemma 24 in the last equality. Define $t_i(j)$ as the jth time agent i has woken up, and set $t_i(0) = -1$. Then we can rewrite the relation above as,

$$\sum_{k=0}^{T}\nu_i(k)^2\,\tau_i(k) = \sum_{j \ge 1:\ t_i(j) \le T}\left(t_i(j) - t_i(j-1)\right)^2 \le \sum_{j \ge 1:\ t_i(j) \le T}\Gamma_u\left(t_i(j) - t_i(j-1)\right) \le \Gamma_u\,(T+1).$$

Combining the relations above and then dividing both sides by $T^2$, we obtain,

$$a_{T+1} \le \frac{\Gamma_u\,\sigma^2}{\mu^2\,T} + O_T(T^{-1.5}). \tag{49}$$

We next argue that the same guarantee holds for every $z_i(k)$. Indeed, for each $i = 1, \ldots, n$,

$$\|z_i(k) - z^*\|^2 = \|z_i(k) - \bar{w}(k)\|^2 + 2\left(z_i(k) - \bar{w}(k)\right)^\top\left(\bar{w}(k) - z^*\right) + \|\bar{w}(k) - z^*\|^2 \le \|z_i(k) - \bar{w}(k)\|^2 + 2\,\|z_i(k) - \bar{w}(k)\|\,\|\bar{w}(k) - z^*\| + \|\bar{w}(k) - z^*\|^2.$$

Now from Corollary 22, we know that with probability one, $\|z_i(k) - \bar{w}(k)\| = O_k(1/k)$, hence $\|z_i(k) - \bar{w}(k)\|^2 = O_k(1/k^2)$. Taking expectations of both sides and using (49) along with the usual bound $\mathbb{E}[|X|] \le \sqrt{\mathbb{E}[X^2]}$, we have

$$\mathbb{E}\left[\|z_i(k) - z^*\|^2\right] = O_k\!\left(\frac{1}{k^2}\right) + O_k\!\left(\frac{1}{k^{1.5}}\right) + \mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right].$$

Putting this together with (49) completes the proof. ■

3.1. Time-Varying Graphs

We remark that Theorems 6, 14 and 15 all extend verbatim to the case of time-varying graphs with no message losses. Indeed, only one problem appears in extending the proofs in this paper to time-varying graphs: a node i may send a message to node j; that message may be lost; and afterwards node i never sends anything to node j again. In this case, Lemmas 7 and 11 do not hold. Indeed, examining Lemma 11, we observe that what can very well happen is that all of $\chi_i(k)$ and $\psi_i(k)$ are "lost" over time into messages that never arrive. However, as long as no messages are lost, the proofs in this paper extend to the time-varying case verbatim. On a technical level, the results still hold if $u_{ij}^x(k) = 0$, $u_{ij}^y(k) = 0$ (virtual node $c_{ij} \in V^A$ holds no lost message) when link (i,j) is removed from the network at time k, and the graph G stays strongly connected (or B-connected, i.e., there exists a positive integer B such that the union of every B consecutive graphs is strongly connected).

3.2. On the Bounds for Delays, Asynchrony, and Message Losses

It is natural to ask to what extent the assumption of finite upper bounds on delays, asynchrony, and message losses is really necessary. A natural example which falls outside our framework is a fixed graph G where, at each time step, every link in G appears with probability 1/2. A more general model might involve a different probability $p_e$ of failure for each edge e.

We observe that our result can already handle this case in the following manner. For simplicity, let us stick with the scenario where every link appears with probability 1/2. Then the probability that, after time t, some link has not appeared is at most $m(1/2)^t$, where m is the number of edges in G. This implies that if we choose $B = O(\log(mnT))$, then with high probability the sequence of graphs $G_1, \ldots, G_T$ is B-connected, as the union bound below makes precise.
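Written out (with a failure probability δ that we introduce for illustration), the union bound over the at most T windows of length B reads:

```latex
% Each of the m links is missing from a given window of length B with
% probability 2^{-B}, so over a horizon of T steps,
\Pr\bigl[\text{some window of length } B \text{ misses some link}\bigr]
  \;\le\; T\, m\, 2^{-B} \;\le\; \delta
\quad\text{whenever}\quad B \;\ge\; \log_2\!\frac{mT}{\delta}.
```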

Thus our theorem applies to this case, albeit at the expense of some logarithmic factors due to the choice of B. We remark that it is possible to get rid of these factors by directly analyzing the decrease in $\mathbb{E}[\|z(t) - z^*\|_2^2]$ coming from the random choice of the graph G. Since our arguments are already quite lengthy, we do not pursue this generalization here, and refer the reader to Lobel and Ozdaglar (2010) and Srivastava and Nedic (2011), where similar arguments have been made.

4. Numerical Simulations

4.1. Setup

In this section, we simulate the RASGP algorithm on two classes of graphs, namely, random directed graphs and bidirectional cycle graphs. The main objective function is chosen to be a strongly convex and smooth Support Vector Machine (SVM), i.e.,

$$F(\omega, \gamma) = \frac{1}{2}\left(\|\omega\|^2 + \gamma^2\right) + C_N\sum_{j=1}^{N}h\!\left(b_j\left(A_j^\top\omega + \gamma\right)\right),$$

where $\omega \in \mathbb{R}^{d-1}$ and $\gamma \in \mathbb{R}$ are the optimization variables, and $A_j \in \mathbb{R}^{d-1}$, $b_j \in \{-1, +1\}$, $j = 1, \ldots, N$, are the data points and their labels, respectively. The coefficient $C_N$ penalizes the points outside of the soft margin; it depends on the total number of data points N, and we set $C_N = c/N$ with $c = 500$ in our simulations. Here, $h : \mathbb{R} \to \mathbb{R}$ is the smoothed hinge loss, initially introduced in Rennie and Srebro (2005), defined as follows:

$$h(\xi) = \begin{cases} 0.5 - \xi, & \text{if } \xi < 0, \\ 0.5\,(1 - \xi)^2, & \text{if } 0 \le \xi < 1, \\ 0, & \text{if } 1 \le \xi. \end{cases}$$

To solve this problem in a distributed way, we suppose all data points are spread among the agents. Hence, the local objective functions are $f_i(\omega, \gamma) = \frac{1}{2n}\left(\|\omega\|^2 + \gamma^2\right) + C_N\sum_{j \in D_i}h\!\left(b_j\left(A_j^\top\omega + \gamma\right)\right)$, where $D_i \subset \{1, 2, \ldots, N\}$ is the index set of the data points of agent i. We choose the size of the data set for each local function to be a constant ($|D_i| = 50$), thus $N = 50n$. It is easy to check that each $f_i$ has Lipschitz gradients and is strongly convex with $\mu_i = 1/n$.
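For concreteness, here is a short sketch of the smoothed hinge loss, its derivative, and the resulting local gradient that an agent would feed into Algorithm 3 (the function and variable names are ours, not the paper's):

```python
import numpy as np

def h(xi):
    """Smoothed hinge loss of Rennie and Srebro (2005)."""
    return np.where(xi < 0, 0.5 - xi, np.where(xi < 1, 0.5 * (1 - xi) ** 2, 0.0))

def h_prime(xi):
    """Derivative of h: -1 for xi < 0, xi - 1 on [0, 1), and 0 for xi >= 1."""
    return np.where(xi < 0, -1.0, np.where(xi < 1, xi - 1.0, 0.0))

def local_grad(omega, gamma, A_i, b_i, C_N, n):
    """Gradient of f_i(omega, gamma) = (1/2n)(||omega||^2 + gamma^2)
    + C_N * sum_{j in D_i} h(b_j (A_j^T omega + gamma)).
    A_i has shape (|D_i|, d-1); b_i has entries in {-1, +1}."""
    margins = b_i * (A_i @ omega + gamma)   # shape (|D_i|,)
    coef = C_N * h_prime(margins) * b_i     # chain-rule factor per data point
    g_omega = omega / n + A_i.T @ coef
    g_gamma = gamma / n + coef.sum()
    return g_omega, g_gamma
```

Note that h is continuously differentiable (the pieces match in value and slope at 0 and 1), which is what makes each $f_i$ smooth.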

We will compare our results with a centralized gradient descent algorithm, which updates every Γu iterations using the step-size sequence αc(k) = Γu/(μk), in the direction of the sum of the gradients of all agents.

To make the gradient estimates stochastic, we add uniformly distributed noise $\varepsilon_i \sim U[-b/2, b/2]^d$ to the gradient estimates of each agent and $\varepsilon_c \sim U[-nb/2, nb/2]^d$ to the gradient of the centralized gradient descent, where $U[b_1, b_2]^d$ denotes the d-dimensional uniform distribution over the interval $[b_1, b_2)$, $b_1 < b_2$. Note that $\varepsilon_i$ and $\varepsilon_c$ are bounded and have zero mean, with $\mathbb{E}[\|\varepsilon_i\|^2] = db^2/12$ and $\mathbb{E}[\|\varepsilon_c\|^2] = dn^2b^2/12$. We set $b = 4$ for all simulations.
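The stated second moment follows from the per-coordinate variance $b^2/12$ of the uniform distribution; a quick sanity check of our own:

```python
import numpy as np

# For eps ~ U[-b/2, b/2]^d, each coordinate has variance b^2/12,
# hence E||eps||^2 = d * b^2 / 12.
rng = np.random.default_rng(2)
d, b = 3, 4.0
eps = rng.uniform(-b / 2, b / 2, size=(200000, d))
print((eps ** 2).sum(axis=1).mean(), d * b ** 2 / 12)  # both close to 4.0
```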

Agents wake up with probability $P_w$ and links fail with probability $P_f$, unless they reach their maximum allowed values, in which case the algorithm forces the agent to wake up or the link to work successfully. The link delays are chosen uniformly between 1 and $\Gamma_{del}$.

Each data set $D_i$ is synthetically generated by picking 25 data points around each of the centers (1,1) and (3,3), drawn from multivariate normal distributions and labeled −1 and +1, respectively. In generating strongly connected random graphs, we pick each edge with probability 0.5 and then check whether the resulting graph is strongly connected; if it is not, we repeat the process. Since the initial step-sizes for the distributed algorithm can be very large (e.g., α(1) = 50 for n = 50), to stabilize the algorithms, both algorithms are started at $k_0 = 100$. This does not affect the asymptotic convergence performance. Moreover, the initial point of the centralized algorithm and of all agents in the RASGP is chosen as $\mathbf{1}_d$.

Let us denote by $\bar{z}(k) := (1/n)\sum_{i=1}^{n}z_i(k)$ the average of the z-values of the non-virtual agents. Then, we define the optimization errors $E_{dist}(k) := \|\bar{z}(k) - z^*\|^2$ and $E_c(k) := \|x_c(k) - z^*\|^2$ for the RASGP and centralized stochastic gradient descent, respectively.

Since our performance guarantees are for the expectation of (squared) errors, for each network setting we perform up to 1000 Monte Carlo simulations and use their corresponding performance to estimate the average behavior of the algorithms. Since accurately estimating the true expected value requires an extremely large number of simulations, in order to alleviate the effect of spikes and high variance we take the following steps. First, a batch of simulations is performed and its average is calculated. Next, to obtain a smoother plot, an average over every 100 iterations is taken. Finally, the median of these outputs over all the batches is our estimate of the expected value.

We report two figures for each setting: one showing the errors $E_{dist}$ and $E_c$, and another showing $k \times E_{dist}$ and $k \times E_c$ to demonstrate the convergence rates.

Finally, to study the non-asymptotic behavior of the RASGP and its dependence on the network size n, we compare the performance of centralized stochastic gradient descent and the RASGP over a bidirectional cycle graph, with noise variances of $n^2\hat{\sigma}^2$ and $\sigma_i^2 = \hat{\sigma}^2$, respectively. Then, we plot the ratio $E_c(k)/E_{dist}(k)$ against n for different iterations k.

4.2. Results

Our simulation results are consistent with our theoretical claims: the performance of the centralized and decentralized methods grows closer over time, demonstrating the achievement of an asymptotic network-independent convergence rate.

Fig. 3 shows that when there is no link failure or delay and all agents wake up at every iteration (Γs = 2), RASGP and centralized gradient descent have very similar performance. When we allow links to have delays and failures (see Fig. 4), as well as asynchronous updates (see Fig. 5), it takes longer for RASGP to reach its asymptotic convergence rate.

Figure 3: Results on a directed cycle graph of size n = 50, synchronous with no delays and link failures ($P_w = 1$, $P_f = 0$, $\Gamma_{del} = \Gamma_f = 0$, $\Gamma_u = 1$, $\Gamma_s = 2$).

Figure 4: Results on a directed cycle graph of size n = 50, synchronous with delays and link failures ($P_w = 1$, $P_f = 0.3$, $\Gamma_{del} = \Gamma_f = 3$, $\Gamma_u = 1$, $\Gamma_s = 7$).

Figure 5: Results on a directed cycle graph of size n = 50, asynchronous with delays and link failures ($P_w = 0.5$, $P_f = 0.3$, $\Gamma_{del} = \Gamma_f = 3$, $\Gamma_u = 3$, $\Gamma_s = 17$).

We observe that, with all the other parameters fixed, the RASGP performs better on a random graph than on a cycle graph (see Figs. 5 and 6). A possible reason is that the cycle graph has a larger diameter and mixing time than the random graph, resulting in a slower decay of the consensus error.

Figure 6: Results on a directed random graph of size n = 50, asynchronous with delays and link failures ($P_w = 0.5$, $P_f = 0.3$, $\Gamma_{del} = \Gamma_f = 3$, $\Gamma_u = 3$, $\Gamma_s = 17$).

We notice that, fixing the network size, increasing the number of iterations brings us closer to a linear speed-up (see Fig. 7). On the other hand, fixing the number of iterations, increasing the number of nodes beyond a certain point does not help speed up the optimization. Moreover, when link delays and failures are allowed (see Fig. 7b), more iterations are required to achieve network independence.

Figure 7: Error ratio over network size. Shaded areas correspond to one standard deviation of the performance.

5. Conclusions

The main result of this paper is to establish asymptotically network-independent performance for a distributed stochastic optimization method over directed graphs with message losses, delays, and asynchronous updates. Our work raises several open questions.

The most natural question raised by this paper concerns the size of the transients. How long must the nodes wait until the network-independent performance bound is achieved? The answer, of course, will depend on the network, but also on the number of nodes, the degree of asynchrony, and the delays. Understanding how this quantity scales is required before the algorithms presented in this work can be recommended to practitioners.

More generally, it is interesting to ask which problems in distributed optimization can achieve network-independent performance, even asymptotically. For example, the usual bounds for distributed subgradient descent (see, e.g., Nedic et al., 2018) depend on the spectral gap of the underlying network; various worst-case scalings with the number of nodes can be derived, and the final asymptotics are not network-independent. It is not immediately clear whether this is due to the analysis, or a fundamental limitation that will not be overcome.

Acknowledgments

The authors acknowledge support for this project by the AFOSR under grant FA9550-15-1-0394, by the ONR under grant N000014-16-1-224 and MURI N00014-19-1-2571, by the NSF under grants IIS-1914792, DMS-1664644, and CNS-1645681, and by the NIH under grant 1R01GM135930. A preliminary version of the results in Section 2 has been published in the proceedings of the American Control Conference 2018 (Olshevsky et al., 2018).

Appendix A. Proof of Lemma 4

Proof We use mathematical induction. For $k = 0$ we have $x_{ij}^l(0) = 0$, ∀l, and $u_{ij}^x(0) = \phi_i^x(0) = \rho_{ji}^x(0) = 0$. By (6) and the definitions of $u_{ij}^x$ and $x_{ij}^l$ we obtain,

$$\rho_{ji}^x(1) = 0, \qquad u_{ij}^x(1) = \left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(0)\right)\phi_i^x(1), \qquad \sum_{l=1}^{\Gamma_d}x_{ij}^l(1) = \left(\sum_{l=1}^{\Gamma_d}\tau_{ij}^l(0)\right)\phi_i^x(1).$$

Equation (12) is concluded from the first equation above, and (13) results from summing up all three equations above.

Now assume the lemma is true for $k = 0, \ldots, K-1$. We want to show it is true for $k = K$ as well. In the following, LHS and RHS denote the left-hand side and right-hand side of (12) for $k = K$. By (6) we have,

$$\text{LHS} = \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K-l)\left[\phi_i^x(K+1-l) - \rho_{ji}^x(K)\right].$$

Using (11) we obtain,

$$\text{RHS} = \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K-l)\,v_{ij}^x(K-l).$$

Hence, it suffices to show that:

$$\sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K-l)\left[\phi_i^x(K+1-l) - \rho_{ji}^x(K) - v_{ij}^x(K-l)\right] = 0. \tag{50}$$

By part (e) of Assumption 1, at most one of the $\tau_{ij}^l(K-l)$, $l = 1, \ldots, \Gamma_d$, is non-zero. If all are zero, the result follows. Now suppose $\tau_{ij}^l(K-l) = 1$ for some l. Equation (50) becomes,

$$\phi_i^x(K+1-l) - \rho_{ji}^x(K) - v_{ij}^x(K-l) = 0.$$

Plugging in the definition of $v_{ij}^x$, after rearrangement we obtain,

$$\phi_i^x(K-l) - u_{ij}^x(K-l) = \rho_{ji}^x(K). \tag{51}$$

By the induction hypothesis, (12) holds for $k = K - t$, $t = 1, \ldots, l$. Therefore,

$$\rho_{ji}^x(K+1-t) - \rho_{ji}^x(K-t) = x_{ij}^1(K-t).$$

Hence,

$$\rho_{ji}^x(K) = \rho_{ji}^x(K-l) + \sum_{t=1}^{l}\left(\rho_{ji}^x(K+1-t) - \rho_{ji}^x(K-t)\right) = \rho_{ji}^x(K-l) + \sum_{t=1}^{l}x_{ij}^1(K-t) = \rho_{ji}^x(K-l) + \sum_{l'=1}^{l}x_{ij}^{l'}(K-l) \quad \text{(Lemma 3)} = \rho_{ji}^x(K-l) + \sum_{l'=1}^{\Gamma_d}x_{ij}^{l'}(K-l). \quad \text{(Lemma 2)}$$

Moreover, by the induction hypothesis, (13) holds for $k = K - l$; thus,

$$\phi_i^x(K-l) - u_{ij}^x(K-l) = \rho_{ji}^x(K-l) + \sum_{l=1}^{\Gamma_d}x_{ij}^l(K-l).$$

Combining the two relations above, we conclude (51).

To show (13), consider the following equations, which are direct results of the definitions and of (12), which we just showed for $k = K$:

$$u_{ij}^x(K+1) = \left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K)\right)v_{ij}^x(K), \qquad \rho_{ji}^x(K+1) = \rho_{ji}^x(K) + x_{ij}^1(K), \qquad \sum_{l=1}^{\Gamma_d}x_{ij}^l(K+1) = \sum_{l=2}^{\Gamma_d}x_{ij}^l(K) + \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K)\,v_{ij}^x(K).$$

Summing up both sides of the equations above we have,

$$\text{LHS} = u_{ij}^x(K+1) + \rho_{ji}^x(K+1) + \sum_{l=1}^{\Gamma_d}x_{ij}^l(K+1),$$
$$\text{RHS} = \sum_{l=1}^{\Gamma_d}x_{ij}^l(K) + \rho_{ji}^x(K) + v_{ij}^x(K) = \sum_{l=1}^{\Gamma_d}x_{ij}^l(K) + \rho_{ji}^x(K) + u_{ij}^x(K) - \phi_i^x(K) + \phi_i^x(K+1) = \phi_i^x(K+1).$$

The last equality holds because of the induction hypothesis, (13) for $k = K-1$, hence completing the proof. ■

Footnotes

1.

It goes without saying that no analysis of distributed optimization can be wholly independent of the network or the number of nodes. Indeed, in a network of n nodes, the diameter can be as large as n − 1, which means that, in the worst case, no bounds on global performance can be obtained during the first n − 1 steps of any algorithm.

2.

Note the difference between the indexing in $\tau_{ij}^l$ and $\rho_{ji}^x$, which are both defined for link $(i,j) \in \mathcal{E}$.

References

  1. Agarwal Alekh and Duchi John C. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  2. Akbari Mohammad, Gharesifard Bahman, and Linder Tamas. Distributed online convex optimization on time-varying directed graphs. IEEE Transactions on Control of Network Systems, 4(3):417–428, 2017.
  3. Alpcan Tansu and Bauckhage Christian. A distributed machine learning framework. In 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pages 2546–2551. IEEE, 2009.
  4. Assran Mahmoud and Rabbat Michael. Asynchronous subgradient-push. arXiv preprint arXiv:1803.08950, 2018.
  5. Bénézit Florence, Blondel Vincent, Thiran Patrick, Tsitsiklis John, and Vetterli Martin. Weighted gossip: Distributed averaging using non-doubly stochastic matrices. In 2010 IEEE International Symposium on Information Theory (ISIT), pages 1753–1757. IEEE, 2010.
  6. Brisimi Theodora S, Chen Ruidi, Mela Theofanie, Olshevsky Alex, Paschalidis Ioannis Ch, and Shi Wei. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112:59–67, 2018.
  7. Chang Tsung-Hui, Hong Mingyi, Liao Wei-Cheng, and Wang Xiangfeng. Asynchronous distributed ADMM for large-scale optimization, Part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing, 64(12):3118–3130, 2016a.
  8. Chang Tsung-Hui, Liao Wei-Cheng, Hong Mingyi, and Wang Xiangfeng. Asynchronous distributed ADMM for large-scale optimization, Part II: Linear convergence analysis and numerical performance. IEEE Transactions on Signal Processing, 64(12):3131–3144, 2016b.
  9. Chen Jianshu and Sayed Ali H. On the learning behavior of adaptive networks, Part II: Performance analysis. IEEE Transactions on Information Theory, 61(6):3518–3548, 2015.
  10. Di Lorenzo Paolo and Scutari Gesualdo. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120–136, 2016.
  11. Dominguez-Garcia Alejandro D and Hadjicostis Christoforos N. Distributed matrix scaling and application to average consensus in directed graphs. IEEE Transactions on Automatic Control, 58(3):667–681, 2013.
  12. Domínguez-García Alejandro D and Hadjicostis Christoforos N. Convergence rate of a distributed algorithm for matrix scaling to doubly stochastic form. In 53rd IEEE Conference on Decision and Control, pages 3240–3245. IEEE, 2014.
  13. Feyzmahdavian Hamid Reza, Aytekin Arda, and Johansson Mikael. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 61(12):3740–3754, 2016.
  14. Gharesifard Bahman and Cortes Jorge. Distributed strategies for generating weight-balanced and doubly stochastic digraphs. European Journal of Control, 18(6):539–557, 2012.
  15. Hadjicostis Christoforos N, Vaidya Nitin H, and Domínguez-García Alejandro D. Robust distributed average consensus via exchange of running sums. IEEE Transactions on Automatic Control, 61(6):1492–1507, 2016.
  16. Hadjicostis Christoforos N, Dominguez-Garcia Alejandro D, and Charalambous Themistokis. Distributed averaging and balancing in network systems: with applications to coordination and control. Foundations and Trends in Systems and Control, 5(2–3):99–292, 2018.
  17. He Shibo, Shin Dong-Hoon, Zhang Junshan, Chen Jiming, and Sun Youxian. Full-view area coverage in camera sensor networks: Dimension reduction and near-optimal solutions. IEEE Transactions on Vehicular Technology, 65(9):7448–7461, 2015.
  18. Hong Mingyi. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM approach. IEEE Transactions on Control of Network Systems, 2017.
  19. Kempe David, Dobra Alin, and Gehrke Johannes. Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science, pages 482–491. IEEE, 2003.
  20. Koloskova Anastasiia, Stich Sebastian Urban, and Jaggi Martin. Decentralized stochastic optimization and gossip algorithms with compressed communication. Proceedings of Machine Learning Research, 97, 2019.
  21. Lan Guanghui, Lee Soomin, and Zhou Yi. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, pages 1–48, 2018.
  22. Li Mu, Andersen David G, Park Jun Woo, Smola Alexander J, Ahmed Amr, Josifovski Vanja, Long James, Shekita Eugene J, and Su Bor-Yiing. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014.
  23. Lian Xiangru, Huang Yijun, Li Yuncheng, and Liu Ji. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
  24. Lian Xiangru, Zhang Ce, Zhang Huan, Hsieh Cho-Jui, Zhang Wei, and Liu Ji. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
  25. Lian Xiangru, Zhang Wei, Zhang Ce, and Liu Ji. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning (ICML), pages 3043–3052, 2018.
  26. Lobel Ilan and Ozdaglar Asuman. Distributed subgradient methods for convex optimization over random networks. IEEE Transactions on Automatic Control, 56(6):1291–1306, 2010.
  27. Mansoori Fatemeh and Wei Ermin. Superlinearly convergent asynchronous distributed network Newton method. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2874–2879. IEEE, 2017.
  28. Morral Gemma, Bianchi Pascal, Fort Gersende, and Jakubowicz Jeremie. Distributed stochastic approximation: The price of non-double stochasticity. In Conference Record of the Forty-Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pages 1473–1477. IEEE, 2012.
  29. Morral Gemma, Bianchi Pascal, and Fort Gersende. Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks. IEEE Transactions on Signal Processing, 65(11):2798–2813, 2017.
  30. Nedic Angelia. Asynchronous broadcast-based convex optimization over a network. IEEE Transactions on Automatic Control, 56(6):1337–1351, 2011.
  31. Nedic Angelia and Olshevsky Alex. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
  32. Nedic Angelia and Olshevsky Alex. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control, 61(12):3936–3947, 2016.
  33. Nedic Angelia and Ozdaglar Asuman. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  34. Nedic Angelia, Olshevsky Alex, and Shi Wei. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
  35. Nedic Angelia, Olshevsky Alex, and Rabbat Michael G. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
  36. Nemirovski Arkadi, Juditsky Anatoli, Lan Guanghui, and Shapiro Alexander. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  37. Olshevsky Alex. Linear time average consensus and distributed optimization on fixed graphs. SIAM Journal on Control and Optimization, 55(6):3990–4014, 2017.
  38. Olshevsky Alex, Paschalidis Ioannis Ch, and Spiridonoff Artin. Fully asynchronous push-sum with growing intercommunication intervals. In American Control Conference, pages 591–596, 2018.
  39. Oreshkin Boris N, Coates Mark J, and Rabbat Michael G. Optimization and analysis of distributed averaging with short node memory. IEEE Transactions on Signal Processing, 58(5):2850–2865, 2010.
  40. Peng Zhouhua, Wang Jun, and Wang Dan. Distributed maneuvering of autonomous surface vehicles based on neurodynamic optimization and fuzzy approximation. IEEE Transactions on Control Systems Technology, 26(3):1083–1090, 2017.
  41. Pu Shi and Garcia Alfredo. A flocking-based approach for distributed stochastic optimization. Operations Research, 66(1):267–281, 2017.
  42. Pu Shi and Nedic Angelia. A distributed stochastic gradient tracking method. In 2018 IEEE Conference on Decision and Control (CDC), pages 963–968. IEEE, 2018.
  43. Qu Guannan and Li Na. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 2017.
  44. Qu Guannan and Li Na. Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control, 2019.
  45. Rakhlin Alexander, Shamir Ohad, and Sridharan Karthik. Making gradient descent optimal for strongly convex stochastic optimization. In 29th International Conference on Machine Learning (ICML), pages 1571–1578, 2012.
  46. Ram S Sundhar, Nedic Angelia, and Veeravalli Venugopal V. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516–545, 2010.
  47. Recht Benjamin, Re Christopher, Wright Stephen, and Niu Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
  48. Rennie Jason DM and Srebro Nathan. Loss functions for preference levels: Regression with discrete ordered labels. In IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pages 180–186. Kluwer, Norwell, MA, 2005.
  49. Scaman Kevin, Bach Francis, Bubeck Sebastien, Lee Yin Tat, and Massoulié Laurent. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In 34th International Conference on Machine Learning (ICML), Volume 70, pages 3027–3036. JMLR.org, 2017.
  50. Shi Wei, Ling Qing, Wu Gang, and Yin Wotao. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
  51. Sirb Benjamin and Ye Xiaojing. Consensus optimization with delayed and stochastic gradients on decentralized networks. In 2016 IEEE International Conference on Big Data (Big Data), pages 76–85. IEEE, 2016.
  52. Srivastava Kunal and Nedic Angelia. Distributed asynchronous constrained stochastic optimization. IEEE Journal of Selected Topics in Signal Processing, 5(4):772–790, 2011.
  53. Su Lili and Vaidya Nitin H. Fault-tolerant multi-agent optimization: optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pages 425–434. ACM, 2016a.
  54. Su Lili and Vaidya Nitin H. Non-Bayesian learning in the presence of Byzantine agents. In International Symposium on Distributed Computing, pages 414–427. Springer, 2016b.
  55. Su Lili and Vaidya Nitin H. Reaching approximate Byzantine consensus with multi-hop communication. Information and Computation, 255:352–368, 2017. ISSN 0890-5401. doi: 10.1016/j.ic.2016.12.003. URL http://www.sciencedirect.com/science/article/pii/S0890540116301262.
  56. Sun Ying, Scutari Gesualdo, and Palomar Daniel. Distributed nonconvex multiagent optimization over time-varying networks. In 50th Asilomar Conference on Signals, Systems and Computers, pages 788–794. IEEE, 2016.
  57. Tian Ye, Sun Ying, and Scutari Gesualdo. ASY-SONATA: Achieving linear convergence in distributed asynchronous multiagent optimization. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 543–551. IEEE, 2018.
  58. Tsianos Konstantinos I, Lawlor Sean, and Rabbat Michael G. Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning. In 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1543–1550. IEEE, 2012a.
  59. Tsianos Konstantinos I, Lawlor Sean, and Rabbat Michael G. Push-sum distributed dual averaging for convex optimization. In 2012 51st IEEE Conference on Decision and Control (CDC), pages 5453–5458. IEEE, 2012b.
  60. Tsitsiklis John, Bertsekas Dimitri, and Athans Michael. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.
  61. Wu Tianyu, Yuan Kun, Ling Qing, Yin Wotao, and Sayed Ali H. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293–307, 2018.
  62. Xi Chenguang and Khan Usman A. DEXTRA: A fast algorithm for optimization over directed graphs. IEEE Transactions on Automatic Control, 62(10):4980–4993, 2017a.
  63. Xi Chenguang and Khan Usman A. Distributed subgradient projection algorithm over directed graphs. IEEE Transactions on Automatic Control, 62(8):3986–3992, 2017b.
  64. Xi Chenguang, Xin Ran, and Khan Usman A. ADD-OPT: Accelerated distributed directed optimization. IEEE Transactions on Automatic Control, 63(5):1329–1339, 2018.
  65. Xu Jinming, Zhu Shanying, Soh Yeng Chai, and Xie Lihua. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 2055–2060. IEEE, 2015.
  66. Yuan Kun, Ling Qing, and Yin Wotao. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
