
Robust Asynchronous Stochastic Gradient-Push: Asymptotically Optimal and Network-Independent Performance for Strongly Convex Functions

Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis

Abstract

We consider the standard model of distributed optimization of a sum of functions $F(z) = \sum_{i=1}^{n} f_i(z)$, where node i in a network holds the function $f_i(z)$. We allow for a harsh network model characterized by asynchronous updates, message delays, unpredictable message losses, and directed communication among nodes. In this setting, we analyze a modification of the Gradient-Push method for distributed optimization, assuming that (i) node i is capable of generating gradients of its function $f_i(z)$ corrupted by zero-mean bounded-support additive noise at each step, (ii) F(z) is strongly convex, and (iii) each $f_i(z)$ has Lipschitz gradients. We show that our proposed method asymptotically performs as well as the best bounds on centralized gradient descent that takes steps in the direction of the sum of the noisy gradients of all the functions $f_1(z), \dots, f_n(z)$ at each step.

Keywords: distributed optimization, stochastic gradient descent

1. Introduction

Distributed systems have attracted much attention in recent years due to their many applications such as large scale machine learning (e.g., in the healthcare domain, Brisimi et al., 2018), control (e.g., maneuvering of autonomous vehicles, Peng et al., 2017), sensor networks (e.g., coverage control, He et al., 2015) and advantages over centralized systems, such as scalability and robustness to faults. In a network comprised of multiple agents (e.g., data centers, sensors, vehicles, smart phones, or various IoT devices) engaged in data collection, it is sometimes impractical to collect all the information in one place. Consequently, distributed optimization techniques are currently being explored for potential use in a variety of estimation and learning problems over networks.

This paper considers the separable optimization problem

$$\min_{z\in\mathbb{R}^d} F(z) \triangleq \sum_{i=1}^{n} f_i(z), \tag{1}$$

where the function $f_i:\mathbb{R}^d\to\mathbb{R}$ is held only by agent i in the network. We assume the agents communicate through a directed communication network, with each agent able to send messages to its out-neighbors. The agents seek to collaboratively agree on a minimizer to the global function F(z).

This fairly simple problem formulation is capable of capturing a variety of scenarios in estimation and learning. Informally, z is often taken to parameterize a model, and $f_i(z)$ is a loss function measuring how well z matches the data held by agent i. Agreeing on a minimizer of F(z) means agreeing on a model that best explains all the data throughout the network; the challenge is to do this in a distributed manner, avoiding techniques such as flooding, which requires every node to learn and store all the data throughout the network. For more details, we refer the reader to the recent survey by Nedic et al. (2018).
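To make the formulation concrete, here is a minimal illustrative instance of (1), not taken from the paper: a distributed least-squares problem in which agent i privately holds data $(A_i, b_i)$ and a quadratic local loss.

import numpy as np

# Hypothetical instance of problem (1): distributed least squares.
# Agent i holds private data (A_i, b_i) and the local loss
#   f_i(z) = 0.5 * ||A_i z - b_i||^2,
# so F(z) = sum_i f_i(z); each f_i is known only to agent i.
rng = np.random.default_rng(0)
n, d = 5, 3
A = [rng.standard_normal((10, d)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]

def f(i, z):          # local objective held only by agent i
    r = A[i] @ z - b[i]
    return 0.5 * r @ r

def grad_f(i, z):     # local gradient, the quantity agent i can query
    return A[i].T @ (A[i] @ z - b[i])

# Centralized minimizer of F, shown for reference only; no single agent
# can form it, since the data are scattered across the network.
z_star = np.linalg.solve(sum(Ai.T @ Ai for Ai in A),
                         sum(Ai.T @ bi for Ai, bi in zip(A, b)))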

In this work, we will consider a fairly harsh network environment, including message losses, delays, asynchronous updates, and directed communication. The function F(z) will be assumed to be strongly convex with the individual functions fi(z) having a Lipschitz continuous gradient. We will also assume that, at every time step, node i can obtain a noisy gradient of its function fi(z). Our goal will be to investigate to what extent distributed methods can remain competitive with their centralized counterparts in spite of these obstacles.

1.1. Literature Review

Research on models of distributed optimization dates back to the 1980s, see Tsitsiklis et al. (1986). The separable model of (1) was first formally analyzed in Nedic and Ozdaglar (2009), where performance guarantees on a fixed-stepsize subgradient method were obtained. The literature on the subject has exploded since, and we review here only the papers closely related to our work. We begin by discussing works that have focused on the effect of harsh network conditions.

A number of recent papers have studied asynchronicity in the context of distributed optimization. It has been noted that asynchronous algorithms are often preferred to synchronous ones, due to the difficulty of perfectly coordinating all the agents in the network, e.g., due to clock drift. Papers by Recht et al. (2011); Li et al. (2014); Agarwal and Duchi (2011); Lian et al. (2015) and Feyzmahdavian et al. (2016) study asynchronous parallel optimization methods in which different processors have access to a shared memory or parameter server. Recht et al. (2011) present a scheme called HOGWILD!, in which processors have access to the same shared memory with the possibility of overwriting each other’s work. Li et al. (2014) propose a parameter server framework for distributed machine learning. Agarwal and Duchi (2011) analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information due to asynchrony. Lian et al. (2015) improve on the earlier work by Agarwal and Duchi (2011), and study two asynchronous parallel implementations of Stochastic Gradient (SG) for nonconvex optimization, establishing an $O_k(1/\sqrt{k})$ convergence rate for both algorithms. Feyzmahdavian et al. (2016) propose an asynchronous mini-batch algorithm that eliminates idle waiting and allows workers to run at their maximal update rates.

The works mentioned above consider a centralized network topology, i.e., there is a central node (parameter server or shared memory) connected to all the other nodes. On the other hand, in a decentralized setting, nodes communicate with each other over a connected network without depending on a central node (see Figure 1). This setting reduces the communication load on the central node, is not vulnerable to failures of that node, and is more easily scalable.

Figure 1: Different network topologies.

For analysis of how decentralized asynchronous methods perform we refer the reader to Mansoori and Wei (2017); Tsitsiklis et al. (1986); Srivastava and Nedic (2011); Assran and Rabbat (2018); Nedic (2011); Wu et al. (2018) and Tian et al. (2018). We note that of these works only Tian et al. (2018) is able to obtain an algorithm which agrees on a global minimizer of (1) with non-random asynchronicity, under the assumptions of strong convexity, noiseless gradients and possible delays. On the other hand, the papers Nedic (2011) and Wu et al. (2018) obtain convergence in this situation under assumptions of natural randomness in the algorithm: the former assumes randomly failing links while the latter assumes that nodes make updates in random order.

The study of distributed separable optimization over directed graphs was initiated in Tsianos et al. (2012b), where a distributed approach based on dual averaging with convex functions over a fixed graph was proposed and shown to converge at an $O_k(1/\sqrt{k})$ rate. Some numerical results for such methods were reported in Tsianos et al. (2012a). In Nedic and Olshevsky (2015), a method based on plain gradient descent converging at a rate of $O_k((\ln k)/\sqrt{k})$ was proposed over time-varying graphs. This was improved in Nedic and Olshevsky (2016) to $O_k((\ln k)/k)$ for strongly convex functions with noisy gradient samples. More recent works on optimization over directed graphs are Akbari et al. (2017), which considered online convex optimization in this setting, and Assran and Rabbat (2018), which considered combining directed graphs with delays and asynchronicity. The main tool for distributed optimization is the so-called “push-sum” method introduced in Kempe et al. (2003), which is widely used to design communication and optimization schemes over directed graphs. More recent references are Bénézit et al. (2010); Hadjicostis et al. (2016), which provide a more modern and general analysis of this method, and the most comprehensive reference on the subject is the recent monograph by Hadjicostis et al. (2018). We also mention Xi and Khan (2017a); Xi et al. (2018); Nedic et al. (2017), where an approach based on push-sum was explored. A parallel line of work in this setting based on the ADMM model, where updates are allowed to include a local minimization step, was explored in Brisimi et al. (2018); Chang et al. (2016a,b) and Hong (2017).

The reason directed graphs present a problem is because much of distributed optimization relies on the primitive of “multiplication by a doubly stochastic matrix:” given that each node i of a network holds a number $x_i$, the network needs to compute $y = Wx$, with node i ending up with $y_i$, where x = (x_1, …, x_n), y = (y_1, …, y_n) and W is some doubly stochastic matrix with positive spectral gap. This is pretty easy to accomplish over undirected graphs (see Nedic et al., 2018) but not immediate over directed graphs. A parallel line of research focuses on distributed methods for constructing such doubly stochastic matrices over directed graphs – we refer the reader to Dominguez-Garcia and Hadjicostis (2013); Gharesifard and Cortés (2012); Domínguez-García and Hadjicostis (2014). Unfortunately, to the authors’ best knowledge, no explicit and favorable convergence time guarantees are known for this procedure. Another line of work (Xi and Khan, 2017b) takes a similar approach, based on construction of a doubly stochastic matrix with positive spectral gap after the introduction of auxiliary states. Among works with undirected graphs, Scaman et al. (2017) derived the optimal convergence rates for smooth and strongly convex functions and introduced the multi-step dual accelerated (MSDA) algorithm with optimal linear convergence rate in the deterministic case.

Dealing with message losses has always been a challenging problem for multi-agent optimization protocols. Recently, Hadjicostis et al. (2016) resolved this issue rather elegantly for the problem of distributed average computation by having nodes exchange certain running sums. It was shown in Hadjicostis et al. (2016) that the introduction of these running sums is equivalent to a lossless algorithm on a slightly modified graph. We also refer the reader to the follow-up papers by Su and Vaidya (2016b,a, 2017). We will use the same approach in this work to deal with message losses.

In many applications, calculating the exact gradients can be computationally very expensive or impossible (Lan et al., 2018). In one possible scenario, nodes are sensors that collect measurements at every step, which naturally corrupts all the data with noise. Alternatively, communication between agents may insert noise into information transmitted between them. Finally, when $f_i(z)$ measures the fit of a model parameterized by the vector z to the data of agent i, it may be efficient for agent i to randomly select a subset of its data and compute an estimate of the gradient based on only those data points (Alpcan and Bauckhage, 2009). Motivated by these considerations, a literature has arisen studying the effects of stochasticity in the gradients. For example, Srivastava and Nedic (2011) showed convergence of an asynchronous algorithm for constrained distributed stochastic optimization, under the presence of local noisy communication in a random communication network. In Pu and Nedic (2018), two distributed stochastic gradient methods were introduced, and their convergence to a neighborhood of the global minimum (under constant step-size) and to the global minimum (under diminishing step-size) was analyzed. In work by Sirb and Ye (2016), convergence of asynchronous decentralized optimization using delayed stochastic gradients has been shown.

The algorithms we will study here for stochastic gradient descent are based on the standard “consensus + gradient descent” framework: nodes will take steps in the direction of their gradients and then “reconcile” these steps by moving toward an average of their neighbors in the graph. We refer the reader to Nedic et al. (2018) and Yuan et al. (2016) for a more recent and simplified analysis of such methods. It is also possible to take a more modern approach, pioneered in Shi et al. (2015), of using the past history to make updates; such schemes have been shown to achieve superior performance in recent years (see Shi et al., 2015; Sun et al., 2016; Oreshkin et al., 2010; Nedic et al., 2017; Xi and Khan, 2017a; Xi et al., 2018; Qu and Li, 2017; Xu et al., 2015; Qu and Li, 2019; Di Lorenzo and Scutari, 2016); we refer the reader to Pu and Nedic (2018), which took this approach.

One of our main concerns in this paper is to develop decentralized optimization methods which perform as well as their centralized counterparts. Specifically, we will compare the performance of a distributed method for (1) on a network of n nodes with the performance of a centralized method which, at every step, can query all n gradients of the functions $f_1(z), \dots, f_n(z)$. Since the distributed algorithm gets noise-corrupted gradients, so should the centralized method. Thus, the natural approach is to compare the distributed method to centralized gradient descent which moves in the direction of the sum of the gradients of $f_1(z), \dots, f_n(z)$. This method of comparison keeps the “computational power” of the two methods identical.

Traditionally, the bounds derived on distributed methods were considerably worse than those derived for centralized methods. For example, the papers by Nedic and Olshevsky (2015, 2016) had bounds for distributed optimization over directed graphs that were worse than the comparable centralized method (in terms of rate of error decay) by a multiplicative factor that, in the worst case, could be as large as $n^{O(n)}$. This is typical over directed graphs, though better results are possible over undirected graphs. For example, in Olshevsky (2017), in the model of noiseless, undelayed, synchronous communication over an undirected graph, a distributed subgradient method was proposed whose performance, relative to a centralized method with the same computational power, was worse by a multiplicative factor of n.

The breakthrough papers by Chen and Sayed (2015); Pu and Garcia (2017); Morral et al. (2017) were the first to address this gap. These papers studied the model where gradients are corrupted by noise, which we also consider in this paper. Chen and Sayed (2015) examined the mean-squared stability and convergence of distributed strategies with fixed step-size over graphs and showed the same performance level as that of a centralized strategy, in the small step-size regime. In Pu and Garcia (2017) it was shown that, for a certain stochastic differential equation paralleling network gradient descent, the performance of centralized and distributed methods were comparable. In Morral et al. (2017), it was proved, for the first time, that distributed gradient descent with an appropriately chosen step-size asymptotically performs similarly to a centralized method that takes steps in the direction of the sum of the noisy gradients, assuming iterates will remain bounded almost surely. This was the first analysis of a decentralized method for computing the optimal solution with performance bounds matching its centralized counterpart.

Both Pu and Garcia (2017) and Morral et al. (2017) were over fixed, undirected graphs with no message loss or delays or asynchronicity. As shown in the paper by Morral et al. (2012), this turns out to be a natural consequence of the analysis of those methods. Indeed, on a technical level, the advantage of working over undirected graphs is that they allow for easy distributed multiplication by doubly-stochastic matrices; it was shown in Morral et al. (2012) that if this property holds only in expectation – that is, if the network nodes can multiply by random stochastic matrices that are only doubly stochastic in expectation – distributed gradient descent will not perform comparably to its centralized counterpart.

In parallel to this work, and in order to reduce communication bottlenecks, Koloskova et al. (2019) propose a decentralized SGD with communication compression that can achieve the centralized baseline convergence rate, up to a constant factor. When the objective functions are smooth but not necessarily convex, Lian et al. (2017) show that Decentralized Parallel Stochastic Gradient Descent (D-PSGD) can asymptotically perform comparably to Centralized PSGD in total computational complexity. However, they argue that D-PSGD requires much less communication cost on the busiest node and hence can outperform C-PSGD in certain communication regimes. Again, both Koloskova et al. (2019) and Lian et al. (2017) are over fixed undirected graphs, without delays, link failures or asynchronicity. The follow-up work by Lian et al. (2018) extends D-PSGD to the asynchronous case.

1.2. Our Contribution

We propose an algorithm which we call Robust Asynchronous Stochastic Gradient Push (RASGP) for distributed optimization from noisy gradient samples over directed graphs with message losses, delays, and asynchronous updates. We will assume gradients are corrupted with additive noise represented by independent random variables, with bounded support, and with finite variance at node i denoted by $\sigma_i^2$. Our main result is that the RASGP performs as well as the best bounds on centralized gradient descent that moves in the direction of the sum of noisy gradients of $f_1(z), \dots, f_n(z)$. Our results also hold if the underlying graphs are time-varying as long as there are no message losses. We give a brief technical overview of this result next.

We will assume that each function $f_i(z)$ is $\mu_i$-strongly convex with $L_i$-Lipschitz gradient, where $\sum_i \mu_i > 0$ and $L_i > 0$, i = 1, …, n. The RASGP will have every node maintain an estimate of the optimal solution which will be updated from iteration to iteration; we will use $z_i(k)$ to denote the value of this estimate held by node i at iteration k. We will show that, for each node i = 1, …, n,

$$\mathbb{E}\left[\|z_i(k) - z^*\|_2^2\right] = \frac{\Gamma_u\sum_{i=1}^{n}\sigma_i^2}{k\left(\sum_{i=1}^{n}\mu_i\right)^2} + O_k\!\left(\frac{1}{k^{1.5}}\right), \tag{2}$$

where z* ≔ arg min F(z) and $\Gamma_u$ is the degree of asynchronicity, defined as the maximum number of iterations between two consecutive updates of any agent. The leading term matches the best bounds for (centralized) gradient descent that takes steps in the direction of the sum of the noisy gradients of $f_1(z), \dots, f_n(z)$ every $\Gamma_u$ iterations (see Nemirovski et al., 2009; Rakhlin et al., 2012). Asymptotically, the performance of the RASGP is network independent: indeed, the only effect of the network or the number of nodes is on the constant factor within the $O_k(1/k^{1.5})$ term above. The asymptotic scaling as $O_k(1/k)$ is optimal in this setting (Rakhlin et al., 2012).

Consider the case when all the functions are identical, i.e., f_1(z) = ⋯ = f_n(z), and $\Gamma_u$ = 1. In this case, letting $\mu = \mu_i$ and $\sigma = \sigma_i$, we have that for each i = 1, …, n, (2) reduces to

$$\mathbb{E}\left[\|z_i(k) - z^*\|_2^2\right] = \frac{\sigma^2/n}{k\mu^2} + O_k\!\left(\frac{1}{k^{1.5}}\right).$$

In other words, asymptotically we get the variance reduction of a centralized method that simply averages the n noisy gradients at each step.
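This variance-reduction factor is just the familiar effect of averaging n independent noisy quantities; a quick numeric sanity check of that arithmetic (illustrative, not from the paper) follows.

import numpy as np

# Averaging n independent zero-mean noise vectors of variance sigma^2
# reduces the variance to sigma^2 / n -- the factor appearing above.
rng = np.random.default_rng(1)
n, sigma, trials = 10, 2.0, 200_000
noise = rng.normal(0.0, sigma, size=(trials, n))
avg = noise.mean(axis=1)
print(avg.var())          # ~ sigma^2 / n = 0.4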

The implication of this result is that one can get the benefit of having n independent processors computing noisy gradients in spite of all the usual problems associated with communications over a network (i.e., message losses, latency, asynchronous updates, one-way communication). Of course, the caveat is that one must wait sufficiently long for the asymptotic decay to “kick in,” i.e., for the second term on the right-hand side of (2) to become negligible compared to the first. We leave the analysis of the size of this transient period to future work and note here that it will depend on the network and the number of nodes.

The RASGP is a variation on the usual distributed gradient descent where nodes mix consensus steps with steps in the direction of their own gradient, combined with a new step-size trick to deal with asynchrony. It is presented as Algorithm 3 in Section 3. For a formal statement of the results presented above, we refer the reader to Theorem 15 in the body of the paper.

We briefly mention two caveats. The first is that implementation of the RASGP requires each node to use the quantity $\frac{1}{n}\sum_{i=1}^{n}\mu_i$ in setting its local stepsize. This is not a problem in the setting when all functions are the same but, otherwise, $\frac{1}{n}\sum_{i=1}^{n}\mu_i$ is a global quantity not immediately available to each node. Assuming that node i knows $\mu_i$, one possibility is to use average consensus to compute this quantity in a distributed manner before running the RASGP (for example, using the algorithm described in Section 2 of this paper). The second caveat is that, like all algorithms based on the push-sum method, the RASGP requires each node to know its out-degree in the communication graph.

1.3. Organization of This Paper

We conclude this Introduction with Section 1.4, which describes the basic notation we will use throughout the remainder of the paper. Section 2 does not deal directly with the distributed optimization problem we have discussed, but rather introduces the problem of computing the average in the fairly harsh network setting we will consider in this paper. This is an intermediate problem we need to analyze on the way to our main result. Section 3 provides the RASGP algorithm for distributed optimization, and then states and proves our main result, namely the asymptotically network-independent and optimal convergence rate. Results from numerical simulations of our algorithm to illustrate its performance are provided in Section 4, followed by conclusions in Section 5.

1.4. Notations and Definitions

We assume there are n agents $\mathcal{V} = \{1, \dots, n\}$, communicating through a fixed directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{E}$ is the set of directed arcs. We assume $\mathcal{G}$ does not have self-loops and is strongly connected.

For a matrix A, we will use $A_{ij}$ to denote its (i,j)th entry. Similarly, $v_i$ and $[v]_i$ will denote the ith entry of a vector v. A matrix is called stochastic if it is non-negative and the elements of each row sum to one. A matrix is column stochastic if its transpose is stochastic. To a non-negative matrix $A\in\mathbb{R}^{n\times n}$ we associate a directed graph $\mathcal{G}_A$ with vertex set $\mathcal{V}_A = \{1, 2, \dots, n\}$ and edge set $\mathcal{E}_A = \{(i,j) \mid A_{ji} > 0\}$. In general, such a graph might contain self-loops. Intuitively, this graph corresponds to the information flow in the update x(k + 1) = Ax(k); indeed, $(i,j)\in\mathcal{E}_A$ if the jth coordinate of x(k + 1) depends on the ith coordinate of x(k) in this update.

Given a sequence of matrices A(0), A(1), A(2), …, we denote by $A^{k_2:k_1}$, $k_2 \ge k_1$, the product of elements $k_1$ through $k_2$ of the sequence, inclusive, in the following order:

$$A^{k_2:k_1} = A(k_2)A(k_2-1)\cdots A(k_1).$$

Moreover, $A^{k:k} = A(k)$.

Node i is an in-neighbor of node j if there is a directed link from i to j; j is then an out-neighbor of node i. We denote the sets of in-neighbors and out-neighbors of node i by $N_i^-$ and $N_i^+$, respectively. Moreover, we denote the numbers of in-neighbors and out-neighbors of node i, its in-degree and out-degree, by $d_i^-$ and $d_i^+$, respectively.

By $x_{\min}$ and $x_{\max}$ we denote $\min_i x_i$ and $\max_i x_i$, respectively, over all possible indices unless mentioned otherwise. We denote an n × 1 column vector of all ones or all zeros by $\mathbf{1}_n$ and $\mathbf{0}_n$, respectively. We will remove the subscript when the size is clear from the context.

Let $v\in\mathbb{R}^d$ be a vector. We denote by $v^\dagger\in\mathbb{R}^d$ a vector of the same length such that

$$v_i^\dagger = \begin{cases} 1/v_i, & \text{if } v_i \neq 0,\\ 0, & \text{if } v_i = 0.\end{cases}$$

For all the algorithms we describe, we sometimes use the notion of mass to denote the value an agent holds, sends or receives. With that in mind, we can think of a value being sent from one node as a mass being transferred.

We use $\|\cdot\|_p$ to denote the $\ell_p$-norm of a vector. We sometimes drop the subscript when referring to the Euclidean $\ell_2$-norm.

2. Push-Sum with Delays and Link Failures

In this section we introduce the Robust Asynchronous Push-Sum algorithm (RAPS) for distributed average computation and prove its exponential convergence. Convergence results proved for this algorithm will be used later when we turn to distributed optimization. The algorithm relies heavily on ideas from Hadjicostis et al. (2016) to deal with message losses, delays, and asynchrony. The conference version of this paper, Olshevsky et al. (2018), developed RAPS for the delay-free case, and this section may be viewed as an extension of that work.

Pseudocode for the algorithm is given in the box for Algorithm 1. We begin by outlining the operation of the algorithm. Our goal in this section is to compute the average of vectors, one held by each node in the network, in a distributed manner. However, since the RAPS algorithm acts separately in each component, we may, without loss of generality, assume that we want to average scalars rather than vectors. The scalar held by node i will be denoted by xi(0).

Without loss of generality, we define an iteration by discretizing time into time slots indexed by k = 0, 1, 2, …. We assume that during each time slot every agent makes at most one update and processes messages sent in previous time slots.

In the setting of no message losses, no delays, no asynchrony, and a fixed, regular, undirected communication graph, the RAPS can be shown to be equivalent to the much simpler iteration

x(t+1)=Wx(t),

where W is an irreducible, doubly stochastic matrix with positive diagonal; standard Markov chain theory implies that $x_i(t) \to \frac{1}{n}\sum_{i=1}^{n} x_i(0)$ in this setting. RAPS does essentially the same linear update, but with a considerable amount of modifications. In particular, we use the central idea of the classic push-sum method (Kempe et al., 2003) to deal with directed communication, which suggests having a separate update equation for the y-variables that informs us how we should rescale the x-variables; as well as the central idea of Hadjicostis et al. (2018), which is to repeatedly broadcast sums of previous messages to provide robustness against message loss. While the algorithm in Hadjicostis et al. (2018) handles message losses in a synchronous setting, RAPS can handle delays as well as asynchronicity.
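As a point of reference, a small numeric illustration of this baseline iteration follows; the Metropolis-weight construction of W is one standard choice, not prescribed by the paper.

import numpy as np

# x(t+1) = W x(t) on a 4-cycle. Metropolis weights w_ij = 1/(1 + max(d_i, d_j))
# yield a doubly stochastic, irreducible W with positive diagonal.
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
deg = [2] * n
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))       # fill diagonal so every row sums to 1

x = np.array([4.0, 0.0, 1.0, 3.0])      # initial values; average = 2.0
for _ in range(100):
    x = W @ x
print(x)                                 # every entry approaches 2.0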

Before getting into details, let us provide a simple intuition behind the RAPS algorithm. Each agent i holds a value (mass) $x_i$ and $y_i$. At the beginning of every iteration, i wants to split its mass between itself and its out-neighbors $j\in N_i^+$. However, to handle message losses, it sends the accumulated x and y mass from the start of the algorithm (running sums, which we denote by $\phi_i^x$ and $\phi_i^y$) that i wants to transfer to each of its neighbors. Therefore, when a neighbor j receives a new accumulated mass from i, it stores it in $\rho_{ji}^*$ and, by subtracting the previous accumulated mass $\rho_{ji}$ it had received from i, j obtains all the mass that i has been trying to send since its last successful communication. Then, j updates its x and y mass by adding the newly received masses, and finally updates its estimate of the average to x/y. To handle delays and asynchronicity, timestamps $\kappa_i$ are attached to messages outgoing from i.

The pseudocode for the algorithm may appear complicated at first glance; this is because of the considerable complexity required to deal with directed communications, message losses, delays, and asynchrony.

We next describe the algorithm in more detail. First, in the course of executing the algorithm, every agent i maintains scalar variables $x_i$, $y_i$, $z_i$, $\phi_i^x$, $\phi_i^y$, $\kappa_i$, and $\rho_{ij}^x$, $\rho_{ij}^y$, $\rho_{ij}^{x*}$, $\rho_{ij}^{y*}$, $\kappa_{ij}$ for $(j,i)\in\mathcal{E}$. The variables $x_i$ and $y_i$ have the same evolution; however, $y_i$ is initialized to 1. Therefore, to save space in describing and analyzing the algorithm, we will use the symbol θ when a statement holds for both x and y. Similarly, when a statement is the same for both variables x and y, we will remove the superscripts x or y. For example, the initialization $\rho_{ji}(0) = 0$ in the beginning of the algorithm means both $\rho_{ji}^x(0) = 0$ and $\rho_{ji}^y(0) = 0$.

We briefly mention the intuitive meaning of the various variables. The number $z_i$ represents node i’s estimate of the initial average. The counter $\phi_i^\theta(k)$ is the total θ-value sent by i to each of its neighbors from time 0 to k − 1. Similarly, $\rho_{ij}^\theta(k)$ is the total θ-value that i has received from j up to time k − 1. The integer $\kappa_i$ is a timestamp that i attaches to its messages, and the number $\kappa_{ij}$ tracks the latest timestamp i has received from j.

To obtain an intuition for how the algorithm uses the counters $\phi_i^\theta(k)$ and $\rho_{ij}^\theta(k)$, note that, in line 15 of the algorithm, node i effectively figures out the last θ-value sent to it by each of its in-neighbors j by looking at the increment to $\rho_{ij}^\theta$. This might seem needlessly involved, but the underlying reason is that this approach introduces robustness to message losses.

Algorithm 1.

Robust Asynchronous Push-Sum (RAPS)

  1: Initialize the algorithm with y(0) = 1, $\phi_i(0) = 0$, ∀i ∈ {1, …, n} and $\rho_{ij}(0) = 0$, $\rho_{ij}^*(0) = 0$, $\kappa_{ij}(0) = 0$, $(j,i)\in\mathcal{E}$.
  2: At every iteration k = 0, 1, 2, …, for every node i:
  3: if node i wakes up then
  4:   $\kappa_i \leftarrow k$;
  5:   $\phi_i^x \leftarrow \phi_i^x + \frac{x_i}{d_i^+ + 1}$, $\phi_i^y \leftarrow \phi_i^y + \frac{y_i}{d_i^+ + 1}$;
  6:   $x_i \leftarrow \frac{x_i}{d_i^+ + 1}$, $y_i \leftarrow \frac{y_i}{d_i^+ + 1}$;
  7:   Node i broadcasts $(\phi_i^x, \phi_i^y, \kappa_i)$ to its out-neighbors in $N_i^+$.
  8:   Processing the received messages
  9:   for $(\phi_j^x, \phi_j^y, \kappa_j)$ in the inbox do
 10:     if $\kappa_j > \kappa_{ij}$ then
 11:       $\rho_{ij}^{x*} \leftarrow \phi_j^x$, $\rho_{ij}^{y*} \leftarrow \phi_j^y$;
 12:       $\kappa_{ij} \leftarrow \kappa_j$;
 13:     end if
 14:   end for
 15:   $x_i \leftarrow x_i + \sum_{j\in N_i^-}\left(\rho_{ij}^{x*} - \rho_{ij}^x\right)$, $y_i \leftarrow y_i + \sum_{j\in N_i^-}\left(\rho_{ij}^{y*} - \rho_{ij}^y\right)$;
 16:   $\rho_{ij}^x \leftarrow \rho_{ij}^{x*}$, $\rho_{ij}^y \leftarrow \rho_{ij}^{y*}$, $\forall j\in N_i^-$;
 17:   $z_i \leftarrow \frac{x_i}{y_i}$;
 18: end if
 19: Other variables remain unchanged.

We next describe in words what the pseudocode above does. At every iteration k, if agent i wakes up, it performs the following actions. First, it divides its values $x_i$, $y_i$ into $d_i^+ + 1$ parts and broadcasts these to its out-neighbors; actually, what it broadcasts are the accumulated running sums $\phi_i^x$ and $\phi_i^y$. Following Kempe et al. (2003), this is sometimes called the “push step.”

Then, node i moves on to process the messages in its inbox in the following way. If agent i has received a message from node j that is newer than the last one it received before, it will store that message in $\rho_{ij}^*$ and discard the older messages. Next, i updates its x and y variables by adding the difference of $\rho_{ij}^*$ with the older value $\rho_{ij}$, for all in-neighbors j. As mentioned above, this difference is equal to the newly received mass. Next, $\rho_{ij}^*$ overwrites $\rho_{ij}$ in the penultimate step. The last step of the algorithm sets $z_i$ to be the rescaled version of $x_i$: $z_i = x_i/y_i$.
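To make the mechanics concrete, the following is a minimal Python simulation sketch of Algorithm 1 under an assumed probabilistic model of wake-ups, message losses, and delays; all numeric parameters are our illustrative choices, not part of the algorithm.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
n = 4
out_nb = {i: [(i + 1) % n] for i in range(n)}             # directed cycle: strongly connected
in_nb = {j: [i for i in range(n) if j in out_nb[i]] for j in range(n)}

x = np.array([4.0, -1.0, 2.5, 0.5]); target = x.mean()
y = np.ones(n)
z = x.copy()
phi_x, phi_y = np.zeros(n), np.zeros(n)
rho_x = defaultdict(float);  rho_y = defaultdict(float)   # rho_{ij}, keyed by (i, src)
rho_sx = defaultdict(float); rho_sy = defaultdict(float)  # rho*_{ij}
kappa_ij = defaultdict(lambda: -1)
pending = []                      # in-flight messages: (arrival, dest, src, phx, phy, kap)

for k in range(4000):
    ready, still = defaultdict(list), []
    for m in pending:             # messages whose effective delay has elapsed
        (ready[m[1]] if m[0] <= k else still).append(m)
    pending = still
    for i in range(n):
        if rng.random() < 0.4:                            # node i sleeps this slot;
            pending += ready[i]                           # unread mail stays in flight
            continue
        s = len(out_nb[i]) + 1                            # d_i^+ + 1
        phi_x[i] += x[i] / s; phi_y[i] += y[i] / s        # line 5: running sums
        x[i] /= s;            y[i] /= s                   # line 6: keep own share
        for j in out_nb[i]:                               # line 7: broadcast (may be lost)
            if rng.random() > 0.25:
                pending.append((k + int(rng.integers(1, 4)), j, i, phi_x[i], phi_y[i], k))
        for (_, _, src, phx, phy, kap) in ready[i]:       # lines 9-14: newest message wins
            if kap > kappa_ij[(i, src)]:
                rho_sx[(i, src)], rho_sy[(i, src)] = phx, phy
                kappa_ij[(i, src)] = kap
        for src in in_nb[i]:                              # lines 15-16: absorb new mass
            x[i] += rho_sx[(i, src)] - rho_x[(i, src)]
            y[i] += rho_sy[(i, src)] - rho_y[(i, src)]
            rho_x[(i, src)], rho_y[(i, src)] = rho_sx[(i, src)], rho_sy[(i, src)]
        z[i] = x[i] / y[i]                                # line 17: rescaled estimate

print(np.abs(z - target).max())   # -> ~0: every node recovers the initial average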

In the remainder of this section, we provide an analysis of the RAPS algorithm, ultimately showing that it converges geometrically to the average in the presence of message losses, asynchronous updates, delays, and directed communication. Our first step is to formulate the RAPS algorithm in terms of a linear update (i.e., a matrix multiplication), which we do in the next subsection.

2.1. Linear Formulation

Next we show that, after introducing some new auxiliary variables, Algorithm 1 can be written in terms of a classical push-sum algorithm (Kempe et al., 2003) on an augmented graph. Since the y-variables have the same evolution as the x-variables, here we only analyze the x-variables.

In our analysis, we will associate with each message an effective delay. If a message is sent at time $k_1$ and is ready to be processed at time $k_2$, then $k_2 - k_1 \ge 1$ is the effective delay experienced by that message. Those messages that are discarded will not have an effective delay associated with them and are considered as lost.

Next, we will state our assumptions on connectivity, asynchronicity, and message loss.

Assumption 1 Suppose:

  (a) Graph $\mathcal{G}$ is strongly connected and does not have self-loops.

  (b) The delays on each link are bounded above by some $\Gamma_{del} \ge 1$.

  (c) Every agent wakes up and performs updates at least once every $\Gamma_u \ge 1$ iterations.

  (d) Each link fails at most $\Gamma_f \ge 0$ consecutive times.

  (e) Messages arrive in the order of the times they were sent. In other words, if messages are sent from node i to j at times $k_1$ and $k_2$ with (effective) delays $d_1$ and $d_2$, respectively, and $k_1 < k_2$, then we have $k_1 + d_1 < k_2 + d_2$.

One consequence of Assumption 1 is that the effective delays associated with each message that gets through are bounded above by $\Gamma_d \triangleq \Gamma_{del} + \Gamma_u - 1$. Another consequence is that, for each $(i,j)\in\mathcal{E}$, j receives a message from i successfully at least once every $\Gamma_s$ iterations, where

$$\Gamma_s \triangleq \Gamma_u(\Gamma_f + 1) + \Gamma_d - 2. \tag{3}$$
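As a concrete illustration of these constants (the numbers are ours): if $\Gamma_{del} = 3$, $\Gamma_u = 2$ and $\Gamma_f = 1$, then

$$\Gamma_d = \Gamma_{del} + \Gamma_u - 1 = 4, \qquad \Gamma_s = \Gamma_u(\Gamma_f + 1) + \Gamma_d - 2 = 6,$$

i.e., every message that is eventually processed experiences an effective delay of at most 4 slots, and each link delivers a message successfully at least once every 6 iterations.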

Part (e) of Assumption 1 can be assumed without loss of generality. Indeed, observe that outdated messages automatically get discarded in Line 10 of our algorithm. For simplicity, it is convenient to think of those messages as lost. Thus, if this assumption fails in practice, the algorithm will perform exactly as if it had held, due to Line 10. Making this an assumption, rather than a proposition, lets us slightly simplify some of the arguments and avoid some redundancy throughout this paper.

Let us introduce the following indicator variables: $\tau_i(k)$ for i ∈ {1, …, n}, which equals 1 if node i wakes up at time k, and equals 0 otherwise. Similarly, $\tau_{ij}^l(k)$, for $(i,j)\in\mathcal{E}$, 1 ≤ l ≤ $\Gamma_d$, which is 1 if $\tau_i(k) = 1$ and the message sent from node i to j at time k will arrive after experiencing an effective delay of l. Note that if node i wakes up at time k but the message it sends to j is lost, then $\tau_{ij}^l(k)$ will be zero for all l.

We can rewrite the RAPS algorithm with the help of these indicator variables. Let us adopt the notation that xi(k) refers to xi at the beginning of round k of the algorithm (i.e., before node i has a chance to go through the list of steps outlined in the algorithm box). We will use the same convention with all of the other variables, e.g., yi(k), zi(k), etc. If node i does not wake up at round k, then of course xi(k + 1) = xi(k).

Now observe that we can write

$$\phi_i^x(k+1) - \phi_i^x(k) = \tau_i(k)\,\frac{x_i(k)}{d_i^+ + 1}. \tag{4}$$

Likewise, we have

$$x_i(k+1) = x_i(k)\left(1 - \tau_i(k) + \frac{\tau_i(k)}{d_i^+ + 1}\right) + \sum_{j\in N_i^-}\left(\rho_{ij}^x(k+1) - \rho_{ij}^x(k)\right), \tag{5}$$

which can be shown by considering each case ($\tau_i(k)$ = 1 or 0); note that we have used the fact that, in the event that node i wakes up at time k, the variable $\rho_{ij}^x(k+1)$ equals the variable $\rho_{ij}^{x*}$ during the execution of Line 16 of the algorithm at time k.

Finally, we have that, for all $(i,j)\in\mathcal{E}$, the flows $\rho_{ji}^x$ are updated as follows:

$$\rho_{ji}^x(k+1) = \rho_{ji}^x(k) + \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(k-l)\left(\phi_i^x(k+1-l) - \rho_{ji}^x(k)\right), \tag{6}$$

where we make use of the fact that the sum contains only a single nonzero term, since the messages arrive monotonically. To parse the indices in this equation, note that node i actually broadcasts $\phi_i^x(k+1-l)$ in our notation at iteration k − l; by our definitions, $\phi_i^x(k-l)$ is the value of $\phi_i^x$ at the beginning of that iteration. To simplify these relations, we introduce the auxiliary variables $u_{ij}^x$ for all $(i,j)\in\mathcal{E}$, defined through the following recurrence relation:

$$u_{ij}^x(k+1) \triangleq \left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(k)\right)\left(u_{ij}^x(k) + \phi_i^x(k+1) - \phi_i^x(k)\right), \tag{7}$$

and initialized as $u_{ij}^x(0) \triangleq 0$. Intuitively, the variables $u_{ij}^x$ represent the “excess mass” of $x_i$ that is yet to reach node j. Indeed, this quantity resets to zero whenever a message is sent that arrives at some point in the future, and otherwise is incremented by adding the broadcasted mass that is lost. Note that node i never knows $u_{ij}^x(k)$, since it has no idea which messages are lost and which are not; nevertheless, for purposes of analysis, nothing prevents us from considering these variables.

Let us also define the related quantity

$$v_{ij}^x(k) \triangleq u_{ij}^x(k) + \phi_i^x(k+1) - \phi_i^x(k), \quad \text{for } k \ge 0,$$

and $v_{ij}^x(k) \triangleq 0$ for k < 0. Intuitively, this quantity may be thought of as a forward-looking estimate of the mass that will arrive at node j if the message sent from node i at time k gets through; correspondingly, it includes not only the previously unsent mass, but also the extra mass that will be added at the current iteration.

The key variables for the analysis of our method are the variables we will denote by $x_{ij}^l(k)$. Intuitively, every time a message is sent but gets lost, we imagine that it has instead arrived at a “virtual node” which holds that mass; once the next message gets through, we imagine that the virtual node has forwarded that mass to its intended destination. This idea originates from Hadjicostis et al. (2016). Because of the delays, however, we need to introduce $\Gamma_d$ virtual nodes for each link. If a message is sent from i and arrives at j with effective delay l, we will instead imagine it is received by the virtual node $b_{ij}^l$, then sent to $b_{ij}^{l-1}$ at the next time step, and so forth until it reaches $b_{ij}^1$, and is then forwarded to its destination. These virtual nodes are defined formally later.

Putting that intuition aside, we formally define the variables $x_{ij}^l(k)$ via the following set of recurrence relations:

$$x_{ij}^l(k+1) \triangleq \tau_{ij}^l(k)\, v_{ij}^x(k), \qquad l = \Gamma_d, \tag{8}$$
$$x_{ij}^l(k+1) \triangleq \tau_{ij}^l(k)\, v_{ij}^x(k) + x_{ij}^{l+1}(k), \qquad 1 \le l < \Gamma_d, \tag{9}$$

and $x_{ij}^l(k) \triangleq 0$ when k ≤ 0 and l = 1, …, $\Gamma_d$. To parse these equations, imagine what happens when a message is sent from i to j with an effective delay of $\Gamma_d$ at time k. The content of this message becomes the value of $x_{ij}^{\Gamma_d}$ according to (8); and, in each subsequent step, influences $x_{ij}^{\Gamma_d-1}, x_{ij}^{\Gamma_d-2}$, and so forth according to (9). Putting (8) and (9) together, we obtain

$$x_{ij}^l(k) = \sum_{t=1}^{\Gamma_d - l + 1}\tau_{ij}^{t+l-1}(k-t)\, v_{ij}^x(k-t), \tag{10}$$

and particularly,

$$x_{ij}^1(k) = \sum_{t=1}^{\Gamma_d}\tau_{ij}^t(k-t)\, v_{ij}^x(k-t). \tag{11}$$

Note that, as is common in many of the equations we will write, only a single term in these sums can be nonzero (this is not obvious at this point; it is a consequence of Lemma 1).
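To make the bookkeeping concrete, here is a small worked trace (ours, for illustration). Suppose $\Gamma_d = 2$ and a message departs from i to j at time k with effective delay 2, i.e., $\tau_{ij}^2(k) = 1$. Then (8) and (9) give

$$x_{ij}^2(k+1) = v_{ij}^x(k), \qquad x_{ij}^1(k+2) = \tau_{ij}^1(k+1)\,v_{ij}^x(k+1) + x_{ij}^2(k+1) = v_{ij}^x(k),$$

since $\tau_{ij}^1(k+1) = 0$ by Lemma 1(b) below. The mass thus sits in the virtual node $b_{ij}^2$ for one step, moves to $b_{ij}^1$, and is finally absorbed into $x_j(k+3)$ through the $x_{ij}^1(k+2)$ term of the node update (14) below.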

Before proceeding to the main result of this section, we state the following lemma, whose proof is immediate.

Lemma 1 If $\tau_{ij}^l(k) = 1$, the following statements are satisfied:

  (a) $\tau_{ij}^{l'}(k) = 0$ for $l' \neq l$.

  (b) If l > 0, then $\tau_{ij}^{s}(k+t) = 0$ for t = 1, …, l and s = 0, …, l − t.

  (c) If l < $\Gamma_d$, then $\tau_{ij}^{s}(k-t) = 0$ for t = 1, …, $\Gamma_d$ − l and s = l + t, …, $\Gamma_d$.

Lemma 2 If $\tau_{ij}^l(k) = 1$ then $x_{ij}^{l'}(k) = 0$ for l′ > l.

Proof By Lemma 1(c), $\tau_{ij}^{t+l'-1}(k-t) = 0$ for t ∈ {1, …, $\Gamma_d$ − l′ + 1}. Hence, by (10) we have

$$x_{ij}^{l'}(k) = \sum_{t=1}^{\Gamma_d - l' + 1}\tau_{ij}^{t+l'-1}(k-t)\, v_{ij}^x(k-t) = 0. \;\blacksquare$$

The next lemma is essentially a restatement of the observation that the content of every $x_{ij}^l$ eventually “passes through” $x_{ij}^1$.

Lemma 3 If $\tau_{ij}^l(k-l) = 1$, l ≥ 1, we have

$$\sum_{l'=1}^{l} x_{ij}^{l'}(k-l) = \sum_{t=1}^{l} x_{ij}^{1}(k-t).$$

Proof We will show $x_{ij}^1(k-t) = x_{ij}^{l-t+1}(k-l)$ for t = 1, …, l. For t = l the equality is trivial. Now suppose t < l. By Lemma 1(a) we have $\tau_{ij}^{l-t}(k-l) = 0$. Moreover, by part (b) of the same lemma we have $\tau_{ij}^{s'}(k-l+t') = 0$ for t′ = 1, …, l − t − 1 and s′ = l − t − t′. Hence, $x_{ij}^{l-t-t'+1}(k-l+t') = x_{ij}^{l-t-t'}(k-l+t'+1)$. Combining these equations for t′ = 0, …, l − t − 1, we get $x_{ij}^1(k-t) = x_{ij}^{l-t+1}(k-l)$. ■

The following lemma is the key step of a linear formulation of RAPS.

Lemma 4 For k = 0, 1, … and $(i,j)\in\mathcal{E}$ we have:

$$\rho_{ji}^x(k+1) - \rho_{ji}^x(k) = x_{ij}^1(k), \tag{12}$$
$$u_{ij}^x(k+1) + \rho_{ji}^x(k+1) + \sum_{l=1}^{\Gamma_d} x_{ij}^l(k+1) = \phi_i^x(k+1). \tag{13}$$

Parsing these equations, (12) simply states that the value of $x_{ij}^1(k)$ can be thought of as impacting $\rho_{ji}^x$ at time k; recall that the content of $x_{ij}^1(k)$ is a message that was sent from node i to j at time k − l with an effective delay of l, for some 1 ≤ l ≤ $\Gamma_d$ (cf. Equation 11). On the other hand, (13) may be thought of as a “conservation of mass” equation. All the mass that has been sent out by node i has either: (i) been lost (in which case it is in $u_{ij}^x$), (ii) affected node j (in which case it is in $\rho_{ji}^x$), or (iii) is in the process of reaching node j but delayed (in which case it is in some $x_{ij}^l$).

Although this lemma is arguably obvious, a formal proof is surprisingly lengthy. For this reason, we relegate it to the Appendix.

We next write down a matrix form of our updates. As a first step, define the (n + m′) × 1 column vector $\chi(k) \triangleq [x(k)^T, x^1(k)^T, \dots, x^{\Gamma_d}(k)^T, u^x(k)^T]^T$, where $m' \triangleq (\Gamma_d + 1)m$, $m \triangleq |\mathcal{E}|$, x(k) collects all $x_i(k)$, $x^l(k)$ collects all $x_{ij}^l(k)$ and $u^x(k)$ collects all $u_{ij}^x(k)$. Define ψ(k) by collecting the y-values similarly.

Now, we have all the tools to show the linear evolution of χ(k). By Equations (4), (5) and (12) we have,

$$x_j(k+1) = x_j(k)\left(1 - \tau_j(k) + \frac{\tau_j(k)}{d_j^+ + 1}\right) + \sum_{i\in N_j^-} x_{ij}^1(k). \tag{14}$$

Moreover, by the definitions of $x_{ij}^l$, $v_{ij}^x$ and (4), it follows that

$$x_{ij}^{\Gamma_d}(k+1) = \tau_{ij}^{\Gamma_d}(k)\left[u_{ij}^x(k) + \frac{x_i(k)}{d_i^+ + 1}\right], \qquad x_{ij}^l(k+1) = \tau_{ij}^l(k)\left[u_{ij}^x(k) + \frac{x_i(k)}{d_i^+ + 1}\right] + x_{ij}^{l+1}(k), \quad 1 \le l < \Gamma_d. \tag{15}$$

Finally, by (4) and (7) we obtain

$$u_{ij}^x(k+1) = \left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(k)\right)\left(u_{ij}^x(k) + \tau_i(k)\frac{x_i(k)}{d_i^+ + 1}\right). \tag{16}$$

Using (14) to (16) we can write the evolution of χ(k) and ψ(k) in the following linear form:

$$\chi(k+1) = M(k)\chi(k), \qquad \psi(k+1) = M(k)\psi(k), \tag{17}$$

where $M(k)\in\mathbb{R}^{(n+m')\times(n+m')}$ is an appropriately defined matrix.

We have thus completed half of our goal: we have shown how to write RAPS as a linear update. Next, we show that the corresponding matrices are column-stochastic.

Lemma 5 M(k) is column stochastic and its positive elements are at least $1/(\max_i\{d_i^+\} + 1)$. Moreover, for i = 1, …, n, the diagonal entries $M_{ii}(k)$ are positive.

This lemma can be proved “by inspection.” Indeed, M(k) is column stochastic if and only if, for every χ(k), we have $\mathbf{1}^T\chi(k+1) = \mathbf{1}^T\chi(k)$. Thus one just needs to demonstrate that no mass is ever “lost,” i.e., that a decrease/increase in the value of one node is always accompanied by an increase/decrease in the value of another node, which can be done just by inspecting the equations. A formal proof is nonetheless given next.

Proof To show that M(k) is column stochastic, we study how each element of χ(k) influences χ(k + 1).

For i = 1, …, n, the ith column of M(k) represents how $x_i(k)$ influences χ(k + 1). We will use (14) to (16) to find these coefficients.

First, $x_i(k)$ influences $x_i(k+1)$ with the coefficient $1 - \tau_i(k) + \tau_i(k)/(d_i^+ + 1) > 0$. For $j\in N_i^+$, $x_i(k)$ influences $x_{ij}^l(k+1)$ by $\tau_{ij}^l(k)/(d_i^+ + 1)$ and $u_{ij}^x(k+1)$ with coefficient $\left(\tau_i(k) - \sum_{l=1}^{\Gamma_d}\tau_i(k)\tau_{ij}^l(k)\right)/(d_i^+ + 1)$. Summing these coefficients up results in 1.

For l = 2, …, $\Gamma_d$ and $(i,j)\in\mathcal{E}$, $x_{ij}^l(k)$ influences $x_{ij}^{l-1}(k+1)$ with coefficient 1, and $x_{ij}^1(k)$ influences $x_j(k+1)$ with coefficient 1.

Finally, $u_{ij}^x(k)$ influences $x_{ij}^l(k+1)$ with coefficient $\tau_{ij}^l(k)$ and $u_{ij}^x(k+1)$ with $\left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(k)\right)$, which sum up to 1.

Note that all the coefficients above are at least $1/(\max_i\{d_i^+\} + 1)$. ■

An important consequence of this lemma is the sum preservation property, i.e.,

$$\sum_{i=1}^{n+m'}\chi_i(k) = \sum_{i=1}^{n} x_i(0), \qquad \sum_{i=1}^{n+m'}\psi_i(k) = n. \tag{18}$$
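As a sanity check on Lemma 5 and (18), one can build M(k) explicitly from (14)-(16) for a tiny instance and verify column stochasticity numerically; the instance below (n = 2, the single edge (1, 2), $\Gamma_d$ = 2) and its indicator values are our own assumptions for illustration.

import numpy as np

# State order: chi = [x_1, x_2, x_12^1, x_12^2, u_12]  (n = 2, m = 1, m' = 3).
# Assumed indicators this iteration: node 1 wakes (tau_1 = 1), node 2 sleeps
# (tau_2 = 0), and node 1's message departs with effective delay 2
# (tau_12^2 = 1), so the sent mass enters the virtual node b_12^2.
d1 = 1                                    # out-degree of node 1
M = np.zeros((5, 5))
M[0, 0] = 1 - 1 + 1 / (d1 + 1)            # (14): node 1 keeps a 1/(d_1^+ + 1) share
M[1, 1] = 1.0                             # (14): sleeping node 2 keeps its mass...
M[1, 2] = 1.0                             # ...and absorbs whatever b_12^1 forwards
M[3, 0] = 1 / (d1 + 1)                    # (15): sent share enters b_12^2,
M[3, 4] = 1.0                             #       together with any excess mass u_12
M[2, 3] = 1.0                             # (15): b_12^2 hands its content to b_12^1
# (16): u_12(k+1) = 0 because the message got through, so the last row stays zero.

assert np.allclose(M.sum(axis=0), 1.0)            # column stochastic (Lemma 5)
assert (M[np.nonzero(M)] >= 1 / (d1 + 1)).all()   # positive entries >= 1/(max_i d_i^+ + 1)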

For further analysis, we augment the graph $\mathcal{G}$ to $\mathcal{H}(k) \triangleq \mathcal{G}_{M(k)} = (\mathcal{V}_A, \mathcal{E}_A(k))$ by adding the following virtual nodes: $b_{ij}^l$ for l = 1, …, $\Gamma_d$ and $(i,j)\in\mathcal{E}$, which hold the values $x_{ij}^l$ and $y_{ij}^l$; we also add the nodes $c_{ij}$ for $(i,j)\in\mathcal{E}$, which hold the values $u_{ij}^x$ and $u_{ij}^y$.

In $\mathcal{H}(k)$, there is a link from $b_{ij}^l$ to $b_{ij}^{l-1}$ for 1 < l ≤ $\Gamma_d$, and from $b_{ij}^1$ to j, as these nodes forward their values to the next node. Moreover, if $\tau_{ij}^l(k) = 1$ for some 1 ≤ l ≤ $\Gamma_d$, then there is a link from both $c_{ij}$ and i to $b_{ij}^l$.

If $\tau_{ij}^l(k) = 0$ for all 1 ≤ l ≤ $\Gamma_d$, then $c_{ij}$ has a self-loop, and if additionally $\tau_i(k) = 1$, there is a link from i to $c_{ij}$. All non-virtual agents $i\in\mathcal{V}$ have self-loops at all times (see Figure 2).

Figure 2: Augmented graph $\mathcal{H}(k)$ for different scenarios.

Recursions (17) and Lemma 5 may thus be interpreted as showing that the RAPS algorithm can be thought of as a push-sum algorithm over the augmented graph sequence {H(k)}, where each agent (virtual and non-virtual) holds an x-value and a y-value which evolve similarly and in parallel.

2.2. Exponential Convergence

The main result of this section is the exponential convergence of RAPS to the initial average, stated next.

Theorem 6 Suppose Assumption 1 holds. Then RAPS converges exponentially to the initial mean of the agent values, i.e.,

$$\left|z_i(k) - \frac{1}{n}\sum_{i=1}^{n} x_i(0)\right| \le \delta\lambda^k\|x(0)\|_1,$$

where $\delta \triangleq \frac{1}{1 - n\alpha^6}$, $\lambda \triangleq (1 - n\alpha^6)^{1/(2n\Gamma_s)}$ and $\alpha \triangleq (1/n)^{n\Gamma_s}$.

It is worth mentioning that, even though $1/(1-\lambda) = O(n^{p(n)})$ where $p(n) = O(n)$, this is a bound for a worst-case scenario; on average, as can be seen in numerical simulations, RAPS performs better. Moreover, when the graph $\mathcal{G}$ satisfies certain properties, such as regularity, and there are no link delays and failures, we have $1/(1-\lambda) = O(n^3)$ (see Theorem 1 in Nedic and Olshevsky, 2016). More broadly, that paper establishes that 1/(1 − λ) will scale with the mixing rate of the underlying Markov process.
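To give a feel for these magnitudes, one can evaluate the worst-case constants of Theorem 6 numerically (illustrative numbers only; we use log1p/expm1 since 1 − λ underflows plain double-precision arithmetic):

import numpy as np

# Worst-case constants of Theorem 6 for a tiny network: n = 3, Gamma_s = 4.
n, G_s = 3, 4
alpha = (1.0 / n) ** (n * G_s)                        # (1/3)^12 ~ 1.9e-6
one_minus_lam = -np.expm1(np.log1p(-n * alpha**6) / (2 * n * G_s))
print(1.0 / one_minus_lam)                            # ~1.7e35: the worst-case
# rate is astronomically more pessimistic than behavior seen in simulations.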

Unfortunately, this theorem does not follow immediately from standard results on exponential convergence of push-sum. The reason is that the connectivity conditions assumed for such theorems are not satisfied here: there will not always be paths leading to virtual nodes from non-virtual nodes. Nevertheless, with some suitable modifications, the existence of paths from virtual nodes to other virtual nodes is sufficient, as we will show next.

Before proving the theorem, we need the following lemmas and definitions. Given a sequence of graphs $\mathcal{G}_0, \mathcal{G}_1, \mathcal{G}_2, \dots$, we will say node b is reachable from node a in the time period $k_1$ to $k_2$ ($k_1 < k_2$) if there exists a sequence of directed edges $e_{k_1}, e_{k_1+1}, \dots, e_{k_2}$ such that $e_k$ is in $\mathcal{G}_k$, the destination of $e_k$ is the origin of $e_{k+1}$ for $k_1 \le k < k_2$, the origin of $e_{k_1}$ is a, and the destination of $e_{k_2}$ is b.

Our first lemma provides a standard lower bound on the entries of the column-stochastic matrices from (17).

Lemma 7 $M^{k+n\Gamma_s-1:k}$ has positive first n rows, for any k ≥ 0. The positive elements of this matrix are at least

$$\alpha = (1/n)^{n\Gamma_s}.$$

Proof By Lemma 5, each node $j\in\mathcal{V}$ has a self-loop at every iteration in the augmented graph $\mathcal{H}$. Since $\mathcal{G}$ is strongly connected, the set of non-virtual nodes reachable from any node $a_h\in\mathcal{V}_A$ strictly increases every $\Gamma_s$ iterations. Hence, $M^{k+n\Gamma_s-1:k}$ has positive first n rows. Moreover, since all positive elements of M are at least 1/n, the positive elements of $M^{k+n\Gamma_s-1:k}$ are at least $(1/n)^{n\Gamma_s}$. ■

Next, we give a reformulation of the push-sum update that will be key to showing the exponential convergence of the algorithm. The proof is a minor variation of Lemma 4 in Nedic and Olshevsky (2016).

Lemma 8 Consider the vectors $u(k)\in\mathbb{R}^d$, $v(k)\in\mathbb{R}_+^d$, and square matrix $A(k)\in\mathbb{R}_+^{d\times d}$, for k ≥ 0, such that

$$u(k+1) = A(k)u(k), \qquad v(k+1) = A(k)v(k). \tag{19}$$

Also suppose $u_i(k) = 0$ if $v_i(k) = 0$, ∀k, i. Define $v^\dagger(k)\in\mathbb{R}^d$ as:

$$v_i^\dagger(k) \triangleq \begin{cases} 1/v_i(k), & \text{if } v_i(k) \neq 0,\\ 0, & \text{if } v_i(k) = 0.\end{cases}$$

Define $r(k) \triangleq u(k)\circ v^\dagger(k)$, where ∘ denotes the element-wise product of two vectors. Then we have

$$r(k+1) = B(k)\,r(k),$$

where $B(k)\in\mathbb{R}_+^{d\times d}$ is defined as

$$B(k) \triangleq \operatorname{diag}\left(v^\dagger(k+1)\right)A(k)\operatorname{diag}\left(v(k)\right).$$

Proof Since $u_i(k) = 0$ if $v_i(k) = 0$, the identity $u_i(k) = r_i(k)v_i(k)$ holds for all i, k. Substituting in (19) we obtain

$$r_i(k+1)\,v_i(k+1) = \sum_{j=1}^{d} A_{ij}(k)\,r_j(k)\,v_j(k).$$

Since, by definition, $r_i(k) = 0$ if $v_i(k) = 0$, ∀k, i, we get

$$r_i(k+1) = v_i^\dagger(k+1)\sum_{j=1}^{d} A_{ij}(k)\,r_j(k)\,v_j(k).$$

Therefore,

$$r(k+1) = \operatorname{diag}\left(v^\dagger(k+1)\right)A(k)\operatorname{diag}\left(v(k)\right)r(k). \;\blacksquare$$
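A quick numeric check of this lemma (ours, for illustration; we take v strictly positive so that $v^\dagger = 1/v$ entrywise):

import numpy as np

# With r(k) = u(k) o v^dagger(k), the ratio vector evolves as r(k+1) = B(k) r(k).
rng = np.random.default_rng(8)
d = 5
A = rng.random((d, d))                     # nonnegative A(k)
v = rng.random(d) + 0.1                    # strictly positive, so v^dagger = 1/v
r = rng.standard_normal(d)
u = r * v                                  # enforces u_i = 0 wherever v_i = 0

u_next, v_next = A @ u, A @ v
B = np.diag(1.0 / v_next) @ A @ np.diag(v)
assert np.allclose(u_next / v_next, B @ r)  # r(k+1) = B(k) r(k)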

Our next corollary, which follows immediately from the previous lemma, characterizes the dichotomy inherent in push-sum with virtual nodes: every row either adds up to one or zero.

Corollary 9 Consider the matrix B(k) defined in Lemma 8. Let us define the index set $J_k \triangleq \{i \mid v_i(k) \neq 0\}$. If $i\notin J_k$, the ith column of B(k) and the ith row of B(k − 1) contain only zero entries. Moreover,

$$B(k)\mathbf{1}_d = \operatorname{diag}\left(v^\dagger(k+1)\right)A(k)\,v(k) = \operatorname{diag}\left(v^\dagger(k+1)\right)v(k+1) = \begin{bmatrix}1 \text{ or } 0\\ \vdots\\ 1 \text{ or } 0\end{bmatrix}.$$

Hence, the ith row of B(k) sums to 1 if and only if $v_i(k+1) \neq 0$, i.e., $i\in J_{k+1}$.

Our next lemma characterizes the relationship between zero entries in the vectors χ(k) and ψ(k).

Lemma 10 $\chi_h(k) = 0$ whenever $\psi_h(k) = 0$, for h = 1, …, n + m′, k ≥ 0.

Proof First we note that $\psi(0) = [\mathbf{1}_n^T, \mathbf{0}_{m'}^T]^T$ and each node $i\in\mathcal{V}$ has a self-loop in the graph $\mathcal{H}(k)$ for all k ≥ 0; hence, $\psi_h(k) \ge 0$ for all h and, in particular, $\psi_i(k) > 0$ for i = 1, …, n. Now suppose h > n corresponds to a virtual agent $a_h\in\mathcal{V}_A$. If $\psi_h(k) = 0$, it means $a_h$ has already sent all its y-value to another node or has not received any y-value yet. In either case, that node has no remaining x-value as well, and $\chi_h(k) = 0$. ■

Let us define $\psi^\dagger(k)\in\mathbb{R}^{n+m'}$, k ≥ 0, by

$$\psi_i^\dagger(k) \triangleq \begin{cases} 1/\psi_i(k), & \text{if } \psi_i(k) \neq 0,\\ 0, & \text{if } \psi_i(k) = 0.\end{cases} \tag{20}$$

Moreover, we define the vector z(k) by setting $z(k) \triangleq \chi(k)\circ\psi^\dagger(k)$. By (17) and Lemma 10, we can use Lemma 8 to obtain

$$z(k+1) = P(k)\,z(k),$$

where $P(k) \triangleq \operatorname{diag}(\psi^\dagger(k+1))\,M(k)\,\operatorname{diag}(\psi(k))$. Let us define

$$I_k \triangleq \{i \mid \psi_i(k) > 0\}.$$

Then, by Corollary 9, each $z_i(k+1)$, $i\in I_{k+1}$, is a convex combination of the $z_j(k)$, $j\in I_k$. Therefore,

$$\max_{i\in I_{k+1}} z_i(k+1) \le \max_{i\in I_k} z_i(k), \qquad \min_{i\in I_{k+1}} z_i(k+1) \ge \min_{i\in I_k} z_i(k). \tag{21}$$

These equations will be key to the analysis of the algorithm. We stress that we have not shown that the quantity $\min_i z_i(k)$ is non-decreasing; rather, we have shown that the related quantity, where the minimum is taken over $I_k$, the set of nonzero entries of ψ(k), is non-decreasing.

Our next lemma provides lower and upper bounds on the entries of the vector ψ(k).

Lemma 11 For k ≥ 0 and 1 ≤ i ≤ n we have:

$$n\alpha \le \psi_i(k) \le n.$$

Moreover, for n + 1 ≤ h ≤ n + m′ and k ≥ 1, we have either $\psi_h(k) = 0$ or

$$n\alpha^2 \le \psi_h(k) \le n.$$

Proof We have

$$\psi(k) = M^{k-1:0}\begin{bmatrix}\mathbf{1}_n\\ \mathbf{0}_{m'}\end{bmatrix}.$$

If k < $n\Gamma_s$, the positive entries of $M^{k-1:0}$ are at least $(1/n)^k$. Hence, the positive entries of ψ(k) are at least

$$\left(\frac{1}{n}\right)^k \ge \left(\frac{1}{n}\right)^{n\Gamma_s - 1} = n\alpha.$$

Now suppose k ≥ $n\Gamma_s$. $M^{k-1:0}$ is the product of $M^{k-1:k-n\Gamma_s}$ and another column stochastic matrix. By Lemma 7, $M^{k-1:k-n\Gamma_s}$ has positive first n rows, with positive entries of at least α. Thus, $M^{k-1:0}$ has positive first n rows, with positive entries of at least α as well. We obtain, for 1 ≤ i ≤ n,

$$\psi_i(k) \ge n\alpha, \quad \text{for } k \ge 1.$$

For n + 1 ≤ h ≤ n + m′, suppose $\psi_h$ corresponds to a virtual node $a_h$ associated with some link $(i,j)\in\mathcal{E}$. If $\psi_h(k)$ is positive, it is carrying a value sent from i at time k − $n\Gamma_s$ or later, which has experienced link failures or delays. This is because each value gets to its destination after at most $\Gamma_s$ iterations. Since i has a self-loop at all times, $a_h$ is reachable from i in the period k − $n\Gamma_s$ to k − 1; hence, $M_{hi}^{k-1:k-n\Gamma_s} \ge \alpha$, and it follows that

$$\psi_h(k) \ge \alpha\,\psi_i(k - n\Gamma_s) \ge n\alpha^2.$$

Also, due to the sum preservation property (18), we have $\psi_h(k) \le n$ for all h and k ≥ 0. ■

Using Lemma 8 again, it follows that

$$z(k + n\Gamma_s) = \hat{P}(k)\,z(k),$$

where

$$\hat{P}(k) \triangleq \operatorname{diag}\left(\psi^\dagger(k+n\Gamma_s)\right)M^{k+n\Gamma_s-1:k}\operatorname{diag}\left(\psi(k)\right). \tag{22}$$

Next, we are able to find a lower bound on the positive elements of $\hat{P}(k)$. The proof of the following corollary is immediate.

Corollary 12 By (22) and Lemma 11 we have:

  1. $\hat{P}_{ij}(k) > 0$ for 1 ≤ i, j ≤ n.

  2. The positive entries of the first n columns of $\hat{P}(k)$ are at least $(1/n)\cdot\alpha\cdot(n\alpha) = \alpha^2$. Similarly, the last m′ columns have positive entries of at least $\alpha^3$.

  3. For h > n, if $h\in I_{k+n\Gamma_s}$ then $\hat{P}_{hi}(k) > 0$ for some 1 ≤ i ≤ n.

Our next lemma, which is the final result we need before proving the exponential convergence rate of RAPS, provides a quantitative bound for how multiplication by the matrix $\hat{P}$ shrinks the range of a vector.

Lemma 13 Let t ≥ 0 and let $\{u(k)\}_{k\ge 0}\subset\mathbb{R}^{n+m'}$ be a sequence of vectors such that

$$u(k+1) = \hat{P}(kn\Gamma_s + t)\,u(k).$$

Define

$$s_t(k) \triangleq \max_{i\in I_{kn\Gamma_s + t}} u_i(k) - \min_{i\in I_{kn\Gamma_s + t}} u_i(k).$$

Then,

$$s_t(k+2) \le (1 - n\alpha^6)\,s_t(k).$$

Proof Let us define

$$r_t(k) \triangleq \max_{1\le i\le n} u_i(k) - \min_{1\le i\le n} u_i(k).$$

By Corollary 12, for $j\in I_{(k+1)n\Gamma_s + t}$ the jth row of $\hat{P}(kn\Gamma_s + t)$ has at least one positive entry in its first n columns. Thus, because $u_j(k+1)$ is maximized/minimized when all of the weight is put on the largest/smallest possible entries of u(k), we have:

$$u_j(k+1) \le \alpha^3\max_{1\le i\le n} u_i(k) + (1 - \alpha^3)\max_{i\in I_{kn\Gamma_s + t}} u_i(k),$$
$$u_j(k+1) \ge \alpha^3\min_{1\le i\le n} u_i(k) + (1 - \alpha^3)\min_{i\in I_{kn\Gamma_s + t}} u_i(k).$$

Therefore,

$$s_t(k+1) \le \alpha^3 r_t(k) + (1 - \alpha^3)s_t(k). \tag{23}$$

Moreover, by a similar argument, for j ≤ n,

$$u_j(k+1) \le \alpha^3\sum_{i=1}^{n} u_i(k) + (1 - n\alpha^3)\max_{i\in I_{kn\Gamma_s + t}} u_i(k),$$
$$u_j(k+1) \ge \alpha^3\sum_{i=1}^{n} u_i(k) + (1 - n\alpha^3)\min_{i\in I_{kn\Gamma_s + t}} u_i(k).$$

Thus,

$$r_t(k+1) \le (1 - n\alpha^3)\,s_t(k).$$

Combining with (23), and noting that $r_t(k) \le s_t(k)$ and $s_t(k+1) \le s_t(k)$, we obtain

$$s_t(k+2) \le \alpha^3 r_t(k+1) + (1 - \alpha^3)s_t(k+1) \le \alpha^3(1 - n\alpha^3)s_t(k) + (1 - \alpha^3)s_t(k) = (1 - n\alpha^6)\,s_t(k). \;\blacksquare$$

Proof of Theorem 6 Using Lemma 13 with t = 0 and $u(k) = z(kn\Gamma_s)$ we get $s_0(k) \le (1 - n\alpha^6)^{\lfloor k/2\rfloor}s_0(0)$ and $\lim_{k\to\infty} s_0(k) = 0$. Moreover, by (21), $z_{\max}(k) \triangleq \max_{i\in I_k} z_i(k)$ is a non-increasing sequence and $z_{\min}(k) \triangleq \min_{i\in I_k} z_i(k)$ is non-decreasing. Thus,

$$\lim_{k\to\infty,\, h\in I_k} z_h(k) = L. \tag{24}$$

We have:

$$L = L\lim_{k\to\infty}\frac{\sum_{i=1}^{n+m'}\psi_i(k)}{n} = \lim_{k\to\infty}\left(\sum_{i=1}^{n+m'}\frac{z_i(k)\psi_i(k)}{n} + \sum_{i=1}^{n+m'}\frac{(L - z_i(k))\psi_i(k)}{n}\right) = \lim_{k\to\infty}\left(\sum_{i=1}^{n+m'}\frac{\chi_i(k)}{n} + \sum_{i=1}^{n+m'}\frac{(L - z_i(k))\psi_i(k)}{n}\right) = \frac{\sum_{i=1}^{n} x_i(0)}{n}.$$

In the above, we used (18) and (24), the boundedness of $\psi_i(k)$, and the fact that $\psi_i(k) = 0$ for $i\notin I_k$.

Finally, to show the exponential convergence rate, we go back to $s_0(k)$. We have, for k ≥ 1,

$$s_0(k) \le (1 - n\alpha^6)^{\lfloor k/2\rfloor}s_0(0) \le (1 - n\alpha^6)^{(k-1)/2}s_0(0), \qquad s_0(0) \le \sum_{i=1}^{n+m'}|z_i(0)| = \sum_{i=1}^{n}|x_i(0)| = \|x(0)\|_1,$$

where the first equality holds because $I_0 = \{1, \dots, n\}$ and $y_i(0) = 1$. Therefore, we have, for $i\in I_k$,

$$\left|z_i(k) - \frac{\mathbf{1}^T x(0)}{n}\right| \le z_{\max}(k) - z_{\min}(k) \le s_0\!\left(\left\lfloor\frac{k}{n\Gamma_s}\right\rfloor\right) \le (1 - n\alpha^6)^{\frac{\lfloor k/(n\Gamma_s)\rfloor - 1}{2}}\|x(0)\|_1 \le (1 - n\alpha^6)^{\frac{k/(n\Gamma_s) - 2}{2}}\|x(0)\|_1 = \frac{1}{1 - n\alpha^6}\left((1 - n\alpha^6)^{\frac{1}{2n\Gamma_s}}\right)^k\|x(0)\|_1 = \delta\lambda^k\|x(0)\|_1,$$

where $\delta = \frac{1}{1 - n\alpha^6}$ and $\lambda = (1 - n\alpha^6)^{1/(2n\Gamma_s)}$. Note that $\{1, \dots, n\}\subseteq I_k$, ∀k. ■

Remark: Observe that our proof did not really use the initialization ψ(0) = 1, except to observe that the elements of ψ(0) are positive and add up to n, and the implication that ψ(k) satisfies the bounds of Lemma 11. In particular, the same result would hold if we viewed time 1 as the initial point of the algorithm (so that ψ(1) is the initialization), or similarly any time k. We will use this observation in the next subsection.

2.3. Perturbed Push-Sum

In this subsection, we begin by introducing the Perturbed Robust Asynchronous Push-Sum algorithm, obtained by adding a perturbation to the x-values of the (non-virtual) agents at the beginning of every iteration at which they wake up.

We show that, if the perturbations are bounded, the resulting z(k) nevertheless tracks the average of χ(k) pretty well. Such a result is a key step towards analyzing distributed optimization protocols. In this general approach to the analyses of distributed optimization methods, we follow Ram et al. (2010) where it was first adopted; see also Nedic and Olshevsky (2016) and Nedic and Olshevsky (2015) where it was used.

Adopting the notation introduced earlier, and using the linear formulation (17), we have

$$\chi(k+1) = M(k)\left(\chi(k) + \Delta(k)\right), \quad \text{for } k \ge 0,$$

Algorithm 2.

Perturbed Robust Asynchronous Push-Sum

1: Initialize the algorithm with y(0) = 1, $\phi_i(0) = 0$, ∀i ∈ {1, …, n} and $\rho_{ij}(0) = 0$, $\kappa_{ij}(0) = 0$, $(j,i)\in\mathcal{E}$, and Δ(0) = 0.
2: At every iteration k = 0, 1, 2, …, for every node i:
3: if node i wakes up then
4:   $x_i \leftarrow x_i + \Delta_i(k)$;
5:   Lines 4 to 17 of Algorithm 1
6: end if
7: Other variables remain unchanged.

where $\Delta(k)\in\mathbb{R}^{n+m'}$ collects all perturbations $\Delta_i(k)$ in a column vector, with $\Delta_h(k) \triangleq 0$ for n < h ≤ n + m′. We may write this in a convenient form as follows:

$$\chi(k+1) = M(k)\left(\chi(k) + \Delta(k)\right) = \sum_{t=1}^{k} M^{k:t}\Delta(t) + M^{k:0}\chi(0).$$

Define, for k ≥ 1,

$$\chi^t(k) \triangleq M^{k-1:t}\Delta(t), \quad 1 \le t \le k, \qquad \chi^0(k) \triangleq M^{k-1:0}\chi(0), \quad t = 0. \tag{25}$$

We obtain

$$\chi(k) = \sum_{t=0}^{k-1}\chi^t(k), \qquad k \ge 1. \tag{26}$$

Define $z^t(k) \triangleq \chi^t(k)\circ\psi^\dagger(k)$ for 0 ≤ t ≤ k (cf. Equation 20). We have

$$z(k) = \sum_{t=0}^{k-1} z^t(k). \tag{27}$$

We may view each zt(k) as the outcome of a push-sum algorithm, initialized at time t, and apply Theorem 6. This immediately yields the following result, with part (b) an immediate consequence of part (a).

Theorem 14 Suppose Assumption 1 holds. Consider the sequence $\{z_i(k)\}$, 1 ≤ i ≤ n, generated by Algorithm 2. Then:

  (a) For k = 1, 2, …,
    $$\left|z_i(k) - \frac{\mathbf{1}^T\chi(k)}{n}\right| \le \delta\lambda^k\|x(0)\|_1 + \sum_{t=1}^{k-1}\delta\lambda^{k-t}\|\Delta(t)\|_1.$$
  (b) If $\lim_{t\to\infty}\|\Delta(t)\|_1 = 0$, then
    $$\lim_{k\to\infty}\left|z_i(k) - \frac{\mathbf{1}^T\chi(k)}{n}\right| = 0.$$

3. Robust Asynchronous Stochastic Gradient-Push (RASGP)

In this section we present the main contribution of this paper, a distributed stochastic gradient method with asymptotically network-independent and optimal performance over directed graphs which is robust to asynchrony, delays, and link failures.

Recall that we are considering a network $\mathcal{G}$ of n agents whose goal is to cooperatively solve the following minimization problem:

$$\operatorname*{minimize}_{z\in\mathbb{R}^d} \quad F(z) \triangleq \sum_{i=1}^{n} f_i(z),$$

where each $f_i:\mathbb{R}^d\to\mathbb{R}$ is a strongly convex function known only to agent i. We assume agent i has the ability to obtain noisy gradients of the function $f_i$.

The RASGP algorithm is given as Algorithm 3. Note that we use the notation $\hat{g}_i(k)$ for a noisy gradient of the function $f_i(z)$ at $z_i(k)$, i.e.,

$$\hat{g}_i(k) = g_i(k) + \varepsilon_i,$$

where $g_i(k) \triangleq \nabla f_i(z_i(k))$ and $\varepsilon_i$ is a random vector.

The RASGP is based on a standard idea of mixing consensus and gradient steps, first analyzed in Nedic and Ozdaglar (2009). The push-sum scheme of Section 2, inspired by Hadjicostis et al. (2016), is used instead of the consensus scheme, which allows us to handle delays, asynchronicity, and message losses; this is similar to the approach taken in Nedic and Olshevsky (2015). We note that a new step-size strategy is used to handle asynchronicity: when a node wakes up, it takes steps with a step-size proportional to the sum of all the step-sizes during the period it slept. As far as we are aware, this idea is new.
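A sketch of this step-size bookkeeping in isolation (the function names and the placeholder values of n and μ below are ours): each node remembers the last iteration $\kappa_i$ at which it woke and, on waking at iteration k, applies the accumulated step-size $\beta_i(k) = \sum_{t=\kappa_i+1}^{k}\alpha(t)$.

def alpha(t, n=4, mu=2.0):
    """Step-size schedule of Theorem 15: alpha(k) = n/(mu*k), alpha(0) = 0."""
    return 0.0 if t == 0 else n / (mu * t)

def accumulated_stepsize(kappa_i, k):
    """beta_i(k): sum of the step-sizes over the interval node i slept through."""
    return sum(alpha(t) for t in range(kappa_i + 1, k + 1))

# A node that last woke at iteration 4 and wakes again at k = 10 takes one
# gradient step scaled by alpha(5) + ... + alpha(10):
print(accumulated_stepsize(4, 10))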

We will be making the following assumption on the noise vectors.

Assumption 2 $\varepsilon_i$ is an independent random vector with bounded support, i.e., $\|\varepsilon_i\| \le b_i$, i = 1, …, n. Moreover, $\mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}\left[\|\varepsilon_i\|^2\right] \le \sigma_i^2$.
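For concreteness, one noise model satisfying Assumption 2 (our illustrative choice; the paper does not prescribe one) is uniform noise on a coordinate box, scaled so that $\|\varepsilon_i\| \le b_i$ always:

import numpy as np

def noisy_grad(grad, z, b, rng):
    """Noisy gradient oracle satisfying Assumption 2: eps is zero-mean, has
    bounded support ||eps|| <= b, and E||eps||^2 = b^2/3 (so sigma^2 = b^2/3)."""
    d = z.shape[0]
    eps = rng.uniform(-b / np.sqrt(d), b / np.sqrt(d), size=d)
    return grad(z) + eps

rng = np.random.default_rng(15)
g_hat = noisy_grad(lambda z: 2 * z, np.zeros(3), b=0.5, rng=rng)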

Next, we state and prove the main result of this paper, which establishes the convergence rate of Algorithm 3.

Theorem 15 Suppose that:

  1. Assumptions 1 and 2 hold.

  2. Each objective function $f_i(z)$ is $\mu_i$-strongly convex over $\mathbb{R}^d$.

  3. The gradients of each $f_i(z)$ are $L_i$-Lipschitz continuous, i.e., for all $z_1, z_2\in\mathbb{R}^d$,
    $$\|g_i(z_1) - g_i(z_2)\| \le L_i\|z_1 - z_2\|.$$

Then, the RASGP algorithm with the step-size α(k) = n/(μk) for k ≥ 1 and α(0) = 0 will converge to the unique optimum z* with the following asymptotic rate: for all i = 1, …, n, we have

$$\mathbb{E}\left[\|z_i(k) - z^*\|^2\right] \le \frac{\Gamma_u\sigma^2}{k\mu^2} + O_k\!\left(\frac{1}{k^{1.5}}\right),$$

where $\sigma^2 \triangleq \sum_i \sigma_i^2$ and $\mu = \sum_i \mu_i$.

Algorithm 3.

Robust Asynchronous Stochastic Gradient-Push (RASGP)

  1: Initialize the algorithm with $y_i(0) = 1$, $\phi_i^x(0) = 0$, $\phi_i^y(0) = 0$, $\kappa_i(0) = -1$, $\forall i \in \{1, \ldots, n\}$, and $\rho_{ij}^x(0) = 0$, $\rho_{ij}^y(0) = 0$, $\kappa_{ij}(0) = -1$, $\forall (j,i) \in \mathcal{E}$.
  2: At every iteration k = 0, 1, 2, …, for every node i:
  3: if node i wakes up then
  4:   $\beta_i(k) = \sum_{t=\kappa_i+1}^{k} \alpha(t)$;
  5:   $x_i \leftarrow x_i - \beta_i(k)\,\hat{g}_i(k)$;
  6:   $\kappa_i \leftarrow k$;
  7:   $\phi_i^x \leftarrow \phi_i^x + \frac{x_i}{d_i^+ + 1}$,  $\phi_i^y \leftarrow \phi_i^y + \frac{y_i}{d_i^+ + 1}$;
  8:   $x_i \leftarrow \frac{x_i}{d_i^+ + 1}$,  $y_i \leftarrow \frac{y_i}{d_i^+ + 1}$;
  9:   Node i broadcasts $(\phi_i^x, \phi_i^y, \kappa_i)$ to its out-neighbors $N_i^+$.
 10:   Processing the received messages:
 11:   for $(\phi_j^x, \phi_j^y, \kappa_j)$ in the inbox do
 12:     if $\kappa_j > \kappa_{ij}$ then
 13:       $\rho_{ij}^{x\ast} \leftarrow \phi_j^x$,  $\rho_{ij}^{y\ast} \leftarrow \phi_j^y$;
 14:       $\kappa_{ij} \leftarrow \kappa_j$;
 15:     end if
 16:   end for
 17:   $x_i \leftarrow x_i + \sum_{j \in N_i^-}\left(\rho_{ij}^{x\ast} - \rho_{ij}^x\right)$,  $y_i \leftarrow y_i + \sum_{j \in N_i^-}\left(\rho_{ij}^{y\ast} - \rho_{ij}^y\right)$;
 18:   $\rho_{ij}^x \leftarrow \rho_{ij}^{x\ast}$,  $\rho_{ij}^y \leftarrow \rho_{ij}^{y\ast}$, $\forall j \in N_i^-$;
 19:   $z_i \leftarrow x_i / y_i$;
 20: end if
 21: Other variables remain unchanged.
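To fix ideas, here is a self-contained sketch of Algorithm 3 in its simplest regime (synchronous updates, no delays or message losses, a directed cycle), where it reduces to stochastic gradient-push and $\beta_i(k) = \alpha(k)$; the quadratic local losses $f_i(z) = \frac{\mu_i}{2}(z - c_i)^2$ and all constants are our own illustrative choices, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 20000
c = rng.uniform(-1.0, 1.0, size=n)       # local minimizers
mu_i = np.ones(n)                         # strong-convexity moduli
mu = mu_i.sum()
z_star = (mu_i * c).sum() / mu            # minimizer of F = sum_i f_i

x = np.ones(n); y = np.ones(n)
phi_x = np.zeros(n); phi_y = np.zeros(n)  # broadcast running sums
rho_x = np.zeros(n); rho_y = np.zeros(n)  # last sums received (one in-neighbor)
z = x / y
for k in range(1, T + 1):
    alpha = n / (mu * k)                  # step-size from Theorem 15
    g_hat = mu_i * (z - c) + rng.uniform(-0.5, 0.5, size=n)  # noisy gradients
    x = x - alpha * g_hat                 # beta_i(k) = alpha(k): no one sleeps
    # out-degree 1, so each node keeps 1/(d+1) = 1/2 and ships 1/2 as a sum
    phi_x += x / 2.0; phi_y += y / 2.0
    x = x / 2.0; y = y / 2.0
    # node i receives from in-neighbor i-1; with no losses, the difference of
    # running sums equals this round's share
    recv_x = np.roll(phi_x, 1); recv_y = np.roll(phi_y, 1)
    x += recv_x - rho_x; y += recv_y - rho_y
    rho_x, rho_y = recv_x, recv_y
    z = x / y
print(np.max((z - z_star) ** 2))          # small: all agents near z_star
```

The running-sum bookkeeping ($\phi$, $\rho$) is what gives the method its robustness: a lost or delayed message is automatically compensated for by the next message that does arrive.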

Remark 16 We note that each agent stores the variables $x_i$, $y_i$, $\kappa_i$, $z_i$, $\phi_i^x$, $\phi_i^y$, and $\rho_{ij}^x$, $\rho_{ij}^y$, $\kappa_{ij}$ for all in-neighbors $j \in N_i^-$. Hence, the memory requirement of the RASGP algorithm is $O(d_i)$ for each agent i.

We next turn to the proof of Theorem 15. First, we observe that Algorithm 3 is a specific case of multi-dimensional Perturbed Robust Asynchronous Push-Sum. In other words, each coordinate of the vectors $x_i$, $z_i$, $\phi_i^x$ and $\rho_{ij}^x$ will experience an instance of Algorithm 2. Hence, there exists an augmented graph sequence {H(k)} on which Algorithm 3 is equivalent to perturbed push-sum consensus, where each agent $a_h \in V^A$ holds the vectors $x_h$ and $y_h$. In other words, we will be able to apply Theorem 14 to analyze Algorithm 3.

Our first step is to show how to decouple the action of Algorithm 3 coordinate by coordinate. For each coordinate $1 \le \ell \le d$, let $\chi_\ell \in \mathbb{R}^{n+m'}$ stack up the $\ell$th entries of the x-values of all agents (virtual and non-virtual) in $V^A$. Additionally, define $\Delta_\ell(k) \in \mathbb{R}^{n+m'}$ to be the vector stacking up the $\ell$th entries of the perturbations, i.e.,

$$[\Delta_\ell(k)]_i := \begin{cases} -\beta_i(k)\,[\hat{g}_i(k)]_\ell, & \text{if } i \in V,\ \tau_i(k) = 1, \\ 0, & \text{otherwise.} \end{cases}$$

Then, by the definition of the algorithm, we have for all $\ell = 1, \ldots, d$,

$$\chi_\ell(k+1) = M(k)\left(\chi_\ell(k) + \Delta_\ell(k)\right), \qquad \psi(k+1) = M(k)\,\psi(k). \tag{28}$$

These equations write out the action of Algorithm 3 on a coordinate-by-coordinate basis.

In order to prove Theorem 15, we need a few tools and lemmas. As already mentioned, our first step will be to argue that Algorithm 3 converges by application of Theorem 14. This requires showing the boundedness of the perturbations $\Delta_\ell(k)$, which, as we will show, reduces to showing the vectors $z_i(k)$ are bounded. The following lemma will be useful to establish this boundedness.

Lemma 17 (Nedic and Olshevsky, 2016, Lemma 3) Let $q : \mathbb{R}^d \to \mathbb{R}$ be a ν-strongly convex function with ν > 0 which has Lipschitz gradients with constant L. Let $v \in \mathbb{R}^d$ and let $u \in \mathbb{R}^d$ be defined by

$$u = v - \alpha\left(\nabla q(v) + p(v)\right),$$

where $\alpha \in (0, \nu/(8L^2)]$ and $p : \mathbb{R}^d \to \mathbb{R}^d$ is a mapping satisfying

$$\|p(v)\| \le c, \quad \text{for all } v \in \mathbb{R}^d.$$

Then, there exist a compact set $S \subset \mathbb{R}^d$ and a scalar R such that

$$\|u\| \le \begin{cases} \|v\|, & \text{for all } v \notin S, \\ R, & \text{for all } v \in S, \end{cases}$$

where,

$$S := \left\{ z \,\middle|\, q(z) \le q(0) + \frac{2\nu}{8L^2}\left(\|\nabla q(0)\|^2 + c^2\right) \right\} \cup B\!\left(0, \frac{4c}{\nu}\right), \qquad R := \max_{z \in S}\left\{ \|z\| + \frac{\nu}{8L^2}\|\nabla q(z)\| \right\} + \frac{\nu c}{8L^2}.$$

We now argue that the iterates generated by Algorithm 3 are bounded.

Lemma 18 The iterates zi(k) generated by Algorithm 3 will remain bounded.

Proof Let us adopt the notation ψ from previous sections and define $z_\ell(k) := \chi_\ell(k) \oslash \psi(k) \in \mathbb{R}^{n+m'}$. Moreover, adopt the notation $z_h$ for virtual agent $a_h$, $h = n+1, \ldots, n+m'$, as $z_h(k) := x_h(k)/\psi_h(k)$. Also define $u_\ell \in \mathbb{R}^{n+m'}$ by

$$u_\ell(k) := \chi_\ell(k) + \Delta_\ell(k).$$

Since the perturbations are only added to the non-virtual agents, which have strictly positive y-values, we conclude that $[u_\ell(k)]_h = 0$ if $\psi_h(k) = 0$. Hence, the assumptions of Lemma 8 and Corollary 9 are satisfied. Adopting the definitions of $I_k$ and P(k) from previous sections, we get for $i \in I_{k+1}$,

$$[z_\ell(k+1)]_i = \sum_{j \in I_k} P_{ij}(k)\,\frac{[u_\ell(k)]_j}{\psi_j(k)}.$$

Combining the equation above for $\ell = 1, \ldots, d$ we obtain:

$$z_i(k+1) = \sum_{j \in I_k} P_{ij}(k)\,\frac{u_j(k)}{\psi_j(k)}, \tag{29}$$

where $u_j(k) \in \mathbb{R}^d$ is created by collecting the jth entries of the vectors $u_\ell(k)$, $\ell = 1, \ldots, d$, i.e.,

$$u_i(k) = \begin{cases} x_i(k) - \beta_i(k)\,\hat{g}_i(k), & \text{if } i \in V \text{ and } \tau_i(k) = 1, \\ x_i(k), & \text{otherwise.} \end{cases}$$

Now consider each term on the right-hand side of (29) for $j \in I_k$. Suppose $j \le n$ and $\tau_j(k) = 1$; then we have:

$$\frac{u_j(k)}{y_j(k)} = z_j(k) - \frac{\beta_j(k)}{y_j(k)}\left(\nabla f_j(z_j(k)) + \varepsilon_j(k)\right).$$

Since $\lim_{k\to\infty}\alpha(k) = 0$ and $k - \kappa_j(k) \le \Gamma_u$, we have $\lim_{k\to\infty}\beta_j(k) = 0$. Moreover, by Lemma 11, $y_j(k)$ is bounded below; thus, $\lim_{k\to\infty}\beta_j(k)/y_j(k) = 0$ and there exists $k_j$ such that for $k \ge k_j$, $\beta_j(k)/y_j(k) \in (0, \mu_j/(8L_j^2)]$. Applying Lemma 17, it follows that for each j there exist a compact set $S_j$ and a scalar $R_j$ such that for $k \ge k_j$, if $\tau_j(k) = 1$,

$$\left\|\frac{u_j(k)}{y_j(k)}\right\| \le \begin{cases} \|z_j(k)\|, & \text{if } z_j(k) \notin S_j, \\ R_j, & \text{if } z_j(k) \in S_j. \end{cases} \tag{30}$$

Moreover, if $\tau_j(k) = 0$ or $j > n$, we have,

$$\frac{u_j(k)}{y_j(k)} = z_j(k). \tag{31}$$

Let $k_z := \max_i k_i$. Using mathematical induction, we will show that for all $k \ge k_z$:

$$\max_{i \in I_k}\|z_i(k)\| \le \bar{R}, \tag{32}$$

where $\bar{R} := \max\{\max_i R_i,\ \max_{j \in I_{k_z}}\|z_j(k_z)\|\}$. Equation (32) holds for $k = k_z$. Suppose it is true for some $k \ge k_z$. Then by (30) and (31) we have,

$$\left\|\frac{u_i(k)}{y_i(k)}\right\| \le \max\{R_i,\ \|z_i(k)\|\} \le \bar{R}. \tag{33}$$

Also by (29), for $i \in I_{k+1}$, $z_i(k+1)$ is a convex combination of the $u_j(k)/y_j(k)$'s, where $j \in I_k$. Hence,

$$\|z_i(k+1)\| \le \sum_{j \in I_k} P_{ij}(k)\left\|\frac{u_j(k)}{\psi_j(k)}\right\| \le \bar{R}.$$

Define $B_z := \max\{\bar{R},\ \max_{i \in I_k,\, k < k_z}\|z_i(k)\|\}$; then $\|z_i(k)\| \le B_z$, $\forall k \ge 0$. ■

We next explore a convenient way to rewrite Algorithm 3. Let us introduce the quantity $w_i(k)$, which can be interpreted as the x-value of agent i if it performed a gradient step at every iteration, even when asleep:

$$w_i(k) := \begin{cases} x_i(k) - \left(\sum_{t=\kappa_i(k)+1}^{k-1}\alpha(t)\right)g_i(k), & \text{if } i \in V, \\ x_i(k), & \text{otherwise.} \end{cases} \tag{34}$$

Also, define $w_\ell \in \mathbb{R}^{n+m'}$ by collecting the $\ell$th dimension of all the $w_i$'s, and let $\bar{w}(k) := \left(\sum_{i=1}^{n+m'} w_i(k)\right)/n$. Moreover, define $g_\ell \in \mathbb{R}^{n+m'}$ by collecting the $\ell$th value of the gradients of all agents (0 for virtual agents), i.e.,

$$[g_\ell(k)]_i = \begin{cases} [g_i(k)]_\ell, & \text{if } i \in V, \\ 0, & \text{otherwise.} \end{cases}$$

Additionally, define $\hat{\varepsilon}_i(k) \in \mathbb{R}^d$ as the noise injected into the system at time k by agent i, i.e.,

$$\hat{\varepsilon}_i(k) = \begin{cases} \beta_i(k)\,\varepsilon_i(k), & \text{if } i \in V \text{ and } \tau_i(k) = 1, \\ 0, & \text{otherwise,} \end{cases}$$

and $\hat{\varepsilon}_\ell(k) \in \mathbb{R}^{n+m'}$ as the vector collecting the $\ell$th values of all the $\hat{\varepsilon}_i(k)$'s.

We then have the following lemma.

Lemma 19

$$w_\ell(k+1) = M(k)\left(w_\ell(k) - \alpha(k)\,g_\ell(k) - \hat{\varepsilon}_\ell(k)\right). \tag{35}$$

Proof We consider two cases:

  • If $\tau_i(k) = 0$, then (35) reduces to $w_i(k+1) = w_i(k) - \alpha(k)g_i(k)$; because node i did not update at time k, we have $g_i(k) = g_i(k+1)$, and this is the correct update.

  • For all other nodes (i.e., both virtual nodes and nodes with $\tau_i(k) = 1$), we have $[w_\ell(k) - \alpha(k)g_\ell(k) - \hat{\varepsilon}_\ell(k)]_i = [\chi_\ell(k) + \Delta_\ell(k)]_i$ in (28). Since $\chi_\ell(k+1) = M(k)(\chi_\ell(k) + \Delta_\ell(k))$ and, using the definition of $w_i(k)$, we have that for these nodes,
    $$w_i(k+1) = x_i(k+1),$$

(28) implies the conclusion. ■

This lemma allows us to straightforwardly analyze how the average of w(k) evolves. Indeed, summing all the elements of (35) (recall that M(k) is column-stochastic, so the sum of the entries is preserved) and dividing by n for each $\ell = 1, \ldots, d$, we obtain,

$$\bar{w}(k+1) = \bar{w}(k) - \frac{\alpha(k)}{n}\sum_{i=1}^{n}g_i(k) - \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i(k) = \bar{w}(k) - \frac{\alpha(k)}{n}\sum_{i=1}^{n}\nabla f_i(\bar{w}(k)) - \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i(k) - \frac{\alpha(k)}{n}\sum_{i=1}^{n}\left(g_i(k) - \nabla f_i(\bar{w}(k))\right). \tag{36}$$

We next give a sequence of lemmas to the effect that all the quantities generated by the algorithm are close to each other over time. Define,

$$\bar{x}(k) := \frac{1}{n}\sum_{a_h \in V^A} x_h(k),$$

where, recall, $V^A$ is our notation for all the nodes in the augmented graph (i.e., including virtual nodes). Moreover, we will extend the definition of $\beta_i(k)$ from Line 4 of Algorithm 3 to all k via the same formula $\beta_i(k) := \sum_{t=\kappa_i(k)+1}^{k}\alpha(t)$. Our first lemma will show that each $z_i(k)$ closely tracks $\bar{x}(k)$.

Lemma 20 Using Algorithm 3 with $\alpha(k) = n/(\mu k)$, under the assumptions of Theorem 15, we have for each i, $\|z_i(k+1) - \bar{x}(k+1)\| = O_k(1/k)$.

Proof By Theorem 14(a) we have for each ℓ,

$$\left|[z_\ell(k+1)]_i - \frac{\mathbf{1}^\top\chi_\ell(k+1)}{n}\right| \le \delta\lambda^k\,\|\chi_\ell(0)\|_1 + \sum_{t=1}^{k}\delta\lambda^{k-t}\,\|\Delta_\ell(t)\|_1.$$

Summing the above inequality over $\ell = 1, \ldots, d$ we obtain,

$$\|z_i(k+1) - \bar{x}(k+1)\|_1 \le \sum_{j=1}^{n}\left(\delta\lambda^k\,\|x_j(0)\|_1 + \sum_{t=1}^{k}\delta\lambda^{k-t}\,\beta_j(t)\,\tau_j(t)\,\|\hat{g}_j(t)\|_1\right).$$

Moreover,

$$\beta_i(k) = \sum_{t=\kappa_i(k)+1}^{k}\frac{n}{\mu t} \le \frac{n}{\mu}\cdot\frac{k - \kappa_i(k)}{\kappa_i(k)+1}. \tag{37}$$

But,

$$\kappa_i(k) < k \le \kappa_i(k) + \Gamma_u.$$

Since $\Gamma_u \ge 1$, we obtain

$$k \le \left(\kappa_i(k)+1\right)\Gamma_u,$$

or,

$$\frac{1}{\kappa_i(k)+1} \le \frac{\Gamma_u}{k}.$$

Thus, from (37) we have,

$$\beta_i(k) \le \frac{n\,\Gamma_u^2}{\mu k}. \tag{38}$$

Define,

$$M_j := \max_{\|z\| \le B_z}\|g_j(z)\|_1, \tag{39}$$

and observe that $M_j$ is finite by Lemma 18. Also $\tau_j(k) \le 1$. We obtain,

$$\|z_i(k+1) - \bar{x}(k+1)\|_1 \le \sum_{j=1}^{n}\left(\delta\lambda^k\,\|x_j(0)\|_1 + \sum_{t=1}^{k}\delta\lambda^{k-t}\,\frac{n\Gamma_u^2}{\mu t}\left(M_j + b_j\right)\right).$$

Let RHS denote the right-hand side of the relation above. We have,

$$\text{RHS} = \sum_{j=1}^{n}\left(\delta\lambda^k\,\|x_j(0)\|_1 + \frac{\delta n\Gamma_u^2}{\mu}\left(M_j+b_j\right)\left(\sum_{t=1}^{\lfloor k/2\rfloor}\frac{\lambda^{k-t}}{t} + \sum_{t=\lfloor k/2\rfloor+1}^{k}\frac{\lambda^{k-t}}{t}\right)\right) \le \sum_{j=1}^{n}\left(\delta\lambda^k\,\|x_j(0)\|_1 + \frac{\delta n\Gamma_u^2}{\mu}\left(M_j+b_j\right)\left(\frac{k}{2}\,\lambda^{k/2} + \frac{2}{(1-\lambda)k}\right)\right) = O_k\!\left(\frac{1}{k}\right),$$

where we used the following relations,

$$\sum_{t=1}^{\lfloor k/2\rfloor}\frac{\lambda^{k-t}}{t} \le \frac{k}{2}\,\lambda^{k-\lfloor k/2\rfloor} \le \frac{k}{2}\,\lambda^{k/2}, \qquad \sum_{t=\lfloor k/2\rfloor+1}^{k}\frac{\lambda^{k-t}}{t} \le \sum_{t=0}^{\lceil k/2\rceil-1}\frac{\lambda^{t}}{\lfloor k/2\rfloor+1} \le \frac{2}{(1-\lambda)k}.$$

Finally, since $\|v\|_2 \le \|v\|_1$ for all vectors v, the proof is complete. ■
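The split-sum estimate above is easy to check numerically; the following quick sketch (our own, with an arbitrary choice of λ) confirms that $S(k) = \sum_{t=1}^{k}\lambda^{k-t}/t$ indeed decays like $1/k$:

```python
# k * S(k) should stay bounded (about 1/(1 - lam) for large k),
# consistent with the O_k(1/k) claim in the proof of Lemma 20.
lam = 0.9
for k in [10, 100, 1000, 10000]:
    S = sum(lam ** (k - t) / t for t in range(1, k + 1))
    print(k, k * S)   # the products approach 1/(1 - lam) = 10
```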

An immediate consequence of this lemma is that the quantities x¯(k) and w¯(k) are close to each other.

Lemma 21 Using Algorithm 3 with $\alpha(k) = n/(\mu k)$, under the assumptions of Theorem 15, we have $\|\bar{x}(k) - \bar{w}(k)\| = O_k(1/k)$.

Proof By the definition of w we have,

$$\bar{x}(k) - \bar{w}(k) = \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{t=\kappa_i(k)+1}^{k-1}\alpha(t)\right)g_i(k).$$

Using (38) we have,

$$\|\bar{x}(k) - \bar{w}(k)\| \le \frac{1}{n}\sum_{i=1}^{n}\beta_i(k)\,M_i \le \sum_{i=1}^{n}\frac{\Gamma_u^2\,M_i}{\mu k} = O_k\!\left(\frac{1}{k}\right),$$

where $M_i$ was defined in (39). ■

We next remark on a couple of implications of the past series of lemmas.

Corollary 22 We have $\|z_i(k) - \bar{w}(k)\| = O_k\!\left(\frac{1}{k}\right)$.

Lemma 23 $\|g_i(k) - \nabla f_i(\bar{w}(k))\| = O_k\!\left(\frac{1}{k}\right)$.

Proof Since $\nabla f_i$ is $L_i$-Lipschitz, we have,

$$\|g_i(k) - \nabla f_i(\bar{w}(k))\| \le L_i\,\|z_i(k) - \bar{w}(k)\|.$$

Using Corollary 22, the lemma is proved. ■

We are now in a position to rewrite Algorithm 3 as a sort of perturbed gradient descent. Let us define,

$$\eta(k) := \frac{1}{\mu k}\sum_{i=1}^{n}\left(g_i(k) - \nabla f_i(\bar{w}(k))\right).$$

By Lemma 23, $\|\eta(k)\| = O_k(1/k^2)$. Therefore, there exists $B_\eta$ such that $\|\eta(k)\| \le B_\eta/k^2$ for all $k \ge 1$.

By (36) we have,

$$\bar{w}(k+1) = \bar{w}(k) - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k) - \eta(k), \tag{40}$$

where

  • The function $F := \sum_{i=1}^{n}f_i : \mathbb{R}^d \to \mathbb{R}$ is μ-strongly-convex with L-Lipschitz gradient, where $L := \sum_{i=1}^{n}L_i$.

  • The noise $\bar{\varepsilon}(k) := \left(\sum_{i=1}^{n}\hat{\varepsilon}_i(k)\right)/n$ is bounded (i.e., $\bar{\varepsilon}(k) \in B(0, r_e)$) with probability one, where $r_e := (\Gamma_u/\mu)\sum_j b_j$, and $\mathbb{E}[\bar{\varepsilon}(k)] = 0$.

In other words, with the exception of the η(k) term, what we have is exactly a stochastic gradient descent method on the function F(⋅).
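To illustrate this reduction, here is a scalar simulation of recursion (40) under toy assumptions of our own (a one-dimensional $F(w) = \frac{\mu}{2}(w - z^*)^2$, uniform noise, and a deterministic $\eta(k) = 1/k^2$ bias); the mean-square error decays like $1/k$, as Lemma 28 and the refined bound (49) below predict:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, z_star, trials, T = 1.0, 3.0, 1000, 5000
w = np.zeros(trials)                          # independent runs, vectorized
for k in range(1, T + 1):
    grad = mu * (w - z_star)                  # exact gradient of F at w
    eps = rng.uniform(-0.5, 0.5, size=trials) / (mu * k)  # zero-mean, O(1/k)
    w = w - grad / (mu * k) - eps - 1.0 / k**2  # step 1/(mu*k), noise, eta(k)
print(T * np.mean((w - z_star) ** 2))         # roughly constant in T => O(1/T)
```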

The following lemmas bound $\bar{\varepsilon}(k)$. Let us define $\nu_i(k) := k - \kappa_i(k)$ as the number of iterations agent i has skipped since its last update. By Assumption 1, $\nu_i(k) \le \Gamma_u$.

Lemma 24 We have $\beta_i(k) = O_k(1/k)$, ∀i. Moreover,

$$\beta_i(k) \le \frac{n\,\nu_i(k)}{\mu k} + O_k(k^{-2}).$$

Proof Since $\nu_i(k) \le \Gamma_u$, ∀i, we have for $\kappa_i(k) \ge 1$,

$$\beta_i(k) = \sum_{t=\kappa_i(k)+1}^{k}\frac{n}{\mu t} \le \frac{n}{\mu}\ln\!\left(\frac{k}{\kappa_i(k)}\right) = \frac{n}{\mu}\ln\!\left(\frac{k}{k - \nu_i(k)}\right) = \frac{n}{\mu}\ln\!\left(1 + \frac{\nu_i(k)}{k - \nu_i(k)}\right) \le \frac{n\,\nu_i(k)}{\mu\left(k - \nu_i(k)\right)} = \frac{n\,\nu_i(k)}{\mu k} + O_k(k^{-2}). \quad \blacksquare$$

Corollary 25 $\mu k\,\bar{\varepsilon}(k)$ is bounded.

Lemma 26 There exists $B_\epsilon > 0$ such that

$$\mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right] \le \frac{\Gamma_u^2}{\mu^2 k^2}\,\sigma^2 + \frac{B_\epsilon}{k^4}.$$

Proof Using Lemma 24, we have for $k > \Gamma_u$,

$$\mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right] = \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\beta_i(k)\,\varepsilon_i(k)\,\tau_i(k)\right\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\beta_i^2(k)\,\mathbb{E}\left[\|\varepsilon_i(k)\|^2\right] \le \frac{1}{n^2}\sum_{i=1}^{n}\beta_i^2(k)\,\sigma_i^2 \le \frac{\Gamma_u^2}{\mu^2 k^2}\,\sigma^2 + O_k(k^{-4}),$$

where the second equality is the result of the noise terms being independent and zero-mean. ■

Our next observation is a technical lemma which is essentially a rephrasing of Lemma 17 above.

Lemma 27 There exist a constant $B_w$ and a time $k_w$ such that $\|\bar{w}(k)\| \le B_w$ with probability one, for $k \ge k_w$.

Proof We have

$$\bar{w}(k+1) = \bar{w}(k) - \frac{1}{\mu k}\left[\nabla F(\bar{w}(k)) + \mu k\left(\bar{\varepsilon}(k) + \eta(k)\right)\right],$$

where $\mu k\left(\bar{\varepsilon}(k) + \eta(k)\right)$ is bounded. Moreover, there exists $k_w$ such that for $k \ge k_w$, $\frac{1}{\mu k} \in (0, \mu/(8L^2)]$. Therefore, by Lemma 17 there exist a compact set $S_w$ and a scalar $R_w > 0$ such that for $k \ge k_w$,

$$\|\bar{w}(k+1)\| \le \begin{cases}\|\bar{w}(k)\|, & \text{for } \bar{w}(k) \notin S_w, \\ R_w, & \text{for } \bar{w}(k) \in S_w.\end{cases}$$

Therefore, setting $B_w := \max\{R_w,\ \|\bar{w}(k_w)\|\}$ completes the proof. ■

As a consequence of this lemma, and because $\|\eta(k)\| \le B_\eta$, there is a constant $B_1$ such that for $k \ge k_w$,

$$\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\| \le B_1, \tag{41}$$

with probability one. This now puts us in a position to show that $\bar{w}(k)$ converges in mean square to the optimal solution.

Lemma 28 $\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] \to 0$.

Proof Using the definition of $k_w$ from Lemma 27, we have that for $k \ge k_w$,

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \mathbb{E}\left[\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\|^2 + 2\|\eta(k)\|\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\| + \|\eta(k)\|^2\right].$$

We will bound each of the terms on the right. We begin with the easiest one, which is the last one:

$$\|\eta(k)\|^2 \le \frac{B_\eta^2}{k^4}. \tag{42}$$

The middle term is bounded as

$$2\|\eta(k)\|\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\| \le \frac{2B_\eta B_1}{k^2}, \tag{43}$$

where we used (41).

Finally, we turn to the first term, which we denote by $T_1$:

$$T_1 \le \mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] - \frac{2}{\mu k}\,\mathbb{E}\left[\nabla F(\bar{w}(k))^\top\left(\bar{w}(k) - z^*\right)\right] + \frac{L^2}{\mu^2 k^2}\,\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right],$$

where we used the usual inequality $\|\nabla F(\bar{w}(k))\|^2 \le L^2\|\bar{w}(k) - z^*\|^2$, which follows from $\nabla F(\cdot)$ being L-Lipschitz and $\nabla F(z^*) = 0$. Now, using the standard inequality

$$\nabla F(\bar{w}(k))^\top\left(\bar{w}(k) - z^*\right) \ge F(\bar{w}(k)) - F(z^*) + \frac{\mu}{2}\|\bar{w}(k) - z^*\|^2 \ge \mu\,\|\bar{w}(k) - z^*\|^2,$$

and Lemma 26, we obtain,

$$T_1 \le \left(1 - \frac{2}{k} + \frac{L^2}{\mu^2 k^2}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \frac{\Gamma_u^2}{\mu^2 k^2}\,\sigma^2 + \frac{B_\epsilon}{k^4}. \tag{44}$$

Now putting together (42), (43), and (44), we get,

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \left(1 - \frac{2}{k} + \frac{L^2}{\mu^2 k^2}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \frac{\Gamma_u^2\sigma^2}{\mu^2 k^2} + \frac{2B_\eta B_1}{k^2} + \frac{B_\eta^2 + B_\epsilon}{k^4}.$$

For large enough k, we can bound the inequality above as,

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \left(1 - \frac{1.5}{k}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \frac{B_2}{k^2}, \tag{45}$$

where $B_2 := \Gamma_u^2\sigma^2/\mu^2 + 2B_\eta B_1 + B_\eta^2 + B_\epsilon$. Using Lemma 29, stated next, we conclude $\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \to 0$. ■

Lemma 29 Let $a > 1$, $b \ge 0$, and let $\{x_t\}$ be a non-negative sequence which satisfies,

$$x_{t+1} \le \left(1 - \frac{a}{t}\right)x_t + \frac{b}{t^2}, \quad \text{for } t \ge t' > 0.$$

Then for all $t \ge t'$ we have,

$$x_t \le \frac{m}{t},$$

where $m := \max\{t'x_{t'},\ b/(a-1)\}$.

This lemma is stated and proved for t′ = 1 in (Rakhlin et al., 2012, Lemma 3), and the case of general t′ follows immediately.
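The lemma is also easy to verify numerically; the following sketch (with arbitrary illustrative constants of our own choosing) iterates the recursion and checks the bound $x_t \le m/t$ at every step:

```python
# Lemma 29 check: a = 1.5, b = 2, t' = 2, x_{t'} = 5 are illustrative values.
a, b, t0, x = 1.5, 2.0, 2, 5.0
m = max(t0 * x, b / (a - 1))     # m = max(t'*x_{t'}, b/(a-1)) = 10
for t in range(t0, 100000):
    assert x <= m / t + 1e-12, (t, x, m / t)
    x = (1 - a / t) * x + b / t**2
print("bound m/t holds up to t = 100000; m =", m)
```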

We are almost ready to complete the proof of Theorem 15; all that is needed is to refine the convergence rate of $\bar{w}(k)$ to $z^*$. Now, as a consequence of (45) and Lemma 29, we may use the inequality $\mathbb{E}[|X|] \le \sqrt{\mathbb{E}[X^2]}$ to obtain that

$$\mathbb{E}\left[\|\bar{w}(k) - z^*\|\right] = O_k\!\left(\frac{1}{\sqrt{k}}\right). \tag{46}$$

Furthermore, since $\mu k\,\bar{\varepsilon}(k)$ has bounded support by Corollary 25, we also have that

$$\mathbb{E}\left[\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\|\right] = O_k\!\left(\frac{1}{\sqrt{k}}\right). \tag{47}$$

We now use these observations to provide a proof of our main result.

Proof of Theorem 15 Essentially, we rewrite the proof of Lemma 28, but now using the fact that $\mathbb{E}[\|\bar{w}(k) - z^*\|] = O_k(1/\sqrt{k})$ from (46). This allows us to make two modifications to the arguments of that lemma. First, we can now replace (43) by

$$\mathbb{E}\left[2\|\eta(k)\|\left\|\bar{w}(k) - z^* - \frac{1}{\mu k}\nabla F(\bar{w}(k)) - \bar{\varepsilon}(k)\right\|\right] \le \frac{2B_\eta}{k^2}\,O_k\!\left(\frac{1}{\sqrt{k}}\right), \tag{48}$$

where we used (47). Second, putting together (42), (48), and (44), we obtain:

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \left(1 - \frac{2}{k} + \frac{L^2}{\mu^2 k^2}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right] + \frac{B_\eta^2}{k^4} + \frac{2B_\eta}{k^2}\,O_k\!\left(\frac{1}{\sqrt{k}}\right),$$

which, again using the fact that $\mathbb{E}[\|\bar{w}(k) - z^*\|^2] = O_k(1/k)$, we may rewrite as,

$$\mathbb{E}\left[\|\bar{w}(k+1) - z^*\|^2\right] \le \left(1 - \frac{2}{k}\right)\mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right] + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right] + O_k\!\left(\frac{1}{k^{2.5}}\right).$$

To save space, let us define $a_k := \mathbb{E}[\|\bar{w}(k) - z^*\|^2]$. Multiplying both sides of the relation above by $k^2$ we obtain,

$$a_{k+1}\,k^2 \le a_k\left(1 - \frac{2}{k}\right)k^2 + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right]k^2 + O_k(k^{-0.5}).$$

Note that,

$$\left(1 - \frac{2}{k}\right)k^2 = k^2 - 2k < (k-1)^2.$$

Thus,

$$a_{k+1}\,k^2 \le a_k\,(k-1)^2 + \mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right]k^2 + O_k(k^{-0.5}).$$

Summing the relation above for $k = 0, \ldots, T$ implies,

$$a_{T+1}\,T^2 \le \sum_{k=0}^{T}\mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right]k^2 + O_T(T^{0.5}).$$

Now, let us estimate the first term on the right-hand side of the relation above,

$$\sum_{k=0}^{T}\mathbb{E}\left[\|\bar{\varepsilon}(k)\|^2\right]k^2 \le \sum_{k=0}^{T}\sum_{i=1}^{n}\frac{\beta_i^2(k)}{n^2}\,\sigma_i^2\,\tau_i(k)\,k^2 = \sum_{i=1}^{n}\frac{\sigma_i^2}{\mu^2}\sum_{k=0}^{T}\nu_i(k)^2\,\tau_i(k) + O_T(\ln T),$$

where we used Lemma 24 in the last equality. Define $t_i(j)$ as the jth time agent i has woken up, and set $t_i(0) = -1$. Then we can rewrite the relation above as,

$$\sum_{k=0}^{T}\nu_i(k)^2\,\tau_i(k) = \sum_{j \ge 1:\ t_i(j) \le T}\left(t_i(j) - t_i(j-1)\right)^2 \le \sum_{j \ge 1:\ t_i(j) \le T}\Gamma_u\left(t_i(j) - t_i(j-1)\right) \le \Gamma_u\,(T+1).$$

Combining the relations above and then dividing both sides by $T^2$, we obtain,

$$a_{T+1} \le \frac{\Gamma_u\,\sigma^2}{\mu^2\,T} + O_T(T^{-1.5}). \tag{49}$$

We next argue that the same guarantee holds for every $z_i(k)$. Indeed, for each $i = 1, \ldots, n$,

$$\|z_i(k) - z^*\|^2 = \|z_i(k) - \bar{w}(k)\|^2 + 2\left(z_i(k) - \bar{w}(k)\right)^\top\left(\bar{w}(k) - z^*\right) + \|\bar{w}(k) - z^*\|^2 \le \|z_i(k) - \bar{w}(k)\|^2 + 2\,\|z_i(k) - \bar{w}(k)\|\,\|\bar{w}(k) - z^*\| + \|\bar{w}(k) - z^*\|^2.$$

Now from Corollary 22, we know that with probability one, $\|z_i(k) - \bar{w}(k)\| = O_k(1/k)$, hence $\|z_i(k) - \bar{w}(k)\|^2 = O_k(1/k^2)$. Taking expectations of both sides and using (49) along with the usual bound $\mathbb{E}[|X|] \le \sqrt{\mathbb{E}[X^2]}$, we have

$$\mathbb{E}\left[\|z_i(k) - z^*\|^2\right] = O_k\!\left(\frac{1}{k^2}\right) + O_k\!\left(\frac{1}{k^{1.5}}\right) + \mathbb{E}\left[\|\bar{w}(k) - z^*\|^2\right].$$

Putting this together with (49) completes the proof. ■

3.1. Time-Varying Graphs

We remark that Theorems 6, 14 and 15 all extend verbatim to the case of time-varying graphs with no message losses. Indeed, only one problem appears in extending the proofs in this paper to time-varying graphs: a node i may send a message to node j; that message may be lost; and afterwards node i never sends anything to node j again. In this case, Lemmas 7 and 11 do not hold. Indeed, examining Lemma 11, we observe that what can very well happen is that all of $\chi_i(k)$ and $\psi_i(k)$ are "lost" over time into messages that never arrive. However, as long as no messages are lost, the proofs in this paper extend to the time-varying case verbatim. On a technical level, the results still hold if $u_{ij}^x(k) = 0$, $u_{ij}^y(k) = 0$ (virtual node $c_{ij} \in V^A$ holds no lost message) when link (i,j) is removed from the network at time k, and the graph G stays strongly connected (or B-connected, i.e., there exists a positive integer B such that the union of every B consecutive graphs is strongly connected).

3.2. On the Bounds for Delays, Asynchrony, and Message Losses

It is natural to ask to what extent the assumption of finite upper bounds on delays, asynchrony, and message losses is really necessary. A natural example which falls outside our framework is a fixed graph G where, at each time step, every link in G appears with probability 1/2. A more general model might involve a different probability $p_e$ of failure for each edge e.

We observe that our result can already handle this case in the following manner. For simplicity, let us stick with the scenario where every link appears with probability 1/2. Then the probability that, after time t, some link has not appeared is at most $m(1/2)^t$, where m is the number of edges in G. This implies that if we choose $B = O(\log(mnT))$, then with high probability the sequence of graphs $G_1, \ldots, G_T$ is B-connected, as the union bound below makes precise.
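Written out (with a failure probability δ that we introduce for illustration), the union bound over the at most T windows of length B reads:

```latex
% Each of the m links is missing from a given window of length B with
% probability 2^{-B}, so over a horizon of T steps,
\Pr\bigl[\text{some window of length } B \text{ misses some link}\bigr]
  \;\le\; T\, m\, 2^{-B} \;\le\; \delta
\quad\text{whenever}\quad B \;\ge\; \log_2\!\frac{mT}{\delta}.
```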

Thus our theorem applies to this case, albeit at the expense of some logarithmic factors due to the choice of B. We remark that it is possible to get rid of these factors by directly analyzing the decrease in $\mathbb{E}[\|z(t) - z^*\|_2^2]$ coming from the random choice of the graph G. Since our arguments are already quite lengthy, we do not pursue this generalization here, and refer the reader to Lobel and Ozdaglar (2010) and Srivastava and Nedic (2011), where similar arguments have been made.

4. Numerical Simulations

4.1. Setup

In this section, we simulate the RASGP algorithm on two classes of graphs, namely, random directed graphs and bidirectional cycle graphs. The main objective function is chosen to be a strongly convex and smooth Support Vector Machine (SVM), i.e.,

$$F(\omega, \gamma) = \frac{1}{2}\left(\|\omega\|^2 + \gamma^2\right) + C_N\sum_{j=1}^{N}h\!\left(b_j\left(A_j^\top\omega + \gamma\right)\right),$$

where $\omega \in \mathbb{R}^{d-1}$ and $\gamma \in \mathbb{R}$ are the optimization variables, and $A_j \in \mathbb{R}^{d-1}$, $b_j \in \{-1, +1\}$, $j = 1, \ldots, N$, are the data points and their labels, respectively. The coefficient $C_N$ penalizes the points outside of the soft margin; it depends on the total number of data points N, and we set $C_N = c/N$ with $c = 500$ in our simulations. Here, $h : \mathbb{R} \to \mathbb{R}$ is the smoothed hinge loss, initially introduced in Rennie and Srebro (2005), defined as follows:

$$h(\xi) = \begin{cases} 0.5 - \xi, & \text{if } \xi < 0, \\ 0.5\,(1 - \xi)^2, & \text{if } 0 \le \xi < 1, \\ 0, & \text{if } 1 \le \xi. \end{cases}$$

To solve this problem in a distributed way, we suppose all data points are spread among the agents. Hence, the local objective functions are $f_i(\omega, \gamma) = \frac{1}{2n}\left(\|\omega\|^2 + \gamma^2\right) + C_N\sum_{j \in D_i}h\!\left(b_j\left(A_j^\top\omega + \gamma\right)\right)$, where $D_i \subset \{1, 2, \ldots, N\}$ is the index set of the data points of agent i. We choose the size of the data set for each local function to be a constant ($|D_i| = 50$), thus $N = 50n$. It is easy to check that each $f_i$ has Lipschitz gradients and is strongly convex with $\mu_i = 1/n$.
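For concreteness, here is a short sketch of the smoothed hinge loss, its derivative, and the resulting local gradient that an agent would feed into Algorithm 3 (the function and variable names are ours, not the paper's):

```python
import numpy as np

def h(xi):
    """Smoothed hinge loss of Rennie and Srebro (2005)."""
    return np.where(xi < 0, 0.5 - xi, np.where(xi < 1, 0.5 * (1 - xi) ** 2, 0.0))

def h_prime(xi):
    """Derivative of h: -1 for xi < 0, xi - 1 on [0, 1), and 0 for xi >= 1."""
    return np.where(xi < 0, -1.0, np.where(xi < 1, xi - 1.0, 0.0))

def local_grad(omega, gamma, A_i, b_i, C_N, n):
    """Gradient of f_i(omega, gamma) = (1/2n)(||omega||^2 + gamma^2)
    + C_N * sum_{j in D_i} h(b_j (A_j^T omega + gamma)).
    A_i has shape (|D_i|, d-1); b_i has entries in {-1, +1}."""
    margins = b_i * (A_i @ omega + gamma)   # shape (|D_i|,)
    coef = C_N * h_prime(margins) * b_i     # chain-rule factor per data point
    g_omega = omega / n + A_i.T @ coef
    g_gamma = gamma / n + coef.sum()
    return g_omega, g_gamma
```

Note that h is continuously differentiable (the pieces match in value and slope at 0 and 1), which is what makes each $f_i$ smooth.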

We will compare our results with a centralized gradient descent algorithm, which updates every Γu iterations using the step-size sequence αc(k) = Γu/(μk), in the direction of the sum of the gradients of all agents.

To make the gradient estimates stochastic, we add uniformly distributed noise $\varepsilon_i \sim U[-b/2, b/2]^d$ to the gradient estimates of each agent and $\varepsilon_c \sim U[-nb/2, nb/2]^d$ to the gradient of the centralized gradient descent, where $U[b_1, b_2]^d$ denotes the d-dimensional uniform distribution over the interval $[b_1, b_2)$, $b_1 < b_2$. Note that $\varepsilon_i$ and $\varepsilon_c$ are bounded and have zero mean, with $\mathbb{E}[\|\varepsilon_i\|^2] = db^2/12$ and $\mathbb{E}[\|\varepsilon_c\|^2] = dn^2b^2/12$. We set $b = 4$ for all simulations.
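The stated second moment follows from the per-coordinate variance $b^2/12$ of the uniform distribution; a quick sanity check of our own:

```python
import numpy as np

# For eps ~ U[-b/2, b/2]^d, each coordinate has variance b^2/12,
# hence E||eps||^2 = d * b^2 / 12.
rng = np.random.default_rng(2)
d, b = 3, 4.0
eps = rng.uniform(-b / 2, b / 2, size=(200000, d))
print((eps ** 2).sum(axis=1).mean(), d * b ** 2 / 12)  # both close to 4.0
```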

Agents wake up with probability $P_w$ and links fail with probability $P_f$, unless they reach their maximum allowed values, in which case the algorithm forces the agent to wake up or the link to work successfully. The link delays are chosen uniformly between 1 and $\Gamma_{del}$.

Each data set $D_i$ is synthetically generated by picking 25 data points around each of the centers (1,1) and (3,3), drawn from multivariate normal distributions and labeled −1 and +1, respectively. In generating strongly connected random graphs, we pick each edge with probability 0.5 and then check whether the resulting graph is strongly connected; if it is not, we repeat the process. Since the initial step-sizes for the distributed algorithm can be very large (e.g., α(1) = 50 for n = 50), to stabilize the algorithms, both algorithms are started at $k_0 = 100$. This does not affect the asymptotic convergence performance. Moreover, the initial point of the centralized algorithm and of all agents in the RASGP is chosen as $\mathbf{1}_d$.

Let us denote by $\bar{z}(k) := (1/n)\sum_{i=1}^{n}z_i(k)$ the average of the z-values of the non-virtual agents. Then, we define the optimization errors $E_{dist}(k) := \|\bar{z}(k) - z^*\|^2$ and $E_c(k) := \|x_c(k) - z^*\|^2$ for the RASGP and centralized stochastic gradient descent, respectively.

Since our performance guarantees are for the expectation of (squared) errors, for each network setting we perform up to 1000 Monte Carlo simulations and use their corresponding performance to estimate the average behavior of the algorithms. Since accurately estimating the true expected value requires an extremely large number of simulations, in order to alleviate the effect of spikes and high variance we take the following steps. First, a batch of simulations is performed and its average is calculated. Next, to obtain a smoother plot, an average over every 100 iterations is taken. Finally, the median of these outputs over all the batches is our estimate of the expected value.

We report two figures for each setting: one showing the errors $E_{dist}$ and $E_c$, and another showing $k \times E_{dist}$ and $k \times E_c$ to demonstrate the convergence rates.

Finally, to study the non-asymptotic behavior of the RASGP and its dependence on the network size n, we compare the performance of centralized stochastic gradient descent and the RASGP over a bidirectional cycle graph, with noise variances of $n^2\hat{\sigma}^2$ and $\sigma_i^2 = \hat{\sigma}^2$, respectively. Then, we plot the ratio $E_c(k)/E_{dist}(k)$ against n for different iterations k.

4.2. Results

Our simulation results are consistent with our theoretical claims: the performance of the centralized and decentralized methods grows closer over time, demonstrating the achievement of an asymptotic network-independent convergence rate.

Fig. 3 shows that when there is no link failure or delay and all agents wake up at every iteration (Γs = 2), RASGP and centralized gradient descent have very similar performance. When we allow links to have delays and failures (see Fig. 4), as well as asynchronous updates (see Fig. 5), it takes longer for RASGP to reach its asymptotic convergence rate.

Figure 3: Results on a directed cycle graph of size n = 50, synchronous with no delays and link failures ($P_w = 1$, $P_f = 0$, $\Gamma_{del} = \Gamma_f = 0$, $\Gamma_u = 1$, $\Gamma_s = 2$).

Figure 4: Results on a directed cycle graph of size n = 50, synchronous with delays and link failures ($P_w = 1$, $P_f = 0.3$, $\Gamma_{del} = \Gamma_f = 3$, $\Gamma_u = 1$, $\Gamma_s = 7$).

Figure 5: Results on a directed cycle graph of size n = 50, asynchronous with delays and link failures ($P_w = 0.5$, $P_f = 0.3$, $\Gamma_{del} = \Gamma_f = 3$, $\Gamma_u = 3$, $\Gamma_s = 17$).

We observe that, with all the other parameters fixed, the RASGP performs better on a random graph than on a cycle graph (see Figs. 5 and 6). A possible reason is that the cycle graph has a larger diameter and mixing time than the random graph, resulting in a slower decay of the consensus error.

Figure 6: Results on a directed random graph of size n = 50, asynchronous with delays and link failures ($P_w = 0.5$, $P_f = 0.3$, $\Gamma_{del} = \Gamma_f = 3$, $\Gamma_u = 3$, $\Gamma_s = 17$).

We notice that, fixing the network size, increasing the number of iterations brings us closer to a linear speed-up (see Fig. 7). On the other hand, fixing the number of iterations, increasing the number of nodes beyond a certain point does not help speed up the optimization. Moreover, when link delays and failures are allowed (see Fig. 7b), more iterations are required to achieve network independence.

Figure 7: Error ratio over network size. Shaded areas correspond to one standard deviation of the performance.

5. Conclusions

The main result of this paper is to establish asymptotically network-independent performance for a distributed stochastic optimization method over directed graphs with message losses, delays, and asynchronous updates. Our work raises several open questions.

The most natural question raised by this paper concerns the size of the transients. How long must the nodes wait until the network-independent performance bound is achieved? The answer, of course, will depend on the network, but also on the number of nodes, the degree of asynchrony, and the delays. Understanding how this quantity scales is required before the algorithms presented in this work can be recommended to practitioners.

More generally, it is interesting to ask which problems in distributed optimization can achieve network-independent performance, even asymptotically. For example, the usual bounds for distributed subgradient descent (see, e.g., Nedic et al., 2018) depend on the spectral gap of the underlying network; various worst-case scalings with the number of nodes can be derived, and the final asymptotics are not network-independent. It is not immediately clear whether this is due to the analysis, or a fundamental limitation that will not be overcome.

Acknowledgments

The authors acknowledge support for this project by the AFOSR under grant FA9550-15-1-0394, by the ONR under grant N000014-16-1-224 and MURI N00014-19-1-2571, by the NSF under grants IIS-1914792, DMS-1664644, and CNS-1645681, and by the NIH under grant 1R01GM135930. A preliminary version of the results in Section 2 has been published in the proceedings of the American Control Conference 2018 (Olshevsky et al., 2018).

Appendix A. Proof of Lemma 4

Proof We use mathematical induction. For $k = 0$ we have $x_{ij}^l(0) = 0$, ∀l, and $u_{ij}^x(0) = \phi_i^x(0) = \rho_{ji}^x(0) = 0$. By (6) and the definitions of $u_{ij}^x$ and $x_{ij}^l$ we obtain,

$$\rho_{ji}^x(1) = 0, \qquad u_{ij}^x(1) = \left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(0)\right)\phi_i^x(1), \qquad \sum_{l=1}^{\Gamma_d}x_{ij}^l(1) = \left(\sum_{l=1}^{\Gamma_d}\tau_{ij}^l(0)\right)\phi_i^x(1).$$

Equation (12) is concluded from the first equation above, and (13) results from summing up all three equations above.

Now assume the lemma is true for $k = 0, \ldots, K-1$. We want to show it is true for $k = K$ as well. In the following, LHS and RHS denote the left-hand side and right-hand side of (12) for $k = K$. By (6) we have,

$$\text{LHS} = \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K-l)\left[\phi_i^x(K+1-l) - \rho_{ji}^x(K)\right].$$

Using (11) we obtain,

$$\text{RHS} = \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K-l)\,v_{ij}^x(K-l).$$

Hence, it suffices to show that:

$$\sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K-l)\left[\phi_i^x(K+1-l) - \rho_{ji}^x(K) - v_{ij}^x(K-l)\right] = 0. \tag{50}$$

By part (e) of Assumption 1, at most one of the $\tau_{ij}^l(K-l)$, $l = 1, \ldots, \Gamma_d$, is non-zero. If all are zero, the result follows. Now suppose $\tau_{ij}^l(K-l) = 1$ for some l. Equation (50) becomes,

$$\phi_i^x(K+1-l) - \rho_{ji}^x(K) - v_{ij}^x(K-l) = 0.$$

Plugging in the definition of $v_{ij}^x$, after rearrangement we obtain,

$$\phi_i^x(K-l) - u_{ij}^x(K-l) = \rho_{ji}^x(K). \tag{51}$$

By the induction hypothesis, (12) holds for $k = K - t$, $t = 1, \ldots, l$. Therefore,

$$\rho_{ji}^x(K+1-t) - \rho_{ji}^x(K-t) = x_{ij}^1(K-t).$$

Hence,

$$\rho_{ji}^x(K) = \rho_{ji}^x(K-l) + \sum_{t=1}^{l}\left(\rho_{ji}^x(K+1-t) - \rho_{ji}^x(K-t)\right) = \rho_{ji}^x(K-l) + \sum_{t=1}^{l}x_{ij}^1(K-t) = \rho_{ji}^x(K-l) + \sum_{l'=1}^{l}x_{ij}^{l'}(K-l) \quad \text{(Lemma 3)} = \rho_{ji}^x(K-l) + \sum_{l'=1}^{\Gamma_d}x_{ij}^{l'}(K-l). \quad \text{(Lemma 2)}$$

Moreover, by the induction hypothesis, (13) holds for $k = K - l$; thus,

$$\phi_i^x(K-l) - u_{ij}^x(K-l) = \rho_{ji}^x(K-l) + \sum_{l=1}^{\Gamma_d}x_{ij}^l(K-l).$$

Combining the two relations above, we conclude (51).

To show (13), consider the following equations, which are direct results of the definitions and of (12), which we just showed for $k = K$:

$$u_{ij}^x(K+1) = \left(1 - \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K)\right)v_{ij}^x(K), \qquad \rho_{ji}^x(K+1) = \rho_{ji}^x(K) + x_{ij}^1(K), \qquad \sum_{l=1}^{\Gamma_d}x_{ij}^l(K+1) = \sum_{l=2}^{\Gamma_d}x_{ij}^l(K) + \sum_{l=1}^{\Gamma_d}\tau_{ij}^l(K)\,v_{ij}^x(K).$$

Summing up both sides of the equations above we have,

$$\text{LHS} = u_{ij}^x(K+1) + \rho_{ji}^x(K+1) + \sum_{l=1}^{\Gamma_d}x_{ij}^l(K+1),$$
$$\text{RHS} = \sum_{l=1}^{\Gamma_d}x_{ij}^l(K) + \rho_{ji}^x(K) + v_{ij}^x(K) = \sum_{l=1}^{\Gamma_d}x_{ij}^l(K) + \rho_{ji}^x(K) + u_{ij}^x(K) - \phi_i^x(K) + \phi_i^x(K+1) = \phi_i^x(K+1).$$

The last equality holds because of the induction hypothesis, (13) for $k = K-1$, hence completing the proof. ■

Footnotes

1.

It goes without saying that no analysis of distributed optimization can be wholly independent of the network or the number of nodes. Indeed, in a network of n nodes, the diameter can be as large as n − 1, which means that, in the worst case, no bounds on global performance can be obtained during the first n − 1 steps of any algorithm.

2.

Note the difference between the indexing in $\tau_{ij}^l$ and $\rho_{ji}^x$, which are both defined for link $(i,j) \in \mathcal{E}$.

References

  1. Agarwal Alekh and Duchi John C. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  2. Akbari Mohammad, Gharesifard Bahman, and Linder Tamas. Distributed online convex optimization on time-varying directed graphs. IEEE Transactions on Control of Network Systems, 4(3):417–428, 2017.
  3. Alpcan Tansu and Bauckhage Christian. A distributed machine learning framework. In 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pages 2546–2551. IEEE, 2009.
  4. Assran Mahmoud and Rabbat Michael. Asynchronous subgradient-push. arXiv preprint arXiv:1803.08950, 2018.
  5. Bénézit Florence, Blondel Vincent, Thiran Patrick, Tsitsiklis John, and Vetterli Martin. Weighted gossip: Distributed averaging using non-doubly stochastic matrices. In 2010 IEEE International Symposium on Information Theory (ISIT), pages 1753–1757. IEEE, 2010.
  6. Brisimi Theodora S, Chen Ruidi, Mela Theofanie, Olshevsky Alex, Paschalidis Ioannis Ch, and Shi Wei. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112:59–67, 2018.
  7. Chang Tsung-Hui, Hong Mingyi, Liao Wei-Cheng, and Wang Xiangfeng. Asynchronous distributed ADMM for large-scale optimization, Part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing, 64(12):3118–3130, 2016a.
  8. Chang Tsung-Hui, Liao Wei-Cheng, Hong Mingyi, and Wang Xiangfeng. Asynchronous distributed ADMM for large-scale optimization, Part II: Linear convergence analysis and numerical performance. IEEE Transactions on Signal Processing, 64(12):3131–3144, 2016b.
  9. Chen Jianshu and Sayed Ali H. On the learning behavior of adaptive networks, Part II: Performance analysis. IEEE Transactions on Information Theory, 61(6):3518–3548, 2015.
  10. Di Lorenzo Paolo and Scutari Gesualdo. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120–136, 2016.
  11. Dominguez-Garcia Alejandro D and Hadjicostis Christoforos N. Distributed matrix scaling and application to average consensus in directed graphs. IEEE Transactions on Automatic Control, 58(3):667–681, 2013.
  12. Domínguez-García Alejandro D and Hadjicostis Christoforos N. Convergence rate of a distributed algorithm for matrix scaling to doubly stochastic form. In 53rd IEEE Conference on Decision and Control, pages 3240–3245. IEEE, 2014.
  13. Feyzmahdavian Hamid Reza, Aytekin Arda, and Johansson Mikael. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 61(12):3740–3754, 2016.
  14. Gharesifard Bahman and Cortes Jorge. Distributed strategies for generating weight-balanced and doubly stochastic digraphs. European Journal of Control, 18(6):539–557, 2012.
  15. Hadjicostis Christoforos N, Vaidya Nitin H, and Domínguez-García Alejandro D. Robust distributed average consensus via exchange of running sums. IEEE Transactions on Automatic Control, 61(6):1492–1507, 2016.
  16. Hadjicostis Christoforos N, Dominguez-Garcia Alejandro D, and Charalambous Themistokis. Distributed averaging and balancing in network systems: with applications to coordination and control. Foundations and Trends in Systems and Control, 5(2–3):99–292, 2018.
  17. He Shibo, Shin Dong-Hoon, Zhang Junshan, Chen Jiming, and Sun Youxian. Full-view area coverage in camera sensor networks: Dimension reduction and near-optimal solutions. IEEE Transactions on Vehicular Technology, 65(9):7448–7461, 2015.
  18. Hong Mingyi. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM approach. IEEE Transactions on Control of Network Systems, 2017.
  19. Kempe David, Dobra Alin, and Gehrke Johannes. Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science, pages 482–491. IEEE, 2003.
  20. Koloskova Anastasiia, Stich Sebastian Urban, and Jaggi Martin. Decentralized stochastic optimization and gossip algorithms with compressed communication. Proceedings of Machine Learning Research, 97, 2019.
  21. Lan Guanghui, Lee Soomin, and Zhou Yi. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, pages 1–48, 2018.
  22. Li Mu, Andersen David G, Park Jun Woo, Smola Alexander J, Ahmed Amr, Josifovski Vanja, Long James, Shekita Eugene J, and Su Bor-Yiing. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014.
  23. Lian Xiangru, Huang Yijun, Li Yuncheng, and Liu Ji. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
  24. Lian Xiangru, Zhang Ce, Zhang Huan, Hsieh Cho-Jui, Zhang Wei, and Liu Ji. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
  25. Lian Xiangru, Zhang Wei, Zhang Ce, and Liu Ji. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning (ICML), pages 3043–3052, 2018.
  26. Lobel Ilan and Ozdaglar Asuman. Distributed subgradient methods for convex optimization over random networks. IEEE Transactions on Automatic Control, 56(6):1291–1306, 2010.
  27. Mansoori Fatemeh and Wei Ermin. Superlinearly convergent asynchronous distributed network Newton method. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2874–2879. IEEE, 2017.
  28. Morral Gemma, Bianchi Pascal, Fort Gersende, and Jakubowicz Jeremie. Distributed stochastic approximation: The price of non-double stochasticity. In Conference Record of the Forty-Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pages 1473–1477. IEEE, 2012.
  29. Morral Gemma, Bianchi Pascal, and Fort Gersende. Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks. IEEE Transactions on Signal Processing, 65(11):2798–2813, 2017.
  30. Nedic Angelia. Asynchronous broadcast-based convex optimization over a network. IEEE Transactions on Automatic Control, 56(6):1337–1351, 2011.
  31. Nedic Angelia and Olshevsky Alex. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
  32. Nedic Angelia and Olshevsky Alex. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control, 61(12):3936–3947, 2016.
  33. Nedic Angelia and Ozdaglar Asuman. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  34. Nedic Angelia, Olshevsky Alex, and Shi Wei. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
  35. Nedic Angelia, Olshevsky Alex, and Rabbat Michael G. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
  36. Nemirovski Arkadi, Juditsky Anatoli, Lan Guanghui, and Shapiro Alexander. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  37. Olshevsky Alex. Linear time average consensus and distributed optimization on fixed graphs. SIAM Journal on Control and Optimization, 55(6):3990–4014, 2017.
  38. Olshevsky Alex, Paschalidis Ioannis Ch, and Spiridonoff Artin. Fully asynchronous push-sum with growing intercommunication intervals. In American Control Conference, pages 591–596, 2018.
  39. Oreshkin Boris N, Coates Mark J, and Rabbat Michael G. Optimization and analysis of distributed averaging with short node memory. IEEE Transactions on Signal Processing, 58(5):2850–2865, 2010.
  40. Peng Zhouhua, Wang Jun, and Wang Dan. Distributed maneuvering of autonomous surface vehicles based on neurodynamic optimization and fuzzy approximation. IEEE Transactions on Control Systems Technology, 26(3):1083–1090, 2017.
  41. Pu Shi and Garcia Alfredo. A flocking-based approach for distributed stochastic optimization. Operations Research, 66(1):267–281, 2017.
  42. Pu Shi and Nedic Angelia. A distributed stochastic gradient tracking method. In 2018 IEEE Conference on Decision and Control (CDC), pages 963–968. IEEE, 2018.
  43. Qu Guannan and Li Na. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 2017.
  44. Qu Guannan and Li Na. Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control, 2019.
  45. Rakhlin Alexander, Shamir Ohad, and Sridharan Karthik. Making gradient descent optimal for strongly convex stochastic optimization. In 29th International Conference on Machine Learning (ICML), pages 1571–1578, 2012.
  46. Ram S Sundhar, Nedic Angelia, and Veeravalli Venugopal V. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516–545, 2010.
  47. Recht Benjamin, Re Christopher, Wright Stephen, and Niu Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
  48. Rennie Jason DM and Srebro Nathan. Loss functions for preference levels: Regression with discrete ordered labels. In IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pages 180–186. Kluwer, Norwell, MA, 2005.
  49. Scaman Kevin, Bach Francis, Bubeck Sebastien, Lee Yin Tat, and Massoulié Laurent. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In 34th International Conference on Machine Learning (ICML), Volume 70, pages 3027–3036. JMLR.org, 2017.
  50. Shi Wei, Ling Qing, Wu Gang, and Yin Wotao. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
  51. Sirb Benjamin and Ye Xiaojing. Consensus optimization with delayed and stochastic gradients on decentralized networks. In 2016 IEEE International Conference on Big Data (Big Data), pages 76–85. IEEE, 2016.
  52. Srivastava Kunal and Nedic Angelia. Distributed asynchronous constrained stochastic optimization. IEEE Journal of Selected Topics in Signal Processing, 5(4):772–790, 2011.
  53. Su Lili and Vaidya Nitin H. Fault-tolerant multi-agent optimization: optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pages 425–434. ACM, 2016a.
  54. Su Lili and Vaidya Nitin H. Non-Bayesian learning in the presence of Byzantine agents. In International Symposium on Distributed Computing, pages 414–427. Springer, 2016b.
  55. Su Lili and Vaidya Nitin H. Reaching approximate Byzantine consensus with multi-hop communication. Information and Computation, 255:352–368, 2017. ISSN 0890-5401. doi: 10.1016/j.ic.2016.12.003. URL http://www.sciencedirect.com/science/article/pii/S0890540116301262.
  56. Sun Ying, Scutari Gesualdo, and Palomar Daniel. Distributed nonconvex multiagent optimization over time-varying networks. In 50th Asilomar Conference on Signals, Systems and Computers, pages 788–794. IEEE, 2016.
  57. Tian Ye, Sun Ying, and Scutari Gesualdo. ASY-SONATA: Achieving linear convergence in distributed asynchronous multiagent optimization. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 543–551. IEEE, 2018.
  58. Tsianos Konstantinos I, Lawlor Sean, and Rabbat Michael G. Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning. In 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1543–1550. IEEE, 2012a.
  59. Tsianos Konstantinos I, Lawlor Sean, and Rabbat Michael G. Push-sum distributed dual averaging for convex optimization. In 2012 51st IEEE Conference on Decision and Control (CDC), pages 5453–5458. IEEE, 2012b.
  60. Tsitsiklis John, Bertsekas Dimitri, and Athans Michael. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.
  61. Wu Tianyu, Yuan Kun, Ling Qing, Yin Wotao, and Sayed Ali H. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293–307, 2018.
  62. Xi Chenguang and Khan Usman A. DEXTRA: A fast algorithm for optimization over directed graphs. IEEE Transactions on Automatic Control, 62(10):4980–4993, 2017a.
  63. Xi Chenguang and Khan Usman A. Distributed subgradient projection algorithm over directed graphs. IEEE Transactions on Automatic Control, 62(8):3986–3992, 2017b.
  64. Xi Chenguang, Xin Ran, and Khan Usman A. ADD-OPT: Accelerated distributed directed optimization. IEEE Transactions on Automatic Control, 63(5):1329–1339, 2018.
  65. Xu Jinming, Zhu Shanying, Soh Yeng Chai, and Xie Lihua. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 2055–2060. IEEE, 2015.
  66. Yuan Kun, Ling Qing, and Yin Wotao. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
