Author manuscript; available in PMC 2023 Nov 1. Published in final edited form as: IEEE Trans Automat Contr. 2021 Nov 9;67(11):5900–5915. doi: 10.1109/tac.2021.3126253

A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent

Shi Pu 1, Alex Olshevsky 2, Ioannis Ch Paschalidis 3
PMCID: PMC10241409  NIHMSID: NIHMS1845323  PMID: 37284602

Abstract

This paper is concerned with minimizing the average of n cost functions over a network in which agents may communicate and exchange information with each other. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, in expectation, DSGD asymptotically achieves the optimal network-independent convergence rate of centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach this asymptotic convergence rate. Moreover, we construct a "hard" optimization problem that proves the sharpness of the obtained result. Numerical experiments demonstrate the tightness of the theoretical results.

Index Terms—: distributed optimization, convex optimization, stochastic programming, stochastic gradient descent

I. Introduction

We consider the distributed optimization problem in which a group of agents $\mathcal{N}=\{1,2,\ldots,n\}$ collaboratively seek an $x\in\mathbb{R}^p$ that minimizes the average of $n$ cost functions:

$$\min_{x\in\mathbb{R}^p}f(x)=\frac{1}{n}\sum_{i=1}^{n}f_i(x).\qquad(1)$$

Each local cost function $f_i:\mathbb{R}^p\to\mathbb{R}$ is strongly convex, with Lipschitz continuous gradient, and is known to agent $i$ only. The agents communicate and exchange information over a network. Problems in the form of (1) find applications in multi-agent target seeking [1], distributed machine learning [2], [3], [4], [5], [6], [7], [8], and wireless networks [9], [10], [5], among other scenarios.

In order to solve (1), we assume that at each iteration $k\ge0$, the algorithm we study is able to obtain noisy gradient estimates $g_i(x_i(k),\xi_i(k))$, where $x_i(k)$ is the input for agent $i$, satisfying the following condition.

Assumption 1. For all $k\ge0$, each random vector $\xi_i(k)\in\mathbb{R}^m$ is independent across $i\in\mathcal{N}$. Denote by $\mathcal{F}(k)$ the $\sigma$-algebra generated by $\{x_i(0),x_i(1),\ldots,x_i(k)\}_{i\in\mathcal{N}}$. Then,

$$\mathbb{E}_{\xi_i(k)}\left[g_i\left(x_i(k),\xi_i(k)\right)\mid\mathcal{F}(k)\right]=\nabla f_i\left(x_i(k)\right),\qquad\mathbb{E}_{\xi_i(k)}\left[\left\|g_i\left(x_i(k),\xi_i(k)\right)-\nabla f_i\left(x_i(k)\right)\right\|^2\mid\mathcal{F}(k)\right]\le\sigma^2+M\left\|\nabla f_i\left(x_i(k)\right)\right\|^2,\quad\text{for some }\sigma,M>0.\qquad(2)$$

Stochastic gradients appear in many machine learning problems. For example, suppose $f_i(x):=\mathbb{E}_{\xi_i\sim\mathcal{D}_i}\left[F_i\left(x,\xi_i\right)\right]$ represents the expected loss function for agent $i$, where $\xi_i$ are independent data samples gathered over time, and $\mathcal{D}_i$ represents the data distribution. Then for any $x$ and $\xi_i$ sampled from $\mathcal{D}_i$, $g_i\left(x,\xi_i\right):=\nabla F_i\left(x,\xi_i\right)$ is an unbiased estimator of $\nabla f_i(x)$. As another example, suppose $f_i(x):=\frac{1}{|\mathcal{S}_i|}\sum_{\zeta_j\in\mathcal{S}_i}F\left(x,\zeta_j\right)$ denotes an empirical risk function, where $\mathcal{S}_i$ is the local dataset for agent $i$. In this setting, the gradient estimation of $f_i(x)$ can incur noise from various sources, such as minibatch random sampling of the local dataset and discretization for reducing communication cost [11].
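The empirical-risk example above can be checked numerically. The sketch below (not from the paper; the least-squares loss $F(x,\zeta)=\frac12(z^\top x-y)^2$ and all constants are hypothetical) verifies that the gradient of a uniformly sampled minibatch is an unbiased estimator of the full empirical-risk gradient:

```python
import numpy as np

# Hypothetical local dataset S_i: features Z and labels y for one agent.
rng = np.random.default_rng(0)
n_samples, p = 200, 5
Z = rng.normal(size=(n_samples, p))
y = rng.normal(size=n_samples)

def full_gradient(x):
    # gradient of f_i(x) = (1/|S_i|) * sum_j 0.5*(z_j^T x - y_j)^2
    return Z.T @ (Z @ x - y) / n_samples

def minibatch_gradient(x, batch_size, rng):
    # uniform sampling with replacement -> unbiased estimator of full_gradient(x)
    idx = rng.integers(0, n_samples, size=batch_size)
    return Z[idx].T @ (Z[idx] @ x - y[idx]) / batch_size

x = rng.normal(size=p)
# averaging many minibatch gradients should approach the full gradient
est = np.mean([minibatch_gradient(x, 10, rng) for _ in range(20000)], axis=0)
```

By the law of large numbers, `est` converges to `full_gradient(x)` as the number of sampled minibatches grows, which is exactly the unbiasedness required by Assumption 1.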

Problem (1) has been studied extensively in the literature under various distributed algorithms [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], among which the distributed gradient descent (DGD) method proposed in [13] has drawn the greatest attention. Recently, distributed implementation of stochastic gradient algorithms has received considerable interest. Several works have shown that distributed methods can be comparable to their centralized counterparts under certain conditions. For example, the works in [24], [25], [26] first showed that, with a sufficiently small constant stepsize, a distributed stochastic gradient method achieves performance comparable to that of a centralized method in terms of the steady-state mean-square error.

Despite the aforementioned efforts, it is unclear how long, or how many iterations, it takes for a distributed stochastic gradient method to reach the convergence rate of centralized SGD. The number of required iterations, called the "transient time" of the algorithm, is a key measurement of the performance of the distributed implementation. In this work, we perform a non-asymptotic analysis for the distributed stochastic gradient descent (DSGD) method adapted from DGD and the diffusion strategy [1].1 In addition to showing that in expectation, the algorithm asymptotically achieves the optimal convergence rate enjoyed by a centralized scheme, we precisely identify its non-asymptotic convergence rate as a function of characteristics of the objective functions and the network (e.g., the spectral gap $1-\rho_w$ of the mixing matrix). Furthermore, we characterize the transient time needed for DSGD to achieve the optimal rate of convergence, which behaves as $\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$ assuming certain conditions on the objective functions, stepsize policy and initial solutions. Finally, we construct a "hard" optimization problem for which we show the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, implying the obtained transient time is sharp. These results are new to the best of our knowledge.

A. Related Works

We briefly discuss the related literature on (distributed) stochastic optimization. First of all, our work is related to stochastic approximation (SA) methods dating back to the seminal works [27] and [28]. For a strongly convex objective function f with Lipschitz continuous gradients, it has been shown that the optimal convergence rate for solving problem (1) is $\mathcal{O}\left(\frac{1}{k}\right)$ under a diminishing stepsize policy [29].

Distributed stochastic gradient methods have received much attention in recent years. For nonsmooth convex objective functions, the work in [30] considered distributed constrained optimization and established asymptotic convergence to the optimal set using two diminishing stepsizes to account for communication noise and subgradient errors, respectively. The paper [31] proposed a distributed dual averaging method that exhibits a convergence rate of $\mathcal{O}\left(\frac{n\log k}{(1-\lambda_2(W))k}\right)$ under a carefully chosen SA stepsize sequence, where $\lambda_2(W)$ is the second largest singular value of the mixing matrix $W$. A projected stochastic gradient algorithm was considered in [32] for solving nonconvex optimization problems by combining a local stochastic gradient update and a gossip step. This work proved that consensus is asymptotically achieved and the solutions converge to the set of KKT points with SA stepsizes. In [33], the authors proposed an adaptive diffusion algorithm based on penalty methods and showed that the expected optimization error is bounded by $\mathcal{O}(\alpha)$ under a constant stepsize $\alpha$. The work in [34] considered distributed constrained convex optimization under multiple noise terms in both computation and communication stages. By means of an augmented Lagrangian framework, almost sure convergence with a diminishing stepsize policy was established. [35] investigated a subgradient-push method for distributed optimization over time-varying directed graphs. For strongly convex objective functions, the method exhibits an $\mathcal{O}\left(\frac{\ln k}{k}\right)$ convergence rate. The work in [36] used a time-dependent weighted mixing of stochastic subgradient updates to achieve an $\mathcal{O}\left(\frac{n\sqrt{n}}{(1-\lambda_2(W))k}\right)$ convergence rate for minimizing the sum of nonsmooth strongly convex functions. [37] presented a new class of distributed first-order methods for nonsmooth and stochastic optimization which was shown to exhibit an $\mathcal{O}\left(\frac{1}{k}\right)$ (respectively, $\mathcal{O}\left(\frac{1}{\sqrt{k}}\right)$) convergence rate for minimizing the sum of strongly convex functions (respectively, general convex functions).
The work in [38] considered a decentralized algorithm with delayed gradient information, which achieves an $\mathcal{O}\left(\frac{1}{\sqrt{k}}\right)$ rate of convergence for general convex functions. In [39], an $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate was established for strongly convex costs and random networks. Recently, the work in [40] proposed a variance-reduced decentralized stochastic optimization method with gradient tracking.

Several recent works have shown that distributed methods can be comparable to centralized algorithms under various conditions. In addition to [24], [25], [26] discussed before, [41], [42] proved that distributed stochastic approximation performs asymptotically as well as centralized schemes by means of a central limit theorem. [43] first showed that a distributed stochastic gradient algorithm asymptotically achieves a convergence rate comparable to that of a centralized method, but assuming that all the local functions fi have the same minimum. [44], [45] demonstrated the advantage of distributively implementing a stochastic gradient method assuming that sampling times are random and non-negligible. For nonconvex objective functions, [46] proved that decentralized algorithms can achieve a linear speedup similar to a centralized algorithm when k is large enough. This result was generalized to the setting of directed communication networks in [47] for training deep neural networks. The work in [48] considered a distributed stochastic gradient tracking method which performs as well as centralized stochastic gradient descent under a small enough constant stepsize. A recent paper [49] discussed an algorithm that asymptotically performs as well as the best bounds on centralized stochastic gradient descent subject to possible message losses, delays, and asynchrony. In a parallel recent work [50], a similar result was demonstrated with a further compression technique which allowed nodes to save on communication. For more discussion on the topic of achieving asymptotic network independence in distributed stochastic optimization, the readers are referred to a recent survey [51].

B. Main Contribution

We next summarize the main contributions of the paper. First, we perform a non-asymptotic convergence analysis of the distributed stochastic gradient descent (DSGD) method. For strongly convex and smooth objective functions, in expectation, DSGD asymptotically achieves the optimal network-independent convergence rate of centralized stochastic gradient descent (SGD). We explicitly identify the non-asymptotic convergence rate as a function of characteristics of the objective functions and the network. The relevant results are established in Corollary 1 and Theorem 1.

Our main contribution is to characterize the transient time needed for DSGD to approach the asymptotic convergence rate. On the one hand, we show an upper bound of $K_T=\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$, where $1-\rho_w$ denotes the spectral gap of the mixing matrix of communicating agents. On the other hand, we construct a "hard" optimization problem for which we show that the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, implying that this upper bound is sharp.

Additionally, we provide numerical experiments that demonstrate the tightness of the theoretical findings. In particular, for the ring network topology and the square grid network topology, simulations are consistent with the transient time $K_T=\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$ for solving the on-line ridge regression problem.

C. Notation

Vectors are column vectors unless otherwise specified. Each agent $i$ holds a local copy of the decision vector, denoted by $x_i\in\mathbb{R}^p$, and its value at iteration/time $k$ is written as $x_i(k)$. Let

$$\mathbf{x}:=\left[x_1,x_2,\ldots,x_n\right]^\top\in\mathbb{R}^{n\times p},\qquad\bar{x}:=\frac{1}{n}\sum_{i=1}^{n}x_i,$$

where $\top$ denotes transposition. Define an aggregate objective function

$$F(\mathbf{x}):=\sum_{i=1}^{n}f_i\left(x_i\right),$$

and let

$$\nabla F(\mathbf{x}):=\left[\nabla f_1\left(x_1\right),\nabla f_2\left(x_2\right),\ldots,\nabla f_n\left(x_n\right)\right]^\top\in\mathbb{R}^{n\times p},\qquad\overline{\nabla F}(\mathbf{x}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f_i\left(x_i\right).$$

In addition, we denote

$$\boldsymbol{\xi}:=\left[\xi_1,\xi_2,\ldots,\xi_n\right]^\top\in\mathbb{R}^{n\times m},$$
$$g(\mathbf{x},\boldsymbol{\xi}):=\left[g_1\left(x_1,\xi_1\right),g_2\left(x_2,\xi_2\right),\ldots,g_n\left(x_n,\xi_n\right)\right]^\top\in\mathbb{R}^{n\times p}.$$

In what follows we write $g_i(k):=g_i\left(x_i(k),\xi_i(k)\right)$ and $g(k):=g(\mathbf{x}(k),\boldsymbol{\xi}(k))$ for short.

The inner product of two vectors $a,b$ is written as $\langle a,b\rangle$. For two matrices $A,B\in\mathbb{R}^{n\times p}$, let $\langle A,B\rangle:=\sum_{i=1}^{n}\langle A_i,B_i\rangle$, where $A_i$ (respectively, $B_i$) is the $i$-th row of $A$ (respectively, $B$). We use $\|\cdot\|$ to denote the 2-norm of vectors and the Frobenius norm of matrices.

A graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ has a set of vertices (nodes) $\mathcal{N}=\{1,2,\ldots,n\}$ and a set of edges $\mathcal{E}\subseteq\mathcal{N}\times\mathcal{N}$ connecting vertices. Suppose agents interact over an undirected graph, i.e., $(i,j)\in\mathcal{E}$ if and only if $(j,i)\in\mathcal{E}$. Each agent $i$ has a set of neighbors $\mathcal{N}_i=\{j\mid j\neq i,\ (i,j)\in\mathcal{E}\}$.

Denote the mixing matrix of the agents by $W=\left[w_{ij}\right]\in\mathbb{R}^{n\times n}$. Two agents $i$ and $j$ are connected if and only if $w_{ij},w_{ji}>0$ ($w_{ij}=w_{ji}=0$ otherwise). Formally, we make the following assumption on the communication among agents.

Assumption 2. The graph $\mathcal{G}$ is undirected and connected (there exists a path between any two nodes). The mixing matrix $W$ is nonnegative and doubly stochastic, i.e., $W\mathbf{1}=\mathbf{1}$ and $\mathbf{1}^\top W=\mathbf{1}^\top$, where $\mathbf{1}$ denotes the vector of all ones.

From Assumption 2, we have the following contraction property of W (see [20]).

Lemma 1. Let Assumption 2 hold, and let $\rho_w$ denote the spectral norm of the matrix $W-\frac{1}{n}\mathbf{1}\mathbf{1}^\top$. Then, $\rho_w<1$ and

$$\left\|W\omega-\mathbf{1}\bar{\omega}\right\|\le\rho_w\left\|\omega-\mathbf{1}\bar{\omega}\right\|$$

for all $\omega\in\mathbb{R}^{n\times p}$, where $\bar{\omega}:=\frac{1}{n}\mathbf{1}^\top\omega$.
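The contraction property in Lemma 1 is easy to verify numerically. The sketch below (an illustration, assuming a ring graph with weight $1/3$ to self and to each neighbor, which is one simple doubly stochastic choice) computes $\rho_w$ as the spectral norm of $W-\frac{1}{n}\mathbf{1}\mathbf{1}^\top$ and checks the inequality for a random $\omega$:

```python
import numpy as np

def ring_mixing_matrix(n):
    # doubly stochastic weights on a ring: 1/3 to self and to each of the
    # two ring neighbors (a hypothetical choice satisfying Assumption 2)
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

n, p = 10, 3
W = ring_mixing_matrix(n)
# rho_w = spectral norm of W - (1/n) * 1 1^T
rho_w = np.linalg.norm(W - np.ones((n, n)) / n, 2)

rng = np.random.default_rng(1)
omega = rng.normal(size=(n, p))
omega_bar = omega.mean(axis=0)
lhs = np.linalg.norm(W @ omega - omega_bar)      # ||W w - 1 w_bar|| (Frobenius)
rhs = rho_w * np.linalg.norm(omega - omega_bar)  # rho_w * ||w - 1 w_bar||
```

For a connected graph with such weights one observes $\rho_w<1$ and `lhs <= rhs`, consistent with the lemma.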

The rest of this paper is organized as follows. We present the DSGD algorithm and some preliminary results in Section II. In Section III we prove the sublinear convergence rate of the algorithm. Our main convergence results and a comparison with the centralized stochastic gradient method are in Section IV. Two numerical examples are presented in Section V, and we conclude the paper in Section VI.

II. Distributed Stochastic Gradient Descent

We consider the following DSGD method adapted from DGD and the diffusion strategy [1]: at each step k ≥ 0, every agent i independently performs the update:

$$x_i(k+1)=\sum_{j\in\mathcal{N}_i\cup\{i\}}w_{ij}\left(x_j(k)-\alpha_kg_j(k)\right),\qquad(3)$$

where $\{\alpha_k\}$ is a sequence of non-increasing stepsizes. The particular choice of the stepsize sequence will be introduced in Section III. The initial vectors $x_i(0)$ are arbitrary for all $i\in\mathcal{N}$. Since $w_{ij}=0$ when agents $i$ and $j$ are not connected in the network, we can rewrite (3) in the following compact form:

$$\mathbf{x}(k+1)=W\left(\mathbf{x}(k)-\alpha_kg(k)\right).\qquad(4)$$
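The compact update (4) translates directly into a few lines of code. The following is a minimal simulation sketch, not the paper's experimental setup: the local objectives $f_i(x)=\frac12\|x-b_i\|^2$ (so $\mu=L=1$ and $x^*$ is the mean of the $b_i$), the additive Gaussian gradient noise, and the constants $\theta$, $K$, $\sigma$ are all hypothetical illustrative choices:

```python
import numpy as np

def dsgd(W, b, sigma, theta, mu, K, num_iters, seed=0):
    # DSGD update (4): x(k+1) = W (x(k) - alpha_k g(k)),
    # with alpha_k = theta / (mu * (k + K)) as in stepsize policy (12).
    rng = np.random.default_rng(seed)
    n, p = b.shape
    x = np.zeros((n, p))                               # arbitrary initialization
    for k in range(num_iters):
        alpha = theta / (mu * (k + K))
        g = (x - b) + sigma * rng.normal(size=(n, p))  # noisy local gradients
        x = W @ (x - alpha * g)
    return x

# ring network, doubly stochastic weights: 1/3 to self and each neighbor
n, p = 10, 2
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

rng = np.random.default_rng(1)
b = rng.normal(size=(n, p))
x_final = dsgd(W, b, sigma=0.1, theta=2.0, mu=1.0, K=10, num_iters=2000)
# all agents should end up close to x* = b.mean(axis=0)
```

With a diminishing stepsize, the iterates of all agents approach consensus on the minimizer $x^*=\frac1n\sum_i b_i$, as the theory below quantifies.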

Throughout the paper, we make the following standing assumption regarding the objective functions $f_i$. These assumptions are satisfied in many machine learning problems, such as linear regression, smooth support vector machines (SVM), logistic regression, and softmax regression.

Assumption 3. Each $f_i:\mathbb{R}^p\to\mathbb{R}$ is $\mu$-strongly convex with $L$-Lipschitz continuous gradients, i.e., for any $x,x'\in\mathbb{R}^p$,

$$\left\langle\nabla f_i(x)-\nabla f_i\left(x'\right),x-x'\right\rangle\ge\mu\left\|x-x'\right\|^2,\qquad\left\|\nabla f_i(x)-\nabla f_i\left(x'\right)\right\|\le L\left\|x-x'\right\|.\qquad(5)$$

Under Assumption 3, problem (1) has a unique optimal solution $x^*\in\mathbb{R}^p$, and the following result holds (see [20], Lemma 10).

Lemma 2. For any $x\in\mathbb{R}^p$ and $\alpha\in(0,2/L)$, we have

$$\left\|x-\alpha\nabla f(x)-x^*\right\|\le\lambda\left\|x-x^*\right\|,$$

where $\lambda=\max\left(|1-\alpha\mu|,|1-\alpha L|\right)$.
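Lemma 2 can be checked numerically. The sketch below uses a hypothetical quadratic $f(x)=\frac12x^\top Ax$ with $\mu I\preceq A\preceq LI$ (so $x^*=0$) and verifies the contraction factor $\lambda$ for several stepsizes in $(0,2/L)$:

```python
import numpy as np

# hypothetical constants: mu-strongly convex, L-smooth quadratic
mu, L = 1.0, 4.0
A = np.diag(np.linspace(mu, L, 5))   # eigenvalues spread between mu and L
rng = np.random.default_rng(3)
x = rng.normal(size=5)

checks = []
for alpha in (0.1, 0.25, 0.45):      # all within (0, 2/L) = (0, 0.5)
    lam = max(abs(1 - alpha * mu), abs(1 - alpha * L))
    step = x - alpha * (A @ x)       # x - alpha * grad f(x), with x* = 0
    checks.append((np.linalg.norm(step), lam * np.linalg.norm(x)))
```

Each gradient step shrinks the distance to $x^*$ by at least the factor $\lambda=\max(|1-\alpha\mu|,|1-\alpha L|)<1$, which is the contraction used repeatedly in the analysis below.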

Denote $\bar{g}(k):=\frac{1}{n}\sum_{i=1}^{n}g_i(k)$. The following two results are useful for our analysis.

Lemma 3. Under Assumptions 1 and 3, for all $k\ge0$,

$$\mathbb{E}\left[\left\|\bar{g}(k)-\overline{\nabla F}(\mathbf{x}(k))\right\|^2\mid\mathcal{F}(k)\right]\le\frac{2ML^2}{n^2}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{\bar{M}}{n},$$

where

$$\bar{M}:=\frac{2M\sum_{i=1}^{n}\left\|\nabla f_i\left(x^*\right)\right\|^2}{n}+\sigma^2.$$

Proof. By the definitions of $\bar{g}(k)$, $\overline{\nabla F}(\mathbf{x}(k))$ and Assumption 1, we have

$$\mathbb{E}\left[\left\|\bar{g}(k)-\overline{\nabla F}(\mathbf{x}(k))\right\|^2\mid\mathcal{F}(k)\right]=\mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}g_i(k)-\frac{1}{n}\sum_{i=1}^{n}\nabla f_i\left(x_i(k)\right)\right\|^2\mid\mathcal{F}(k)\right]=\frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left[\left\|g_i(k)-\nabla f_i\left(x_i(k)\right)\right\|^2\mid\mathcal{F}(k)\right]\le\frac{\sigma^2}{n}+\frac{M\sum_{i=1}^{n}\left\|\nabla f_i\left(x_i(k)\right)\right\|^2}{n^2}.$$

Notice that $\left\|\nabla f_i\left(x_i(k)\right)\right\|^2=\left\|\nabla f_i\left(x_i(k)\right)-\nabla f_i\left(x^*\right)+\nabla f_i\left(x^*\right)\right\|^2\le2\left\|\nabla f_i\left(x_i(k)\right)-\nabla f_i\left(x^*\right)\right\|^2+2\left\|\nabla f_i\left(x^*\right)\right\|^2\le2L^2\left\|x_i(k)-x^*\right\|^2+2\left\|\nabla f_i\left(x^*\right)\right\|^2$ from Assumption 3. We have

$$\mathbb{E}\left[\left\|\bar{g}(k)-\overline{\nabla F}(\mathbf{x}(k))\right\|^2\mid\mathcal{F}(k)\right]\le\frac{2ML^2}{n^2}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{1}{n}\left(\frac{2M\sum_{i=1}^{n}\left\|\nabla f_i\left(x^*\right)\right\|^2}{n}+\sigma^2\right).$$

Lemma 4. Under Assumption 3, for all $k\ge0$,

$$\left\|\nabla f(\bar{x}(k))-\overline{\nabla F}(\mathbf{x}(k))\right\|\le\frac{L}{\sqrt{n}}\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|.\qquad(6)$$

Proof. By definition,

$$\left\|\nabla f(\bar{x}(k))-\overline{\nabla F}(\mathbf{x}(k))\right\|=\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\bar{x}(k))-\frac{1}{n}\sum_{i=1}^{n}\nabla f_i\left(x_i(k)\right)\right\|\overset{\text{(Assumption 3)}}{\le}\frac{L}{n}\sum_{i=1}^{n}\left\|\bar{x}(k)-x_i(k)\right\|\le\frac{L}{\sqrt{n}}\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|,$$

where the last relation follows from Hölder's inequality. □

A. Preliminary Results

In this section, we present some preliminary results concerning $\mathbb{E}\left\|\bar{x}(k)-x^*\right\|^2$ (expected optimization error) and $\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2$ (expected consensus error). Specifically, we bound the two terms by linear combinations of their values at the previous iteration.

For ease of presentation, for all k we denote

$$U(k):=\mathbb{E}\left\|\bar{x}(k)-x^*\right\|^2,\qquad V(k):=\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2.$$

In the lemma below, we bound the optimization error $U(k+1)$ by several error terms at iteration $k$, including the consensus error $V(k)$. It serves as a starting point for the follow-up analysis.

Lemma 5. Suppose Assumptions 1–3 hold. Under Algorithm (4), supposing $\alpha_k\le\frac{1}{L}$, we have

$$U(k+1)\le\left(1-\alpha_k\mu\right)^2U(k)+\frac{2\alpha_kL}{\sqrt{n}}\left(1-\alpha_k\mu\right)\mathbb{E}\left[\left\|\bar{x}(k)-x^*\right\|\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|\right]+\frac{\alpha_k^2L^2}{n}V(k)+\alpha_k^2\left(\frac{2ML^2}{n^2}\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{\bar{M}}{n}\right).\qquad(7)$$

Proof. By the definitions of $\bar{x}(k)$, $\bar{g}(k)$ and relation (4), we have $\bar{x}(k+1)=\bar{x}(k)-\alpha_k\bar{g}(k)$. Hence,

$$\left\|\bar{x}(k+1)-x^*\right\|^2=\left\|\bar{x}(k)-\alpha_k\bar{g}(k)-x^*\right\|^2=\left\|\bar{x}(k)-\alpha_k\overline{\nabla F}(\mathbf{x}(k))-x^*+\alpha_k\overline{\nabla F}(\mathbf{x}(k))-\alpha_k\bar{g}(k)\right\|^2=\left\|\bar{x}(k)-\alpha_k\overline{\nabla F}(\mathbf{x}(k))-x^*\right\|^2+\alpha_k^2\left\|\overline{\nabla F}(\mathbf{x}(k))-\bar{g}(k)\right\|^2+2\alpha_k\left\langle\bar{x}(k)-\alpha_k\overline{\nabla F}(\mathbf{x}(k))-x^*,\overline{\nabla F}(\mathbf{x}(k))-\bar{g}(k)\right\rangle.$$

Noting that $\mathbb{E}\left[\bar{g}(k)\mid\mathcal{F}(k)\right]=\overline{\nabla F}(\mathbf{x}(k))$ and $\mathbb{E}\left[\left\|\bar{g}(k)-\overline{\nabla F}(\mathbf{x}(k))\right\|^2\mid\mathcal{F}(k)\right]\le\frac{2ML^2}{n^2}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{\bar{M}}{n}$ from Lemma 3, we obtain

$$\mathbb{E}\left[\left\|\bar{x}(k+1)-x^*\right\|^2\mid\mathcal{F}(k)\right]\le\left\|\bar{x}(k)-\alpha_k\overline{\nabla F}(\mathbf{x}(k))-x^*\right\|^2+\alpha_k^2\left(\frac{2ML^2}{n^2}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{\bar{M}}{n}\right).\qquad(8)$$

We next bound the first term on the right-hand side of (8):

$$\left\|\bar{x}(k)-\alpha_k\overline{\nabla F}(\mathbf{x}(k))-x^*\right\|^2=\left\|\bar{x}(k)-\alpha_k\nabla f(\bar{x}(k))-x^*+\alpha_k\nabla f(\bar{x}(k))-\alpha_k\overline{\nabla F}(\mathbf{x}(k))\right\|^2\le\left\|\bar{x}(k)-\alpha_k\nabla f(\bar{x}(k))-x^*\right\|^2+2\alpha_k\left\|\bar{x}(k)-\alpha_k\nabla f(\bar{x}(k))-x^*\right\|\left\|\nabla f(\bar{x}(k))-\overline{\nabla F}(\mathbf{x}(k))\right\|+\alpha_k^2\left\|\nabla f(\bar{x}(k))-\overline{\nabla F}(\mathbf{x}(k))\right\|^2,$$

where we used the Cauchy–Schwarz inequality. By Lemma 4,

$$\left\|\nabla f(\bar{x}(k))-\overline{\nabla F}(\mathbf{x}(k))\right\|^2\le\frac{L^2}{n}\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2.$$

Since $\alpha_k\le\frac{1}{L}$, in light of Lemma 2,

$$\left\|\bar{x}(k)-\alpha_k\nabla f(\bar{x}(k))-x^*\right\|^2\le\left(1-\alpha_k\mu\right)^2\left\|\bar{x}(k)-x^*\right\|^2.$$

Then we have

$$\left\|\bar{x}(k)-\alpha_k\overline{\nabla F}(\mathbf{x}(k))-x^*\right\|^2\le\left(1-\alpha_k\mu\right)^2\left\|\bar{x}(k)-x^*\right\|^2+\frac{2\alpha_kL}{\sqrt{n}}\left(1-\alpha_k\mu\right)\left\|\bar{x}(k)-x^*\right\|\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|+\frac{\alpha_k^2L^2}{n}\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2.\qquad(9)$$

In light of relation (9), taking full expectation on both sides of relation (8) yields the result. □

The next result is a corollary of Lemma 5 with an additional condition on the stepsize αk. We are able to remove the cross term in relation (7) and obtain a cleaner expression, which facilitates our later analysis.

Lemma 6. Suppose Assumptions 1–3 hold. Under Algorithm (4), supposing $\alpha_k\le\min\left\{\frac{1}{L},\frac{1}{3\mu}\right\}$, then

$$U(k+1)\le\left(1-\frac{3}{2}\alpha_k\mu\right)U(k)+\frac{3\alpha_kL^2}{n\mu}V(k)+\alpha_k^2\left(\frac{2ML^2}{n^2}\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{\bar{M}}{n}\right).\qquad(10)$$

Proof. From Lemma 5,

$$U(k+1)\le\left(1-\alpha_k\mu\right)^2U(k)+\left(1-\alpha_k\mu\right)^2cU(k)+\frac{\alpha_k^2L^2}{nc}V(k)+\frac{\alpha_k^2L^2}{n}V(k)+\alpha_k^2\left(\frac{2ML^2}{n^2}\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{\bar{M}}{n}\right)\le(1+c)\left(1-\alpha_k\mu\right)^2U(k)+\left(1+\frac{1}{c}\right)\frac{\alpha_k^2L^2}{n}V(k)+\alpha_k^2\left(\frac{2ML^2}{n^2}\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{\bar{M}}{n}\right),$$

where c > 0 is arbitrary.

Take $c=\frac{3}{8}\alpha_k\mu$. Noting that $\alpha_k\le\frac{1}{3\mu}$, we have $(1+c)\left(1-\alpha_k\mu\right)^2\le1-\frac{3}{2}\alpha_k\mu$ and $\left(1+\frac{1}{c}\right)\alpha_k\le\frac{3}{\mu}$. Thus,

$$U(k+1)\le\left(1-\frac{3}{2}\alpha_k\mu\right)U(k)+\frac{3\alpha_kL^2}{n\mu}V(k)+\alpha_k^2\left(\frac{2ML^2}{n^2}\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{\bar{M}}{n}\right).$$

Since the consensus error term V (k) plays a key role in the statements of Lemma 5 and Lemma 6, we present the following lemma that bounds V (k + 1).

Lemma 7. Suppose Assumptions 1–3 hold. Under Algorithm (4), for all $k\ge0$,

$$V(k+1)\le\frac{3+\rho_w^2}{4}V(k)+\alpha_k^2\rho_w^2n\sigma^2+2\alpha_k^2\rho_w^2\left(\frac{3}{1-\rho_w^2}+M\right)\left(L^2\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\left\|\nabla F\left(\mathbf{1}x^*\right)\right\|^2\right).\qquad(11)$$

Proof. From relation (4),

$$\mathbf{x}(k+1)-\mathbf{1}\bar{x}(k+1)=W\left(\mathbf{x}(k)-\alpha_kg(k)\right)-\mathbf{1}\left(\bar{x}(k)-\alpha_k\bar{g}(k)\right)=\left(W-\frac{\mathbf{1}\mathbf{1}^\top}{n}\right)\left[\left(\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right)-\alpha_k\left(g(k)-\mathbf{1}\bar{g}(k)\right)\right],$$

so that, by Lemma 1,

$$\left\|\mathbf{x}(k+1)-\mathbf{1}\bar{x}(k+1)\right\|^2\le\rho_w^2\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)-\alpha_k\left(g(k)-\mathbf{1}\bar{g}(k)\right)\right\|^2=\rho_w^2\left[\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2+\alpha_k^2\left\|g(k)-\mathbf{1}\bar{g}(k)\right\|^2-2\alpha_k\left\langle\mathbf{x}(k)-\mathbf{1}\bar{x}(k),g(k)-\mathbf{1}\bar{g}(k)\right\rangle\right].$$

Since $\mathbb{E}\left[g(k)\mid\mathcal{F}(k)\right]=\nabla F(\mathbf{x}(k))$ and $\mathbb{E}\left[\bar{g}(k)\mid\mathcal{F}(k)\right]=\overline{\nabla F}(\mathbf{x}(k))$,

$$\mathbb{E}\left[\left\langle\mathbf{x}(k)-\mathbf{1}\bar{x}(k),g(k)-\mathbf{1}\bar{g}(k)\right\rangle\mid\mathcal{F}(k)\right]=\left\langle\mathbf{x}(k)-\mathbf{1}\bar{x}(k),\nabla F(\mathbf{x}(k))-\mathbf{1}\overline{\nabla F}(\mathbf{x}(k))\right\rangle,$$

and

$$\mathbb{E}\left[\left\|g(k)-\mathbf{1}\bar{g}(k)\right\|^2\mid\mathcal{F}(k)\right]=\mathbb{E}\left[\left\|\nabla F(\mathbf{x}(k))-\mathbf{1}\overline{\nabla F}(\mathbf{x}(k))+\left(g(k)-\nabla F(\mathbf{x}(k))\right)-\mathbf{1}\left(\bar{g}(k)-\overline{\nabla F}(\mathbf{x}(k))\right)\right\|^2\mid\mathcal{F}(k)\right]=\left\|\nabla F(\mathbf{x}(k))-\mathbf{1}\overline{\nabla F}(\mathbf{x}(k))\right\|^2+\mathbb{E}\left[\left\|\left(g(k)-\nabla F(\mathbf{x}(k))\right)-\mathbf{1}\left(\bar{g}(k)-\overline{\nabla F}(\mathbf{x}(k))\right)\right\|^2\mid\mathcal{F}(k)\right]\le\left\|\nabla F(\mathbf{x}(k))-\mathbf{1}\overline{\nabla F}(\mathbf{x}(k))\right\|^2+\mathbb{E}\left[\left\|g(k)-\nabla F(\mathbf{x}(k))\right\|^2\mid\mathcal{F}(k)\right]\le\left\|\nabla F(\mathbf{x}(k))-\mathbf{1}\overline{\nabla F}(\mathbf{x}(k))\right\|^2+n\sigma^2+M\left\|\nabla F(\mathbf{x}(k))\right\|^2,$$

where the last inequality follows from Assumption 1. Therefore (assuming $\rho_w>0$),

$$\frac{1}{\rho_w^2}\mathbb{E}\left[\left\|\mathbf{x}(k+1)-\mathbf{1}\bar{x}(k+1)\right\|^2\mid\mathcal{F}(k)\right]\le\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2+\alpha_k^2\left\|\nabla F(\mathbf{x}(k))-\mathbf{1}\overline{\nabla F}(\mathbf{x}(k))\right\|^2+\alpha_k^2\left(n\sigma^2+M\left\|\nabla F(\mathbf{x}(k))\right\|^2\right)-2\alpha_k\left\langle\mathbf{x}(k)-\mathbf{1}\bar{x}(k),\nabla F(\mathbf{x}(k))-\mathbf{1}\overline{\nabla F}(\mathbf{x}(k))\right\rangle\le\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2+\alpha_k^2(1+M)\left\|\nabla F(\mathbf{x}(k))\right\|^2+\alpha_k^2n\sigma^2+2\alpha_k\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|\left\|\nabla F(\mathbf{x}(k))\right\|\le(1+c)\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2+\alpha_k^2(1+M)\left\|\nabla F(\mathbf{x}(k))\right\|^2+\alpha_k^2n\sigma^2+\frac{\alpha_k^2}{c}\left\|\nabla F(\mathbf{x}(k))\right\|^2,$$

where $c>0$ is arbitrary. Letting $c=\frac{1-\rho_w^2}{2}$ and noting that, by Assumption 3,

$$\left\|\nabla F(\mathbf{x}(k))\right\|^2\le2\left\|\nabla F(\mathbf{x}(k))-\nabla F\left(\mathbf{1}x^*\right)\right\|^2+2\left\|\nabla F\left(\mathbf{1}x^*\right)\right\|^2\le2L^2\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+2\left\|\nabla F\left(\mathbf{1}x^*\right)\right\|^2,$$

we have

$$\frac{1}{\rho_w^2}\mathbb{E}\left[\left\|\mathbf{x}(k+1)-\mathbf{1}\bar{x}(k+1)\right\|^2\mid\mathcal{F}(k)\right]\le\alpha_k^2n\sigma^2+\frac{3-\rho_w^2}{2}\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2+2\alpha_k^2\left(\frac{3}{1-\rho_w^2}+M\right)\left(L^2\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\left\|\nabla F\left(\mathbf{1}x^*\right)\right\|^2\right).$$

Notice that $\rho_w^2\frac{3-\rho_w^2}{2}\le\frac{3+\rho_w^2}{4}$. In light of Lemma 8, taking full expectation on both sides of the above inequality leads to

$$\mathbb{E}\left\|\mathbf{x}(k+1)-\mathbf{1}\bar{x}(k+1)\right\|^2\le\alpha_k^2\rho_w^2n\sigma^2+\frac{3+\rho_w^2}{4}\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2+2\alpha_k^2\rho_w^2\left(\frac{3}{1-\rho_w^2}+M\right)\left(L^2\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\left\|\nabla F\left(\mathbf{1}x^*\right)\right\|^2\right).$$

III. Analysis

We are now ready to derive some preliminary convergence results for Algorithm (4). First, we provide a uniform bound on the iterates generated by Algorithm (4) (in expectation) for all $k\ge0$. Then, based on the lemmas established in Section II-A, we prove the sublinear convergence rates for Algorithm (4), i.e., $U(k)=\mathcal{O}\left(\frac{1}{k}\right)$ and $V(k)=\mathcal{O}\left(\frac{1}{k^2}\right)$. These results provide the foundation for our main convergence theorems in Section IV.

From now on we consider the following stepsize policy:

$$\alpha_k:=\frac{\theta}{\mu(k+K)},\qquad\forall k,\qquad(12)$$

where constant θ > 1, and

$$K:=\left\lceil\frac{2\theta(1+M)L^2}{\mu^2}\right\rceil,\qquad(13)$$

with $\lceil\cdot\rceil$ denoting the ceiling function.
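The stepsize policy (12)–(13) is straightforward to implement. The sketch below uses hypothetical problem constants (the values of $\theta$, $\mu$, $L$, $M$ are illustrative only) and checks the bound $\alpha_k\le\frac{\mu}{2(1+M)L^2}$, which the choice of $K$ guarantees for all $k\ge0$ and which is used in the proof of Lemma 8:

```python
import math

# hypothetical constants: theta > 1, strong convexity mu, smoothness L, noise M
theta, mu, L, M = 2.0, 1.0, 2.0, 0.5

# K per (13): ceiling of 2*theta*(1+M)*L^2 / mu^2
K = math.ceil(2 * theta * (1 + M) * L**2 / mu**2)

def alpha(k):
    # stepsize policy (12)
    return theta / (mu * (k + K))
```

Since $\alpha_0=\frac{\theta}{\mu K}$ is the largest stepsize and $K\ge\frac{2\theta(1+M)L^2}{\mu^2}$, the whole sequence stays below $\frac{\mu}{2(1+M)L^2}$ while still decaying as $\Theta(1/k)$.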

A. Uniform Bound

We first derive a uniform bound on the iterates generated by Algorithm (4) (in expectation) for all k ≥ 0. Such a result is helpful for bounding the error terms on the right hand sides of (7), (10) and (11).

Lemma 8. Suppose Assumptions 1–3 hold. Under Algorithm (4) with stepsize policy (12), for all k ≥ 0, we have

$$\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2\le\hat{X}:=\max\left\{\left\|\mathbf{x}(0)-\mathbf{1}x^*\right\|^2,\ \frac{9\left\|\nabla F\left(\mathbf{1}x^*\right)\right\|^2}{\mu^2}+\frac{n\sigma^2}{(1+M)L^2}\right\}.\qquad(14)$$

Proof. The following arguments are inspired by those in [35].

First, we bound $\mathbb{E}\left\|x_i(k)-\alpha_kg_i(k)\right\|^2$ for all $i\in\mathcal{N}$ and $k\ge0$. By Assumption 1,

$$\mathbb{E}\left[\left\|x_i(k)-\alpha_kg_i(k)\right\|^2\mid\mathcal{F}(k)\right]=\left\|x_i(k)-\alpha_k\nabla f_i\left(x_i(k)\right)\right\|^2+\alpha_k^2\mathbb{E}\left[\left\|\nabla f_i\left(x_i(k)\right)-g_i(k)\right\|^2\mid\mathcal{F}(k)\right]\le\left\|x_i(k)\right\|^2-2\alpha_k\left\langle\nabla f_i\left(x_i(k)\right),x_i(k)\right\rangle+\alpha_k^2\left\|\nabla f_i\left(x_i(k)\right)\right\|^2+\alpha_k^2\left(\sigma^2+M\left\|\nabla f_i\left(x_i(k)\right)\right\|^2\right).$$

From the strong convexity and Lipschitz continuity of $\nabla f_i$, we know that

$$\left\langle\nabla f_i\left(x_i(k)\right),x_i(k)\right\rangle=\left\langle\nabla f_i\left(x_i(k)\right)-\nabla f_i(0),x_i(k)-0\right\rangle+\left\langle\nabla f_i(0),x_i(k)\right\rangle\ge\mu\left\|x_i(k)\right\|^2+\left\langle\nabla f_i(0),x_i(k)\right\rangle,$$

and

$$\left\|\nabla f_i\left(x_i(k)\right)\right\|^2=\left\|\nabla f_i\left(x_i(k)\right)-\nabla f_i(0)+\nabla f_i(0)\right\|^2\le2L^2\left\|x_i(k)\right\|^2+2\left\|\nabla f_i(0)\right\|^2.$$

Hence,

$$\mathbb{E}\left[\left\|x_i(k)-\alpha_kg_i(k)\right\|^2\mid\mathcal{F}(k)\right]\le\left\|x_i(k)\right\|^2-2\alpha_k\left(\mu\left\|x_i(k)\right\|^2+\left\langle\nabla f_i(0),x_i(k)\right\rangle\right)+2\alpha_k^2(1+M)\left(L^2\left\|x_i(k)\right\|^2+\left\|\nabla f_i(0)\right\|^2\right)+\alpha_k^2\sigma^2\le\left\|x_i(k)\right\|^2-2\alpha_k\mu\left\|x_i(k)\right\|^2+2\alpha_k\left\|\nabla f_i(0)\right\|\left\|x_i(k)\right\|+2\alpha_k^2(1+M)\left(L^2\left\|x_i(k)\right\|^2+\left\|\nabla f_i(0)\right\|^2\right)+\alpha_k^2\sigma^2\le\left(1-2\alpha_k\mu+2\alpha_k^2(1+M)L^2\right)\left\|x_i(k)\right\|^2+2\alpha_k\left\|\nabla f_i(0)\right\|\left\|x_i(k)\right\|+\alpha_k^2\left(2(1+M)\left\|\nabla f_i(0)\right\|^2+\sigma^2\right).$$

Taking full expectation on both sides, it follows that

$$\mathbb{E}\left\|x_i(k)-\alpha_kg_i(k)\right\|^2\le\left(1-2\alpha_k\mu+2\alpha_k^2(1+M)L^2\right)\mathbb{E}\left\|x_i(k)\right\|^2+2\alpha_k\left\|\nabla f_i(0)\right\|\sqrt{\mathbb{E}\left\|x_i(k)\right\|^2}+\alpha_k^2\left(2(1+M)\left\|\nabla f_i(0)\right\|^2+\sigma^2\right).$$

From the definition of $K$ in (13), $\alpha_k\le\frac{\mu}{2(1+M)L^2}$ for all $k\ge0$. Hence,

$$\mathbb{E}\left\|x_i(k)-\alpha_kg_i(k)\right\|^2\le\left(1-\alpha_k\mu\right)\mathbb{E}\left\|x_i(k)\right\|^2+2\alpha_k\left\|\nabla f_i(0)\right\|\sqrt{\mathbb{E}\left\|x_i(k)\right\|^2}+\alpha_k^2\left(2(1+M)\left\|\nabla f_i(0)\right\|^2+\sigma^2\right)\le\mathbb{E}\left\|x_i(k)\right\|^2-\alpha_k\left[\mu\mathbb{E}\left\|x_i(k)\right\|^2-2\left\|\nabla f_i(0)\right\|\sqrt{\mathbb{E}\left\|x_i(k)\right\|^2}-\frac{\mu}{2L^2}\left(2\left\|\nabla f_i(0)\right\|^2+\frac{\sigma^2}{1+M}\right)\right].\qquad(15)$$

Let us define the following set:

$$\mathcal{X}_i:=\left\{q\ge0:\mu q-2\left\|\nabla f_i(0)\right\|\sqrt{q}-\frac{\mu}{2L^2}\left(2\left\|\nabla f_i(0)\right\|^2+\frac{\sigma^2}{1+M}\right)\le0\right\},\qquad(16)$$

which is non-empty and compact. If $\mathbb{E}\left\|x_i(k)\right\|^2\notin\mathcal{X}_i$, we know from inequality (15) that $\mathbb{E}\left\|x_i(k)-\alpha_kg_i(k)\right\|^2\le\mathbb{E}\left\|x_i(k)\right\|^2$. Otherwise,

$$\mathbb{E}\left\|x_i(k)-\alpha_kg_i(k)\right\|^2\le\max_{q\in\mathcal{X}_i}\left\{q-\frac{\mu}{2(1+M)L^2}\left[\mu q-2\left\|\nabla f_i(0)\right\|\sqrt{q}-\frac{\mu}{2L^2}\left(2\left\|\nabla f_i(0)\right\|^2+\frac{\sigma^2}{1+M}\right)\right]\right\}=\max_{q\in\mathcal{X}_i}\left\{\left(1-\frac{\mu^2}{2(1+M)L^2}\right)q+\frac{\mu}{(1+M)L^2}\left\|\nabla f_i(0)\right\|\sqrt{q}+\frac{\mu^2}{4(1+M)L^4}\left(2\left\|\nabla f_i(0)\right\|^2+\frac{\sigma^2}{1+M}\right)\right\}.$$

Define the last term above as $R_i$. The previous arguments imply that for all $k\ge0$,

$$\mathbb{E}\left\|x_i(k)-\alpha_kg_i(k)\right\|^2\le\max\left\{\mathbb{E}\left\|x_i(k)\right\|^2,R_i\right\}.$$

Note that from relation (4),

$$\left\|\mathbf{x}(k+1)\right\|^2\le\left\|W\right\|_2^2\left\|\mathbf{x}(k)-\alpha_kg(k)\right\|^2\le\left\|\mathbf{x}(k)-\alpha_kg(k)\right\|^2.$$

We have

$$\mathbb{E}\left\|\mathbf{x}(k)\right\|^2\le\max\left\{\left\|\mathbf{x}(0)\right\|^2,\ \sum_{i=1}^{n}R_i\right\}.\qquad(17)$$

We further bound $R_i$ as follows. From the definition of $\mathcal{X}_i$,

$$\max_{q\in\mathcal{X}_i}q\le\frac{8\left\|\nabla f_i(0)\right\|^2}{\mu^2}+\frac{3\sigma^2}{4(1+M)L^2}.$$

Hence,

$$R_i=\max_{q\in\mathcal{X}_i}\left\{q-\frac{\mu}{2(1+M)L^2}\left[\mu q-2\left\|\nabla f_i(0)\right\|\sqrt{q}-\frac{\mu}{2L^2}\left(2\left\|\nabla f_i(0)\right\|^2+\frac{\sigma^2}{1+M}\right)\right]\right\}\le\max_{q\in\mathcal{X}_i}q-\frac{\mu}{2(1+M)L^2}\min_{q\in\mathcal{X}_i}\left[\mu q-2\left\|\nabla f_i(0)\right\|\sqrt{q}-\frac{\mu}{2L^2}\left(2\left\|\nabla f_i(0)\right\|^2+\frac{\sigma^2}{1+M}\right)\right]\le\frac{8\left\|\nabla f_i(0)\right\|^2}{\mu^2}+\frac{3\sigma^2}{4(1+M)L^2}+\frac{\mu}{2(1+M)L^2}\left[\frac{\left\|\nabla f_i(0)\right\|^2}{\mu}+\frac{\mu}{2L^2}\left(2\left\|\nabla f_i(0)\right\|^2+\frac{\sigma^2}{1+M}\right)\right]\le\frac{9\left\|\nabla f_i(0)\right\|^2}{\mu^2}+\frac{\sigma^2}{(1+M)L^2}.\qquad(18)$$

In light of inequality (18), further noticing that the choice of the reference point $0$ is arbitrary in the proof of (17), we obtain the uniform bound on $\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2$ in (14) by replacing $0$ with $x^*$. □

The uniform bound provided in Lemma 8 is critical for deriving the sublinear convergence rates of U (k) and V (k), as it holds for all k ≥ 0.

B. Sublinear Rate

With the help of Lemma 6 and Lemma 7 from Section II-A and Lemma 8, we show in Lemma 10 and Lemma 12 below that Algorithm (4) enjoys sublinear convergence rates, i.e., $U(k)=\mathcal{O}\left(\frac{1}{k}\right)$ and $V(k)=\mathcal{O}\left(\frac{1}{k^2}\right)$. For ease of analysis, we define two auxiliary variables:

$$\tilde{U}(k):=U(k-K),\qquad\tilde{V}(k):=V(k-K),\qquad\forall k\ge K.\qquad(19)$$

We first derive uniform upper bounds for $U(k)$ and $V(k)$, respectively, for all $k\ge0$ based on Lemma 8. With these bounds, we are able to characterize the constants appearing in the sublinear convergence rates for $V(k)$ and $U(k)$ in Lemma 10 and Lemma 12, respectively.

Lemma 9. Suppose Assumptions 1–3 hold. Under Algorithm (4), we have

$$U(k)\le\frac{\hat{X}}{n},\qquad V(k)\le\hat{X},\qquad\forall k\ge0.\qquad(20)$$

Proof. By definitions of U (k), V (k), and Lemma 8, we have

$$U(k)=\mathbb{E}\left\|\bar{x}(k)-x^*\right\|^2\le\frac{1}{n}\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2\le\frac{\hat{X}}{n},\qquad V(k)=\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2\le\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2\le\hat{X}.$$

Denote an auxiliary counter

$$\tilde{k}:=k+K,\qquad\forall k\ge0.\qquad(21)$$

Our strategy is to first show that the consensus error of Algorithm (4) decays as $V(k)=\mathcal{O}\left(\frac{1}{k^2}\right)$ based on Lemma 7, since $U(k)$ does not appear explicitly in relation (11).

Lemma 10. Suppose Assumptions 1–3 hold. Let

$$K_1:=\max\left\{2K,\ \frac{16}{1-\rho_w^2}\right\}.\qquad(22)$$

Under Algorithm (4) with stepsize (12), for all $k\ge K_1-K$, we have

$$V(k)\le\frac{\hat{V}}{\tilde{k}^2},$$

where

$$\hat{V}:=\max\left\{K_1^2\hat{X},\ \frac{8\theta^2\rho_w^2c_1}{\mu^2\left(1-\rho_w^2\right)}\right\},\qquad(23)$$

with

$$c_1:=2\left(\frac{3}{1-\rho_w^2}+M\right)\left(L^2\hat{X}+\left\|\nabla F\left(\mathbf{1}x^*\right)\right\|^2\right)+n\sigma^2.\qquad(24)$$

Proof. From Lemma 7 and Lemma 8, for k ≥ 0,

$$V(k+1)\le\frac{3+\rho_w^2}{4}V(k)+\alpha_k^2\rho_w^2c_1,\qquad(25)$$

with $c_1$ defined in (24). From the definitions of $\alpha_k$ and $\tilde{V}(k)$ in (12) and (19), respectively, we know that when $k\ge K$,

$$\tilde{V}(k+1)\le\frac{3+\rho_w^2}{4}\tilde{V}(k)+\frac{\theta^2\rho_w^2c_1}{\mu^2}\frac{1}{k^2}.$$

We now prove the lemma by induction. For $k=K_1$, we know from Lemma 9 that $\tilde{V}(K_1)=\frac{K_1^2\tilde{V}(K_1)}{K_1^2}\le\frac{K_1^2\hat{X}}{K_1^2}\le\frac{\hat{V}}{K_1^2}$. Now suppose $\tilde{V}(k)\le\frac{\hat{V}}{k^2}$ for some $k\ge K_1$; then

$$\tilde{V}(k+1)\le\frac{3+\rho_w^2}{4}\frac{\hat{V}}{k^2}+\frac{\theta^2\rho_w^2c_1}{\mu^2}\frac{1}{k^2}.$$

To show that $\tilde{V}(k+1)\le\frac{\hat{V}}{(k+1)^2}$, it is sufficient to show

$$\frac{3+\rho_w^2}{4}\frac{\hat{V}}{k^2}+\frac{\theta^2\rho_w^2c_1}{\mu^2}\frac{1}{k^2}\le\frac{\hat{V}}{(k+1)^2},$$

or equivalently,

$$\hat{V}\ge\frac{\theta^2\rho_w^2c_1}{\mu^2}\left[\left(\frac{k}{k+1}\right)^2-\frac{3+\rho_w^2}{4}\right]^{-1}.\qquad(26)$$

Since $k\ge K_1\ge\frac{16}{1-\rho_w^2}$, we have

$$\left(\frac{k}{k+1}\right)^2-\frac{3+\rho_w^2}{4}=\frac{1-\rho_w^2}{4}-\frac{2k+1}{(k+1)^2}\ge\frac{1-\rho_w^2}{8}.$$

Hence relation (26) is satisfied with $\hat{V}\ge\frac{8\theta^2\rho_w^2c_1}{\mu^2\left(1-\rho_w^2\right)}$. We then have, for all $k\ge K_1$,

$$\tilde{V}(k)\le\frac{1}{k^2}\max\left\{K_1^2\hat{X},\ \frac{8\theta^2\rho_w^2c_1}{\mu^2\left(1-\rho_w^2\right)}\right\}.$$

Recalling the connection between $\tilde{V}(k)$ and $V(k)$, we conclude that $V(k)\le\frac{\hat{V}}{\tilde{k}^2}$ for all $k\ge K_1-K$. □

To prove the sublinear convergence of U (k), we start with a useful lemma which provides lower and upper bounds for the product of a decreasing sequence. Such products arise in the convergence proof for U (k) and our main convergence results in Section IV.

Lemma 11. For any $1<a\le k$ and $1<\gamma\le a/2$,

$$\frac{a^{2\gamma}}{k^{2\gamma}}\le\prod_{t=a}^{k-1}\left(1-\frac{\gamma}{t}\right)\le\frac{a^{\gamma}}{k^{\gamma}}.$$

Proof. Denote $G(k):=\prod_{t=a}^{k-1}\left(1-\frac{\gamma}{t}\right)$. We first show that $G(k)\le\frac{a^{\gamma}}{k^{\gamma}}$. Suppose $G(k)\le\frac{M_1}{k^{\gamma}}$ for some $M_1>0$ and $k\ge a$. Then,

$$G(k+1)=\left(1-\frac{\gamma}{k}\right)G(k)\le\left(1-\frac{\gamma}{k}\right)\frac{M_1}{k^{\gamma}}\le\frac{M_1}{(k+1)^{\gamma}}.$$

To see why the last inequality holds, note that $\left(\frac{k}{k+1}\right)^{\gamma}\ge1-\frac{\gamma}{k}$. Taking $M_1=a^{\gamma}$, we have $G(a)=1=\frac{M_1}{a^{\gamma}}$. The desired relation then holds for all $k>a$.

Now suppose $G(k)\ge\frac{M_2}{k^{2\gamma}}$ for some $M_2>0$ and $k\ge a$. It follows that

$$G(k+1)=\left(1-\frac{\gamma}{k}\right)G(k)\ge\left(1-\frac{\gamma}{k}\right)\frac{M_2}{k^{2\gamma}}\ge\frac{M_2}{(k+1)^{2\gamma}},$$

where the last inequality follows from $\left(\frac{k}{k+1}\right)^{2\gamma}\le1-\frac{\gamma}{k}$ (noting that $\gamma\le a/2\le k/2$). Taking $M_2=a^{2\gamma}$, we have $G(a)=1=\frac{M_2}{a^{2\gamma}}$. The desired relation then holds for all $k>a$. □

In light of Lemma 6 and the other supporting lemmas, we establish the $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate of $U(k)$ in the following lemma.

Lemma 12. Suppose Assumptions 1–3 hold. Under Algorithm (4) with stepsize (12), suppose $\theta>2$. We have

$$U(k)\le\frac{\theta^2c_2}{(1.5\theta-1)n\mu^2\tilde{k}}+\frac{K_1^{1.5\theta}}{\tilde{k}^{1.5\theta}}\frac{\hat{X}}{n}+\left[\frac{3\theta^2(1.5\theta-1)c_2}{(1.5\theta-2)n\mu^2}+\frac{6\theta L^2\hat{V}}{(1.5\theta-2)n\mu^2}\right]\frac{1}{\tilde{k}^2},$$

for all k ≥ K1 − K, where

c2:=2ML2nX^+M¯. (27)

Proof. In light of Lemma 6 and Lemma 8, for all k ≥ 0, we have

$$U(k+1)\le\left(1-\frac{3}{2}\alpha_k\mu\right)U(k)+\frac{3\alpha_kL^2}{n\mu}V(k)+\frac{\alpha_k^2c_2}{n}.$$

Recalling the definitions of $\tilde{U}(k)$ and $\tilde{V}(k)$, for all $k\ge K$,

$$\tilde{U}(k+1)\le\left(1-\frac{3\theta}{2k}\right)\tilde{U}(k)+\frac{3\theta L^2}{n\mu^2}\frac{\tilde{V}(k)}{k}+\frac{\theta^2c_2}{n\mu^2}\frac{1}{k^2}.$$

Therefore,

$$\tilde{U}(k)\le\prod_{t=K_1}^{k-1}\left(1-\frac{3\theta}{2t}\right)\tilde{U}\left(K_1\right)+\sum_{t=K_1}^{k-1}\prod_{j=t+1}^{k-1}\left(1-\frac{3\theta}{2j}\right)\left[\frac{\theta^2c_2}{n\mu^2}\frac{1}{t^2}+\frac{3\theta L^2}{n\mu^2}\frac{\tilde{V}(t)}{t}\right].$$

From Lemma 11,

$$\tilde{U}(k)\le\frac{K_1^{1.5\theta}}{k^{1.5\theta}}\tilde{U}\left(K_1\right)+\sum_{t=K_1}^{k-1}\frac{(t+1)^{1.5\theta}}{k^{1.5\theta}}\left[\frac{\theta^2c_2}{n\mu^2t^2}+\frac{3\theta L^2}{n\mu^2}\frac{\tilde{V}(t)}{t}\right]=\frac{1}{k^{1.5\theta}}\frac{\theta^2c_2}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{1.5\theta}}{t^2}+\frac{K_1^{1.5\theta}}{k^{1.5\theta}}\tilde{U}\left(K_1\right)+\sum_{t=K_1}^{k-1}\frac{(t+1)^{1.5\theta}}{k^{1.5\theta}}\frac{3\theta L^2}{n\mu^2}\frac{\tilde{V}(t)}{t}.$$

In light of Lemma 10, when $k\ge K_1$, $\tilde{V}(k)\le\frac{\hat{V}}{k^2}$. Hence,

$$\tilde{U}(k)\le\frac{1}{k^{1.5\theta}}\frac{\theta^2c_2}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{1.5\theta}}{t^2}+\frac{K_1^{1.5\theta}}{k^{1.5\theta}}\tilde{U}\left(K_1\right)+\sum_{t=K_1}^{k-1}\frac{(t+1)^{1.5\theta}}{k^{1.5\theta}}\frac{3\theta L^2}{n\mu^2}\frac{\hat{V}}{t^3},\qquad\text{where}\quad\sum_{t=K_1}^{k-1}\frac{(t+1)^{1.5\theta}}{k^{1.5\theta}}\frac{3\theta L^2}{n\mu^2}\frac{\hat{V}}{t^3}=\frac{1}{k^{1.5\theta}}\frac{3\theta L^2\hat{V}}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{1.5\theta}}{t^3}.$$

Moreover, we have, for any $b>a\ge K_1$,

$$\sum_{t=a}^{b}\frac{(t+1)^{1.5\theta}}{t^2}\le\sum_{t=a}^{b-2}\left[\frac{(t+1)^{1.5\theta}}{(t+1)^2}+\frac{3(t+1)^{1.5\theta}}{(t+1)^3}\right]+\frac{b^{1.5\theta}}{(b-1)^2}+\frac{(b+1)^{1.5\theta}}{b^2}\le\int_a^b\left(t^{1.5\theta-2}+3t^{1.5\theta-3}\right)\mathrm{d}t+\frac{2(b+1)^{1.5\theta}}{b^2}\le\frac{b^{1.5\theta-1}}{1.5\theta-1}+\frac{3b^{1.5\theta-2}}{1.5\theta-2}+3b^{1.5\theta-2},$$

where the last inequality comes from the fact that $\left(\frac{b+1}{b}\right)^{1.5\theta}\le\left(\frac{4\theta+1}{4\theta}\right)^{1.5\theta}\le\exp\left(\frac{3}{8}\right)<\frac{3}{2}$ (given that $b>K_1\ge4\theta$), and

$$\sum_{t=a}^{b}\frac{(t+1)^{1.5\theta}}{t^3}\le\frac{3}{2}\sum_{t=a}^{b}t^{1.5\theta-3}\le\frac{3}{2}\int_a^{b+1}t^{1.5\theta-3}\mathrm{d}t\le\frac{3}{2}\frac{(b+1)^{1.5\theta-2}}{1.5\theta-2}\le\frac{2b^{1.5\theta-2}}{1.5\theta-2}.$$

Hence, for all k ≥ K1,

$$\tilde{U}(k)\le\frac{\theta^2c_2}{(1.5\theta-1)n\mu^2k}+\frac{3\theta^2(1.5\theta-1)c_2}{(1.5\theta-2)n\mu^2}\frac{1}{k^2}+\frac{K_1^{1.5\theta}}{k^{1.5\theta}}\tilde{U}\left(K_1\right)+\frac{6\theta L^2\hat{V}}{(1.5\theta-2)n\mu^2}\frac{1}{k^2}.$$

Recalling Lemma 9 and the definition of U˜(k) yields the desired result. □

Remark 1. Notice that the convergence rate established in Lemma 12 is not asymptotically the same as centralized stochastic gradient descent, since the constant c2 contains information about the initial solutions. In the next section, we will improve the convergence result and show that DSGD indeed performs as well as centralized SGD asymptotically.

IV. Main Results

In this section, we perform a non-asymptotic analysis of network independence for Algorithm (4). Specifically, in Theorem 1, we show that

$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|x_i(k)-x^*\right\|^2=\frac{\theta^2\bar{M}}{(2\theta-1)n\mu^2\tilde{k}}+\mathcal{O}\left(\frac{1}{\sqrt{n}\left(1-\rho_w\right)}\right)\frac{1}{\tilde{k}^{1.5}}+\mathcal{O}\left(\frac{1}{\left(1-\rho_w\right)^2}\right)\frac{1}{\tilde{k}^2},$$

where the first term is network independent and the second and third (higher-order) terms depend on the spectral gap $1-\rho_w$. Then we compare the result with centralized stochastic gradient descent and show that asymptotically, the two methods have the same convergence rate $\frac{\theta^2\bar{M}}{(2\theta-1)n\mu^2\tilde{k}}$. In addition, it takes $K_T=\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$ time for Algorithm (4) to reach this asymptotic rate of convergence. Finally, we construct a "hard" optimization problem for which we show the transient time $K_T$ is sharp.

Our first step is to simplify the presentation of the convergence results in Lemma 10 and Lemma 12, so that we can utilize them for deriving improved convergence rates conveniently. For this purpose, we first estimate the constants $\hat{X}$, $\hat{V}$, $c_1$ and $c_2$ appearing in the two lemmas and derive their dependency on the network size $n$, the spectral gap $1-\rho_w$, the summation of initial optimization errors $\sum_{i=1}^{n}\left\|x_i(0)-x^*\right\|^2$, and $\sum_{i=1}^{n}\left\|\nabla f_i\left(x^*\right)\right\|^2$, where the last term can be seen as a measure of the difference among the agents' individual cost functions.

Lemma 13. Denote $A:=\sum_{i=1}^{n}\left\|x_i(0)-x^*\right\|^2$ and $B:=\sum_{i=1}^{n}\left\|\nabla f_i\left(x^*\right)\right\|^2$. Then,

$$\hat{X}=\mathcal{O}(A+B+n),\qquad c_1=\mathcal{O}\left(\frac{A+B+n}{1-\rho_w}\right),\qquad\hat{V}=\mathcal{O}\left(\frac{A+B+n}{\left(1-\rho_w\right)^2}\right),\qquad c_2=\mathcal{O}\left(\frac{A+B+n}{n}\right).$$

Proof. We first estimate the constant $\hat{X}$, which appears in the definition (23) of $\hat{V}$. From Lemma 8,

$$\hat{X}\le\left\|\mathbf{x}(0)-\mathbf{1}x^*\right\|^2+\frac{9\left\|\nabla F\left(\mathbf{1}x^*\right)\right\|^2}{\mu^2}+\frac{n\sigma^2}{(1+M)L^2}=\mathcal{O}(A+B+n).$$

From the definition of $c_1$ in (24),

$$c_1=2\left(\frac{3}{1-\rho_w^2}+M\right)\left(L^2\hat{X}+\left\|\nabla F\left(\mathbf{1}x^*\right)\right\|^2\right)+n\sigma^2=\mathcal{O}\left(\frac{A+B+n}{1-\rho_w}\right).$$

Noting that $K_1=\mathcal{O}\left(\frac{1}{1-\rho_w}\right)$, by definition,

$$\hat{V}=\max\left\{K_1^2\hat{X},\ \frac{8\theta^2\rho_w^2c_1}{\mu^2\left(1-\rho_w^2\right)}\right\}=\mathcal{O}\left(\frac{A+B+n}{\left(1-\rho_w\right)^2}\right).$$

From the definition of c2 in (27), we have

$$c_2=\frac{2ML^2}{n}\hat{X}+\bar{M}=\mathcal{O}\left(\frac{A+B+n}{n}\right).$$

In light of Lemma 13, the convergence result of V (k) given in Lemma 10 can be easily simplified since V^ is the only constant. Regarding the optimization error U (k), in light of Lemma 8, Lemma 10, Lemma 12 and Lemma 13, we have the following corollary which simplifies the presentation of the convergence result in Lemma 12.

Corollary 1. Suppose Assumptions 1–3 hold. Under Algorithm (4) with stepsize (12) and assuming $\theta>2$, when $k\ge K_1-K$,

$$U(k)\le\frac{\theta^2c_2}{(1.5\theta-1)n\mu^2\tilde{k}}+\frac{c}{\tilde{k}^2},$$

where

$$c=\mathcal{O}\left(\frac{A+B+n}{n\left(1-\rho_w\right)^2}\right).$$

Proof. From Lemma 12 and Lemma 13, when $k\ge K_1-K=\mathcal{O}\left(\frac{1}{1-\rho_w}\right)$,

$$U(k)\le\frac{\theta^2c_2}{(1.5\theta-1)n\mu^2\tilde{k}}+\frac{K_1^{1.5\theta-2}}{\tilde{k}^{1.5\theta-2}}\frac{\hat{X}}{n}\frac{K_1^2}{\tilde{k}^2}+\left[\frac{3\theta^2(1.5\theta-1)c_2}{(1.5\theta-2)n\mu^2}+\frac{6\theta L^2\hat{V}}{(1.5\theta-2)n\mu^2}\right]\frac{1}{\tilde{k}^2}=\frac{\theta^2c_2}{(1.5\theta-1)n\mu^2}\frac{1}{\tilde{k}}+\mathcal{O}\left(\frac{A+B+n}{n\left(1-\rho_w\right)^2}\right)\frac{1}{\tilde{k}^2}.\qquad(28)$$

Let $\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|x_i(k)-x^*\right\|^2$ denote the average expected optimization error over the agents, which measures the performance of DSGD. In the following theorem, we improve the result of Corollary 1 with further analysis and derive the main convergence result for Algorithm (4).

Theorem 1. Suppose Assumptions 1–3 hold. Under Algorithm (4) with stepsize (12) and assuming $\theta>2$, when $k\ge K_1-K$,

$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|x_i(k)-x^*\right\|^2\le\frac{\theta^2\bar{M}}{(2\theta-1)n\mu^2\tilde{k}}+\mathcal{O}\left(\frac{A+B+n}{n^{1.5}\left(1-\rho_w\right)}\right)\frac{1}{\tilde{k}^{1.5}}+\mathcal{O}\left(\frac{A+B+n}{n\left(1-\rho_w\right)^2}\right)\frac{1}{\tilde{k}^2}.\qquad(29)$$

Proof. For $k\ge K_1-K$, in light of Lemma 2 and Lemma 5,

$$U(k+1)\le\left(1-\alpha_k\mu\right)^2U(k)+\frac{2\alpha_kL}{\sqrt{n}}\mathbb{E}\left[\left\|\bar{x}(k)-x^*\right\|\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|\right]+\frac{\alpha_k^2L^2}{n}V(k)+\alpha_k^2\left(\frac{2ML^2}{n^2}\mathbb{E}\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2+\frac{\bar{M}}{n}\right)\le\left(1-\alpha_k\mu\right)^2U(k)+\frac{2\alpha_kL}{\sqrt{n}}\sqrt{U(k)V(k)}+\frac{\alpha_k^2L^2}{n}V(k)+\alpha_k^2\left(\frac{2ML^2}{n^2}\left(nU(k)+V(k)\right)+\frac{\bar{M}}{n}\right)=\left(1-2\alpha_k\mu\right)U(k)+\alpha_k^2\left(\mu^2+\frac{2ML^2}{n}\right)U(k)+\frac{2\alpha_kL}{\sqrt{n}}\sqrt{U(k)V(k)}+\frac{\alpha_k^2L^2}{n}\left(1+\frac{2M}{n}\right)V(k)+\frac{\alpha_k^2\bar{M}}{n},$$

where the second inequality follows from the Cauchy–Schwarz inequality (see [52], page 62) and the fact that $\left\|\mathbf{x}(k)-\mathbf{1}x^*\right\|^2=\left\|\mathbf{x}(k)-\mathbf{1}\bar{x}(k)\right\|^2+n\left\|\bar{x}(k)-x^*\right\|^2$.

Recalling the definitions of $\tilde{U}(k)$ and $\tilde{V}(k)$, when $k\ge K_1$,

$$\tilde{U}(k+1)\le\left(1-\frac{2\theta}{k}\right)\tilde{U}(k)+\frac{\theta^2}{k^2}\left(1+\frac{2ML^2}{n\mu^2}\right)\tilde{U}(k)+\frac{2\theta L}{\sqrt{n}\mu}\frac{\sqrt{\tilde{U}(k)\tilde{V}(k)}}{k}+\frac{\theta^2L^2}{n\mu^2}\left(1+\frac{2M}{n}\right)\frac{\tilde{V}(k)}{k^2}+\frac{\theta^2\bar{M}}{n\mu^2}\frac{1}{k^2}.$$

Therefore, by denoting $c_3:=1+\frac{2ML^2}{n\mu^2}$ and $c_4:=1+\frac{2M}{n}$, we have

$$\tilde{U}(k)\le\prod_{t=K_1}^{k-1}\left(1-\frac{2\theta}{t}\right)\tilde{U}\left(K_1\right)+\sum_{t=K_1}^{k-1}\prod_{i=t+1}^{k-1}\left(1-\frac{2\theta}{i}\right)\left[\frac{\theta^2\bar{M}}{n\mu^2t^2}+\frac{\theta^2c_3\tilde{U}(t)}{t^2}+\frac{2\theta L}{\sqrt{n}\mu}\frac{\sqrt{\tilde{U}(t)\tilde{V}(t)}}{t}+\frac{\theta^2L^2c_4}{n\mu^2}\frac{\tilde{V}(t)}{t^2}\right].$$

From Lemma 11,

$$\tilde{U}(k)\le\frac{K_1^{2\theta}}{k^{2\theta}}\tilde{U}\left(K_1\right)+\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{k^{2\theta}}\left[\frac{\theta^2\bar{M}}{n\mu^2t^2}+\frac{\theta^2c_3\tilde{U}(t)}{t^2}+\frac{2\theta L}{\sqrt{n}\mu}\frac{\sqrt{\tilde{U}(t)\tilde{V}(t)}}{t}+\frac{\theta^2L^2c_4}{n\mu^2}\frac{\tilde{V}(t)}{t^2}\right]=\frac{1}{k^{2\theta}}\frac{\theta^2\bar{M}}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^2}+\frac{K_1^{2\theta}}{k^{2\theta}}\tilde{U}\left(K_1\right)+\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{k^{2\theta}}\left[\frac{\theta^2c_3\tilde{U}(t)}{t^2}+\frac{2\theta L}{\sqrt{n}\mu}\frac{\sqrt{\tilde{U}(t)\tilde{V}(t)}}{t}+\frac{\theta^2L^2c_4}{n\mu^2}\frac{\tilde{V}(t)}{t^2}\right].$$

Hence, by Corollary 1,

$$\tilde{U}(k)\le\frac{1}{k^{2\theta}}\frac{\theta^2\bar{M}}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^2}+\frac{K_1^{2\theta}}{k^{2\theta}}\tilde{U}\left(K_1\right)+\frac{\theta^2c_3}{k^{2\theta}}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^2}\left[\frac{\theta^2c_2}{(1.5\theta-1)n\mu^2}\frac{1}{t}+\frac{c}{t^2}\right]+\frac{1}{k^{2\theta}}\frac{2\theta L}{\sqrt{n}\mu}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t}\sqrt{\left[\frac{\theta^2c_2}{(1.5\theta-1)n\mu^2}\frac{1}{t}+\frac{c}{t^2}\right]\frac{\hat{V}}{t^2}}+\frac{1}{k^{2\theta}}\frac{\theta^2L^2c_4}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^2}\frac{\hat{V}}{t^2}.$$

Since

$$\sqrt{\left[\frac{\theta^2c_2}{(1.5\theta-1)n\mu^2}\frac{1}{t}+\frac{c}{t^2}\right]\frac{\hat{V}}{t^2}}\le\sqrt{\frac{\theta^2c_2\hat{V}}{(1.5\theta-1)n\mu^2}}\frac{1}{t^{1.5}}+\frac{\sqrt{c\hat{V}}}{t^2},$$

we have

$$\tilde{U}(k)\le\frac{1}{k^{2\theta}}\frac{\theta^2\bar{M}}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^2}+\frac{K_1^{2\theta}}{k^{2\theta}}\tilde{U}\left(K_1\right)+\frac{1}{k^{2\theta}}\frac{2\theta^2L}{n\mu^2}\sqrt{\frac{c_2\hat{V}}{1.5\theta-1}}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^{2.5}}+\frac{1}{k^{2\theta}}\left[\frac{\theta^4c_3c_2}{(1.5\theta-1)n\mu^2}+\frac{2\theta L\sqrt{c\hat{V}}}{\sqrt{n}\mu}\right]\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^3}+\frac{1}{k^{2\theta}}\left[\theta^2c_3c+\frac{\theta^2L^2c_4\hat{V}}{n\mu^2}\right]\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^4}.$$

Notice that $c_2=\mathcal{O}\left(\frac{A+B+n}{n}\right)$ and $c_3,c_4=\mathcal{O}(1)$. Following arguments similar to those in the proofs of Lemma 12 and Corollary 1, we have

$$\tilde{U}(k)\le\frac{\theta^2\bar{M}}{(2\theta-1)n\mu^2k}+\mathcal{O}\left(\frac{A+B+n}{n^{1.5}\left(1-\rho_w\right)}\right)\frac{1}{k^{1.5}}+\mathcal{O}\left(\frac{A+B+n}{n\left(1-\rho_w\right)^2}\right)\frac{1}{k^2}+\mathcal{O}\left(\frac{A+B+n}{n\left(1-\rho_w\right)^2}\right)\frac{1}{k^3}+\mathcal{O}\left(\frac{A+B+n}{n\left(1-\rho_w\right)^{2\theta}}\right)\frac{1}{k^{2\theta}}=\frac{\theta^2\bar{M}}{(2\theta-1)n\mu^2k}+\mathcal{O}\left(\frac{A+B+n}{n^{1.5}\left(1-\rho_w\right)}\right)\frac{1}{k^{1.5}}+\mathcal{O}\left(\frac{A+B+n}{n\left(1-\rho_w\right)^2}\right)\frac{1}{k^2}.$$

Noting that

$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|x_i(k)-x^*\right\|^2=\mathbb{E}\left\|\bar{x}(k)-x^*\right\|^2+\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|x_i(k)-\bar{x}(k)\right\|^2=U(k)+\frac{V(k)}{n},$$

and U(k)=U˜(k+K), in light of the bound on V(k) in Lemma 10 and the estimate of V^ in Lemma 13, we obtain the desired result. □

A. Comparison with Centralized Implementation

We compare the performance of DSGD and centralized stochastic gradient descent (SGD) stated below:

$$x(k+1)=x(k)-\alpha_k\tilde{g}(k),\qquad(30)$$

where $\alpha_k:=\frac{\theta}{\mu k}$ $(\theta>1)$ and $\tilde{g}(k):=\frac{1}{n}\sum_{i=1}^{n}g\left(x(k),\xi_i(k)\right)$.

First, we derive the convergence rate for SGD, which matches the optimal rate for such stochastic gradient methods (see [29], [53]). Our result relies on an analysis different from that in the literature, which typically assumes a compact feasible set and uniformly bounded stochastic gradients in expectation.

Theorem 2. Under the centralized stochastic gradient descent method (30), suppose $k \ge K_2 := \frac{\theta L}{\mu}$. We have

$$\mathbb E\left[\|x(k)-x^*\|^2\right] \le \frac{\theta^2\bar M}{(2\theta-1)n\mu^2 k} + \mathcal O\left(\frac{1}{n}\right)\frac{1}{k^2}.$$

Proof. Noting that αk ≤ 1/L when k ≥ K2, we have from Lemma 3 that

$$\begin{aligned}\mathbb E\left[\|x(k+1)-x^*\|^2 \mid \mathcal F(k)\right] &= \mathbb E\left[\|x(k)-\alpha_k\tilde g(k)-x^*\|^2 \mid \mathcal F(k)\right]\\ &= \|x(k)-\alpha_k\nabla f(x(k))-x^*\|^2 + \alpha_k^2\,\mathbb E\left[\|\nabla f(x(k))-\tilde g(k)\|^2 \mid \mathcal F(k)\right]\\ &\le (1-\alpha_k\mu)^2\|x(k)-x^*\|^2 + \alpha_k^2\left[\frac{2ML^2}{n}\|x(k)-x^*\|^2 + \frac{\bar M}{n}\right]\\ &= \left(1-\frac{2\theta}{k}\right)\|x(k)-x^*\|^2 + \theta^2\left(1+\frac{2ML^2}{n\mu^2}\right)\frac{\|x(k)-x^*\|^2}{k^2} + \frac{\theta^2\bar M}{n\mu^2}\frac{1}{k^2}. \tag{31}\end{aligned}$$

It can be shown first that $\mathbb E\left[\|x(k)-x^*\|^2\right] \le \frac{c_5}{k}$ for $k \ge K_2$, where $c_5 = \mathcal O\left(\frac{1}{n}\right)$.³ Denote $\bar c_5 := \left(1+\frac{2ML^2}{n\mu^2}\right)c_5$. Then from relation (31), when $k \ge K_2$,

$$\mathbb E\left[\|x(k)-x^*\|^2\right] \le \prod_{t=K_2}^{k-1}\left(1-\frac{2\theta}{t}\right)\mathbb E\left[\|x(K_2)-x^*\|^2\right] + \sum_{t=K_2}^{k-1}\prod_{i=t+1}^{k-1}\left(1-\frac{2\theta}{i}\right)\left[\frac{\theta^2\bar M}{n\mu^2 t^2} + \frac{\theta^2\bar c_5}{t^3}\right].$$

From Lemma 11,

$$\begin{aligned}\mathbb E\left[\|x(k)-x^*\|^2\right] &\le \frac{K_2^{2\theta}}{k^{2\theta}}\mathbb E\left[\|x(K_2)-x^*\|^2\right] + \sum_{t=K_2}^{k-1}\frac{(t+1)^{2\theta}}{k^{2\theta}}\left[\frac{\theta^2\bar M}{n\mu^2 t^2}+\frac{\theta^2\bar c_5}{t^3}\right]\\ &= \frac{1}{k^{2\theta}}\frac{\theta^2\bar M}{n\mu^2}\sum_{t=K_2}^{k-1}\frac{(t+1)^{2\theta}}{t^2} + \frac{K_2^{2\theta}}{k^{2\theta}}\mathbb E\left[\|x(K_2)-x^*\|^2\right] + \frac{\theta^2\bar c_5}{k^{2\theta}}\sum_{t=K_2}^{k-1}\frac{(t+1)^{2\theta}}{t^3}\\ &= \frac{\theta^2\bar M}{(2\theta-1)n\mu^2 k} + \mathcal O\left(\frac{1}{n}\right)\frac{1}{k^2}. \qquad\square\end{aligned}$$

Comparing the results of Theorem 1 and Theorem 2, we see that, asymptotically, DSGD and SGD achieve the same convergence rate $\frac{\theta^2\bar M}{(2\theta-1)n\mu^2 k}$. The next corollary identifies the time needed for DSGD to achieve this rate.

Corollary 2 (Transient Time). Suppose Assumptions 1–3 hold. Assume in addition that $\sum_{i=1}^n\|x_i(0)-x^*\|^2=\mathcal O(n)$ and $\sum_{i=1}^n\|\nabla f_i(x^*)\|^2=\mathcal O(n)$. It takes $K_T=\mathcal O\left(\frac{n}{(1-\rho_w)^2}\right)$ time for Algorithm (4) to reach the asymptotic rate of convergence, i.e., when $k\ge K_T$, we have $\frac{1}{n}\sum_{i=1}^n\mathbb E\left[\|x_i(k)-x^*\|^2\right] \le \frac{\theta^2\bar M}{(2\theta-1)n\mu^2 k}\,\mathcal O(1)$.

Proof. From (29),

$$\frac{1}{n}\sum_{i=1}^n\mathbb E\left[\|x_i(k)-x^*\|^2\right] \le \frac{\theta^2\bar M}{(2\theta-1)n\mu^2 k}\left[1 + \mathcal O\left(\frac{\sqrt n}{1-\rho_w}\right)\frac{1}{k^{0.5}} + \mathcal O\left(\frac{n}{(1-\rho_w)^2}\right)\frac{1}{k}\right].$$

Let $K_T$ be such that

$$\mathcal O\left(\frac{\sqrt n}{1-\rho_w}\right)\frac{1}{K_T^{0.5}} + \mathcal O\left(\frac{n}{(1-\rho_w)^2}\right)\frac{1}{K_T} = \mathcal O(1).$$

We then obtain

$$K_T=\mathcal O\left(\frac{n}{(1-\rho_w)^2}\right). \qquad\square$$

Remark 2. By assuming the additional conditions $\sum_{i=1}^n\|x_i(0)-x^*\|^2=\mathcal O(n)$ and $\sum_{i=1}^n\|\nabla f_i(x^*)\|^2=\mathcal O(n)$, motivated by the observation that each of these expressions is the sum of $n$ terms, we obtain a cleaner expression for the transient time. In general, we would obtain $K_T=\mathcal O\left(\frac{A+B+n}{(1-\rho_w)^2}\right)$.

Remark 3. For general connected networks such as line graphs, if we adopt the Lazy Metropolis rule for choosing the weights $[w_{ij}]$ (see [54]), then $\frac{1}{1-\rho_w}=\mathcal O(n^2)$, and hence $K_T=\mathcal O(n^5)$. The transient time can be improved for networks with special structures. For example, $\frac{1}{1-\rho_w}$ is constant with high probability for an Erdős–Rényi random graph, and consequently $K_T=\mathcal O(n)$ on such a graph.
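The $\frac{1}{1-\rho_w}=\mathcal O(n^2)$ scaling can be observed numerically. The sketch below (not from the paper) builds one common form of the Lazy Metropolis matrix on a line graph, $w_{ij}=\frac{1}{2\max(d_i,d_j)}$ for neighbors with the diagonal absorbing the remainder, and prints $\frac{1}{1-\rho_w}$ as $n$ grows:

```python
import numpy as np

def lazy_metropolis_line(n):
    """Lazy Metropolis mixing matrix on a line graph with n nodes:
    w_ij = 1/(2*max(d_i, d_j)) for neighbors, diagonal fills the rest."""
    W = np.zeros((n, n))
    deg = np.array([1] + [2] * (n - 2) + [1]) if n > 2 else np.ones(n)
    for i in range(n - 1):
        W[i, i + 1] = W[i + 1, i] = 1.0 / (2 * max(deg[i], deg[i + 1]))
    W += np.diag(1.0 - W.sum(axis=1))   # make the rows sum to one
    return W

for n in [10, 20, 40]:
    eig = np.sort(np.abs(np.linalg.eigvalsh(lazy_metropolis_line(n))))
    rho = eig[-2]                       # rho_w: second largest eigenvalue magnitude
    print(n, 1.0 / (1.0 - rho))         # grows roughly like n^2
```

Doubling $n$ should multiply $\frac{1}{1-\rho_w}$ by roughly four, consistent with the $\mathcal O(n^2)$ claim.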

The next theorem states that the transient time for DSGD to reach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$; that is, under Assumptions 1–3 and assuming $\sum_{i=1}^n\|x_i(0)-x^*\|^2=\mathcal O(n)$ and $\sum_{i=1}^n\|\nabla f_i(x^*)\|^2=\mathcal O(n)$, there exists an optimization problem whose transient time under DSGD is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$. This implies that the result in Corollary 2 is sharp and cannot be improved in general.

Theorem 3. Suppose Assumptions 1–3 hold. Assume in addition that $\sum_{i=1}^n\|x_i(0)-x^*\|^2=\mathcal O(n)$ and $\sum_{i=1}^n\|\nabla f_i(x^*)\|^2=\mathcal O(n)$. Then there exists a $\rho_0\in(0,1)$ such that if $\rho_w\ge\rho_0$, then the time needed for DSGD to reach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$.

Proof. We construct a “hard” optimization problem to prove the claimed result, inspired by [31]. Consider the quadratic objective functions $f_i(x):=\frac{1}{2}\left(x-x_i^*\right)^2$, where $x, x_i^*\in\mathbb R$. The optimal solution to Problem (1) is given by $x^*=\frac{1}{n}\sum_{i=1}^n x_i^*$. The DSGD algorithm implements:

$$\mathbf x(k+1) = W\left[\mathbf x(k) - \alpha_k\left(\mathbf x(k)-\mathbf x^*\right) + \alpha_k\mathbf n(k)\right], \tag{32}$$

where $\mathbf x^* := \left(x_1^*, x_2^*, \ldots, x_n^*\right)^\top$, and $\mathbf n(k)$ denotes the vector of gradient noise terms. From (12), we use the stepsize $\alpha_k=\frac{\theta}{k+K}$ ($\theta>2$), where $K=2\theta$ since $\mu = L = 1$. We rewrite (32) as

$$\mathbf x(k+1) = (1-\alpha_k)W\mathbf x(k) + \alpha_k W\mathbf x^* + \alpha_k W\mathbf n(k).$$

It follows that

$$\mathbf x(k+1)-\mathbf 1\bar x(k+1) = (1-\alpha_k)W\left(\mathbf x(k)-\mathbf 1\bar x(k)\right) + \alpha_k W\left(\mathbf x^*-\mathbf 1 x^*\right) + \alpha_k W\left(\mathbf n(k)-\mathbf 1\bar n(k)\right).$$

By induction, we have for all $k>0$,

$$\mathbf x(k)-\mathbf 1\bar x(k) = \prod_{t=0}^{k-1}(1-\alpha_t)W^k\left(\mathbf x(0)-\mathbf 1\bar x(0)\right) + \sum_{t=0}^{k-1}\prod_{j=t+1}^{k-1}(1-\alpha_j)\,\alpha_t W^{k-t}\left[\mathbf x^*-\mathbf 1 x^* + \left(\mathbf n(t)-\mathbf 1\bar n(t)\right)\right]. \tag{33}$$

Assume that: (i) the matrix $W$ is symmetric; (ii) $W\mathbf x^* = \rho_w\mathbf x^*$, i.e., $\mathbf x^*$ is an eigenvector of $W$ associated with the eigenvalue $\rho_w$ (hence $x^*=\frac{1}{n}\mathbf 1^\top\mathbf x^*=0$); (iii) $\|\nabla F(\mathbf 1 x^*)\|^2 = \|\mathbf 1 x^* - \mathbf x^*\|^2 = \|\mathbf x^*\|^2 = \Omega(n)$; (iv) $\mathbf x(0) = \mathbf x^*$.⁴ Then $\bar x(0)=x^*=0$, and from relation (33) it follows that

$$\mathbf x(k)-\mathbf 1\bar x(k) = \prod_{t=0}^{k-1}(1-\alpha_t)\rho_w^k\mathbf x^* + \sum_{t=0}^{k-1}\prod_{j=t+1}^{k-1}(1-\alpha_j)\,\alpha_t\rho_w^{k-t}\mathbf x^* + \boldsymbol\epsilon(k), \tag{34}$$

where $\boldsymbol\epsilon(k)$ captures the random perturbation caused by the gradient noise, which has mean zero. Therefore,

$$\mathbb E\left[\|\mathbf x(k)-\mathbf 1\bar x(k)\|^2\right] \ge \left[\sum_{t=0}^{k-1}\prod_{j=t+1}^{k-1}(1-\alpha_j)\,\alpha_t\rho_w^{k-t}\right]^2\|\mathbf x^*\|^2.$$

Recalling the definitions $V(k)=\mathbb E\left[\|\mathbf x(k)-\mathbf 1\bar x(k)\|^2\right]$ and $\tilde V(k)=V(k-K)$, and noticing that $\alpha_k=\frac{\theta}{k+K}$, we have

$$\tilde V(k) \ge \left[\sum_{t=K}^{k-1}\prod_{j=t+1}^{k-1}\left(1-\frac{\theta}{j}\right)\frac{\theta}{t}\rho_w^{k-t}\right]^2\|\mathbf x^*\|^2 \ge \left[\sum_{t=K}^{k-1}\frac{(t+1)^{2\theta}}{k^{2\theta}}\frac{\theta}{t}\rho_w^{k-t}\right]^2\|\mathbf x^*\|^2,$$

where we invoked Lemma 11 for the second inequality. Then,

$$\tilde V(k) \ge \left[\frac{\theta\rho_w^k}{k^{2\theta}}\sum_{t=K}^{k-1}(t+1)^{2\theta-1}\rho_w^{-t}\right]^2\|\mathbf x^*\|^2 \ge \left[\frac{\theta\rho_w^k}{k^{2\theta}}\int_{K-1}^{k-1}(t+1)^{2\theta-1}\rho_w^{-t}\,dt\right]^2\|\mathbf x^*\|^2. \tag{35}$$

Note that when $k\ge\frac{4\theta}{-\ln\rho_w}$,

$$\int_{K-1}^{k-1}(t+1)^{2\theta-1}\rho_w^{-t}\,dt \ge \frac{2}{3}\left[\frac{(t+1)^{2\theta-1}}{-\ln\rho_w}\rho_w^{-t}\right]_{t=K-1}^{k-1} = \frac{2k^{2\theta-1}}{-3\ln\rho_w}\rho_w^{-(k-1)} - \frac{2K^{2\theta-1}}{-3\ln\rho_w}\rho_w^{-(K-1)} \ge \frac{k^{2\theta-1}}{-2\ln\rho_w}\rho_w^{-(k-1)}.$$

From (35),

$$\tilde V(k) \ge \left(\frac{\theta\rho_w}{2\ln\rho_w}\right)^2\frac{\|\mathbf x^*\|^2}{k^2} = \Omega\left(\frac{n}{(1-\rho_w)^2}\right)\frac{1}{k^2},$$

where the equality is obtained from the Taylor expansion $\ln\rho_w = -(1-\rho_w)+\mathcal O\left((1-\rho_w)^2\right)$ as $\rho_w\to1$. Since

$$\frac{1}{n}\sum_{i=1}^n\mathbb E\left[\|x_i(k)-x^*\|^2\right] = U(k)+\frac{V(k)}{n} \ge \Omega\left(\frac{1}{(1-\rho_w)^2}\right)\frac{1}{k^2},$$

setting this quantity to be at most $\frac{\bar M}{nk}$, we obtain that the transient time for DSGD to reach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, based on an argument similar to that of Corollary 2. □
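The hard instance above is easy to instantiate numerically. The sketch below (an illustration, not the paper's experiment) builds a symmetric doubly stochastic $W$ (a lazy random walk on a ring), takes $\mathbf x^*$ as an eigenvector for the second largest eigenvalue $\rho_w$ scaled so that $\|\mathbf x^*\|^2=n$, and runs the noiseless part of recursion (32) from $\mathbf x(0)=\mathbf x^*$:

```python
import numpy as np

def hard_instance(n=30, theta=3.0, iters=2000):
    """Noiseless recursion (32) on the 'hard' quadratic instance, with
    x* an eigenvector of W for rho_w and x(0) = x*. Returns the final
    consensus error ||x - 1*x_bar||^2 and rho_w."""
    W = np.zeros((n, n))                   # lazy random walk on a ring
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25
    vals, vecs = np.linalg.eigh(W)
    rho_w = vals[-2]                       # second largest eigenvalue
    xs = vecs[:, -2] * np.sqrt(n)          # x*, scaled so ||x*||^2 = n
    K = int(2 * theta)
    x = xs.copy()                          # assumption (iv): x(0) = x*
    for k in range(iters):
        a = theta / (k + K)                # stepsize theta/(k+K)
        x = W @ (x - a * (x - xs))         # recursion (32) without noise
    V = np.linalg.norm(x - x.mean()) ** 2  # consensus error
    return V, rho_w

V, rho_w = hard_instance()
print(V, rho_w)
```

The returned error stays bounded away from zero far longer than on a well-connected graph, reflecting the $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)\frac{1}{k^2}$ lower bound.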

V. Numerical Examples

In this section, we provide two numerical examples to verify and complement our theoretical findings.

A. Ridge Regression

Consider the online ridge regression problem, i.e.,

$$\min_{x\in\mathbb R^p} f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x), \qquad f_i(x)=\mathbb E_{u_i,v_i}\left[\left(u_i^\top x-v_i\right)^2\right]+\rho\|x\|^2, \tag{36}$$

where $\rho>0$ is a penalty parameter. Each agent $i$ continuously collects data samples of the form $(u_i, v_i)$, where $u_i\in\mathbb R^p$ are the features and $v_i\in\mathbb R$ are the observed outputs. Assume each $u_i\in[-0.5,0.5]^p$ is uniformly distributed, and $v_i$ is drawn according to $v_i=u_i^\top\tilde x_i+\varepsilon_i$, where $\tilde x_i$ are predefined parameters evenly located in $[0,10]^p$, and $\varepsilon_i$ are independent Gaussian random variables (noise) with mean 0 and variance 0.01. Given a pair $(u_i, v_i)$, agent $i$ can compute an unbiased estimate of $\nabla f_i(x)$: $g_i\left(x,u_i,v_i\right)=2\left(u_i^\top x-v_i\right)u_i+2\rho x$. Problem (36) has a unique solution $x^*$ given by

$$x^* = \left(\sum_{i=1}^n\mathbb E\left[u_iu_i^\top\right]+n\rho\mathbf I\right)^{-1}\sum_{i=1}^n\mathbb E\left[u_iu_i^\top\right]\tilde x_i = \frac{1}{3}\left(\frac{1}{3}+\rho\right)^{-1}\frac{1}{n}\sum_{i=1}^n\tilde x_i. \tag{37}$$

Suppose $p = 10$ and $\rho = 1$. We compare the performance of DSGD (3) and the centralized implementation (30) for solving problem (36) with the same stepsize policy $\alpha_k = 20/(k+20)$, $\forall k$, and the same initial solutions: $x_i(0)=\mathbf 0$, $\forall i$ (DSGD), and $x(0)=\mathbf 0$ (SGD). It can be seen from (37) and the definition of $\tilde x_i$ that $\sum_{i=1}^n\|x_i(0)-x^*\|^2=\mathcal O(n)$. Moreover, $\nabla f_i(x^*)=2\mathbb E_{u_i,v_i}\left[\left(u_i^\top x^*-v_i\right)u_i\right]+2\rho x^* = 2\mathbb E\left[u_iu_i^\top\right]\left(x^*-\tilde x_i\right)+2\rho x^* = \frac{2}{3}\left(x^*-\tilde x_i\right)+2\rho x^*$. Therefore, we have $\sum_{i=1}^n\|\nabla f_i(x^*)\|^2=\mathcal O(n)$.
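The stochastic gradient oracle above, together with one adapt-then-combine DSGD iteration, can be sketched as follows (an illustrative sketch, not the paper's experiment code; `X` and `G` are hypothetical stacked per-agent variables):

```python
import numpy as np

def ridge_grad(x, u, v, rho=1.0):
    """Unbiased gradient estimate g_i(x, u_i, v_i) = 2(u^T x - v)u + 2*rho*x."""
    return 2.0 * (u @ x - v) * u + 2.0 * rho * x

def dsgd_step(X, W, alpha, G):
    """One DSGD iteration in adapt-then-combine form: each agent takes a
    local stochastic gradient step, then mixes with its neighbors via W.
    X (iterates) and G (gradients) hold one row per agent."""
    return W @ (X - alpha * G)
```

Iterating `dsgd_step` with fresh samples per agent and the stepsize policy $\alpha_k=20/(k+20)$ reproduces the comparison described above.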

In Fig. 1, we provide an illustrative example that compares the performance of DSGD and SGD, assuming $n = 25$. For DSGD, we consider two different network topologies: the ring topology shown in Fig. 2(a) and the square grid topology shown in Fig. 3(a). For both topologies, we use Metropolis weights to construct the mixing matrix $W$ (see [55]). It can be seen that DSGD performs asymptotically as well as SGD, while the time it takes for DSGD to catch up with SGD depends on the network topology. For grid networks, which are better connected than rings, the transient time is shorter.

Fig. 1.

The performance comparison between DSGD and SGD for online ridge regression (n = 25). The results are averaged over 200 Monte Carlo simulations.

Fig. 2.

Comparison of the transient times for DSGD and $\frac{4n}{(1-\rho_w)^2}$ as a function of the network size n for the ring network topology. The expected errors are approximated by averaging over 200 simulation results.

To further verify the conclusions of Corollary 2 and Theorem 3, we define the transient time for DSGD as $\inf\left\{k: \frac{1}{n}\sum_{i=1}^n\mathbb E\left[\|x_i(k)-x^*\|^2\right] \le 2\,\mathbb E\left[\|x(k)-x^*\|^2\right]\right\}$. For DSGD, we first assume a ring network topology and plot the transient times for DSGD and $\frac{4n}{(1-\rho_w)^2}$ as a function of the network size $n$ in Fig. 2(b). We then consider the square grid network topology shown in Fig. 3(a) and plot the transient times for DSGD and $\frac{7n}{(1-\rho_w)^2}$ in Fig. 3(b). It can be seen that the two curves in Fig. 2(b) and Fig. 3(b) are close to each other, respectively. This verifies the sharpness of Corollary 2.
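Given arrays of estimated expected errors for the distributed and centralized methods, the transient-time definition above amounts to a first-crossing search; a minimal helper (an assumption-free utility, not the paper's code):

```python
import numpy as np

def transient_time(dist_err, cent_err):
    """First index k with (1/n) sum_i E||x_i(k)-x*||^2 <= 2 E||x(k)-x*||^2,
    given arrays of (estimated) expected errors; returns len(dist_err) if
    the threshold is never reached within the horizon."""
    hit = np.nonzero(np.asarray(dist_err) <= 2.0 * np.asarray(cent_err))[0]
    return int(hit[0]) if hit.size else len(dist_err)
```

In practice the expectations are replaced by Monte Carlo averages over repeated runs, as done for the figures.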

Fig. 3.

Comparison of the transient times for DSGD and $\frac{7n}{(1-\rho_w)^2}$ as a function of the network size n for the square grid network topology (n = 4, 9, 16, 25, 36, 49, 64, 81, 100). The expected errors are approximated by averaging over 200 simulation results.

B. Logistic Regression

Consider the problem of classification on the MNIST dataset of handwritten digits (http://yann.lecun.com/exdb/mnist/). In particular, we classify digits 1 and 2 using logistic regression.⁵ There are 12700 data points in total, where each data point is a pair (u, v) with $u\in\mathbb R^{785}$ being the image input and $v\in\{0,1\}$ being the label.⁶

Suppose each agent i𝓝 possesses a distinct local dataset 𝓢i that is randomly taken from the database. To apply logistic regression for classification, we solve the following optimization problem based on all the agents’ local datasets:

$$\min_{x\in\mathbb R^{785}} f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x), \tag{38}$$

where

$$f_i(x) := \frac{1}{|\mathcal S_i|}\sum_{j\in\mathcal S_i}\left[\log\left(1+\exp\left(-x^\top u_j\right)\right)+\left(1-v_j\right)x^\top u_j\right]+\frac{\lambda}{2}\|x\|^2,$$

and $\lambda$ is the regularization parameter.⁷ Given any solution $x$, agent $i$ is able to compute an unbiased estimate of $\nabla f_i(x)$ using one (or a minibatch of) randomly chosen data point(s) $(u_j, v_j)$ from $\mathcal S_i$, that is,

$$g_i\left(x,u_j,v_j\right) = -\frac{u_j}{1+\exp\left(x^\top u_j\right)}+\left(1-v_j\right)u_j+\lambda x.$$
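The per-sample gradient above is straightforward to implement; a minimal sketch (not the paper's experiment code):

```python
import numpy as np

def logistic_grad(x, u, v, lam=1.0):
    """Unbiased stochastic gradient of f_i at x for one sample (u, v):
    -u/(1+exp(x^T u)) + (1-v)u + lam*x."""
    return -u / (1.0 + np.exp(x @ u)) + (1.0 - v) * u + lam * x
```

At $x=\mathbf 0$ and $\lambda=0$ this reduces to $(1/2-v)\,u$, which is a quick sanity check on the sign convention.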

In the experiments, each local dataset $\mathcal S_i$ contains 50 data points, and $\lambda = 1$. At each iteration of the DSGD algorithm, agent $i$ computes a stochastic gradient of $f_i(x_i(k))$ using one randomly chosen data point from $\mathcal S_i$. We compare the performance of DSGD (3) and centralized SGD (30) for solving problem (38) with the same stepsize policy $\alpha_k = 6/(k+20)$, $\forall k$, and the same initial solutions: $x_i(0)=\mathbf 0$, $\forall i$ (DSGD), and $x(0)=\mathbf 0$ (SGD). It can be numerically verified that $\sum_{i=1}^n\|x_i(0)-x^*\|^2=\mathcal O(n)$ and $\sum_{i=1}^n\|\nabla f_i(x^*)\|^2=\mathcal O(n)$.

The transient time for DSGD is defined in the same way as in the ridge regression example. In Fig. 4 and Fig. 5, we plot the transient times for DSGD as a function of the network size $n$ for ring and grid networks, respectively. We find that the curves are close to $\frac{n}{4}\frac{1}{(1-\rho_w)^{1.5}}$ rather than a multiple of $\frac{n}{(1-\rho_w)^2}$, implying that the experimental results are better than the theoretical worst-case bound given in Corollary 2. Hence, in practice, the performance of the DSGD algorithm depends on the specific problem instance and can be better than the worst case in terms of transient times.

Fig. 4.

Comparison of the transient times for DSGD and $\frac{n}{4}\frac{1}{(1-\rho_w)^{1.5}}$ as a function of the network size n for the ring network topology. The expected errors are approximated by averaging over 200 simulation results.

Fig. 5.

Comparison of the transient times for DSGD and $\frac{n}{4}\frac{1}{(1-\rho_w)^{1.5}}$ as a function of the network size n for the grid network topology (n = 4, 9, 16, 25, 36, 49, 64, 81, 100). The expected errors are approximated by averaging over 200 simulation results.

VI. Conclusions

This paper is devoted to the non-asymptotic analysis of network independence for the distributed stochastic gradient descent (DSGD) method. We show that, in expectation, the algorithm asymptotically achieves the optimal network-independent convergence rate of centralized SGD, and we characterize the non-asymptotic convergence rate as a function of properties of the objective functions and the network. In addition, we compute the time needed for DSGD to reach its asymptotic rate of convergence and prove the sharpness of the obtained result. Future work will consider more general problems such as nonconvex objectives and constrained optimization. It will also be of interest to explore the transient times of asynchronous distributed stochastic gradient algorithms, which enjoy greater flexibility and lower communication overhead.

Acknowledgments

This work was partially supported by the NSF under grants IIS-1914792, DMS-1664644, CNS-1645681, and ECCS-1933027, by the ONR under MURI grant N00014-19-1-2571, by the NIH under grants R01 GM135930 and UL54 TR004130, by the DOE under grants DE-AR-0001282 and DE-EE0009696, by the Boston University Kilachand Fund for Integrated Life Science and Engineering, by the Shenzhen Research Institute of Big Data (SRIBD) under grant J00120190011, and by the NSFC under grant 62003287.

Biographies


Shi Pu is currently an assistant professor in the School of Data Science, The Chinese University of Hong Kong, Shenzhen, China. He is also affiliated with Shenzhen Research Institute of Big Data. He received a B.S. Degree from Peking University, in 2012, and a Ph.D. Degree in Systems Engineering from the University of Virginia, in 2016. He was a postdoctoral associate at the University of Florida, from 2016 to 2017, a postdoctoral scholar at Arizona State University, from 2017 to 2018, and a postdoctoral associate at Boston University, from 2018 to 2019. His research interests include distributed optimization, network science, machine learning, and game theory.


Alex Olshevsky received the B.S. degrees in applied mathematics and electrical engineering from Georgia Tech and the Ph.D. degree in EECS from MIT. He is currently an Associate Professor at the ECE department at Boston University. Dr. Olshevsky is a recipient of the NSF CAREER Award, the AFOSR Young Investigator Award, the INFORMS Prize for the best paper on the interface of operations research and computer science, a SIAM Award for annual paper from the SIAM Journal on Control and Optimization chosen to be reprinted in SIAM Review, and an IMIA award for best paper on clinical informatics.


Ioannis Ch. Paschalidis (M’96–SM’06–F’14) received a Ph.D. in EECS from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 1996. He is a Professor at Boston University, Boston, MA and the Director of the Center for Information and Systems Engineering. His research interests lie in the fields of systems and control, optimization, stochastic systems, machine learning, and computational biology and medicine. He is a recipient of the NSF CAREER award, several best paper and best algorithmic performance awards, and a 2014 IBM/IEEE Smarter Planet Challenge Award. He was an invited participant at the 2002 Frontiers of Engineering Symposium, organized by the U.S. National Academy of Engineering and the 2014 U.S. National Academies Keck Futures Initiative (NAFKI) Conference. During 2013–2019 he was the founding Editor-in-Chief of the IEEE Transactions on Control of Network Systems.

Footnotes

1

Note that in [1] this method was called “Adapt-then-Combine”.

2

The assumption can be generalized to the case where the agents have different µ and L.

3

The argument here is similar to that in the proof for Lemma 12.

4

Assumptions (iii) and (iv) correspond to the conditions x(0)1x*2=𝓞(n) and F1x*2=𝓞(n) assumed in the main results such as Theorem 1 and Corollary 2.

5

The problem can be extended to classifying all 10 handwritten digits with multinomial logistic regression.

6

Digit 1 is represented by label 0 and digit 2 is represented by label 1.

7

The obtained optimal solution x* of problem (38) can then be used for predicting the label for any image input u through the decision function h(u):=11+expx*u.

Contributor Information

Shi Pu, School of Data Science, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China..

Alex Olshevsky, Department of Electrical and Computer Engineering and the Division of Systems Engineering, Boston University, Boston, MA.

Ioannis Ch. Paschalidis, Department of Electrical and Computer Engineering and the Division of Systems Engineering, Boston University, Boston, MA

References

  • [1].Chen J and Sayed AH, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, 2012. [Google Scholar]
  • [2].Forrester AI, Sóbester A, and Keane AJ, “Multi-fidelity optimization via surrogate modelling,” Proceedings of the Royal Society of London A, vol. 463, no. 2088, pp. 3251–3269, 2007. [Google Scholar]
  • [3].Nedić A, Olshevsky A, and Uribe CA, “Fast convergence rates for distributed non-bayesian learning,” IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 5538–5553, 2017. [Google Scholar]
  • [4].Cohen K, Nedić A, and Srikant R, “On projected stochastic gradient descent algorithm with weighted averaging for least squares regression,” IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 5974–5981, 2017. [Google Scholar]
  • [5].Baingana B, Mateos G, and Giannakis GB, “Proximal-gradient algorithms for tracking cascades over social networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 4, pp. 563–575, 2014. [Google Scholar]
  • [6].Ying B, Yuan K, and Sayed AH, “Supervised learning under distributed features,” IEEE Transactions on Signal Processing, vol. 67, no. 4, pp. 977–992, 2018. [Google Scholar]
  • [7].Alghunaim SA and Sayed AH, “Distributed coupled multi-agent stochastic optimization,” IEEE Transactions on Automatic Control, 2019. [Google Scholar]
  • [8].Brisimi TS, Chen R, Mela T, Olshevsky A, Paschalidis IC, and Shi W, “Federated learning of predictive models from federated electronic health records,” International journal of medical informatics, vol. 112, pp. 59–67, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Cohen K, Nedić A, and Srikant R, “Distributed learning algorithms for spectrum sharing in spatial random access wireless networks,” IEEE Transactions on Automatic Control, vol. 62, no. 6, pp. 2854–2869, 2017. [Google Scholar]
  • [10].Mateos G and Giannakis GB, “Distributed recursive least-squares: Stability and performance analysis,” IEEE Transactions on Signal Processing, vol. 60, no. 7, pp. 3740–3754, 2012. [Google Scholar]
  • [11].Reisizadeh A, Mokhtari A, Hassani H, and Pedarsani R, “An exact quantized decentralized gradient descent algorithm,” IEEE Transactions on Signal Processing, vol. 67, no. 19, pp. 4934–4947, 2019. [Google Scholar]
  • [12].Tsitsiklis J, Bertsekas D, and Athans M, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE transactions on automatic control, vol. 31, no. 9, pp. 803–812, 1986. [Google Scholar]
  • [13].Nedić A and Ozdaglar A, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009. [Google Scholar]
  • [14].Nedić A, Ozdaglar A, and Parrilo PA, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010. [Google Scholar]
  • [15].Lobel I, Ozdaglar A, and Feijer D, “Distributed multi-agent optimization with state-dependent communication,” Mathematical programming, vol. 129, no. 2, pp. 255–284, 2011. [Google Scholar]
  • [16].Jakovetić D, Xavier J, and Moura JM, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014. [Google Scholar]
  • [17].Kia SS, Cortés J, and Martínez S, “Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication,” Automatica, vol. 55, pp. 254–264, 2015. [Google Scholar]
  • [18].Shi W, Ling Q, Wu G, and Yin W, “Extra: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015. [Google Scholar]
  • [19].Di Lorenzo P and Scutari G, “Next: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016. [Google Scholar]
  • [20].Qu G and Li N, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, 2017. [Google Scholar]
  • [21].Nedić A, Olshevsky A, and Shi W, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017. [Google Scholar]
  • [22].Xu J, Zhu S, Soh YC, and Xie L, “Convergence of asynchronous distributed gradient methods over stochastic networks,” IEEE Transactions on Automatic Control, vol. 63, no. 2, pp. 434–448, 2017. [Google Scholar]
  • [23].Pu S, Shi W, Xu J, and Nedić A, “Push-pull gradient methods for distributed optimization in networks,” IEEE Transactions on Automatic Control, 2020. [Google Scholar]
  • [24].Chen J and Sayed AH, “On the limiting behavior of distributed optimization strategies,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2012, pp. 1535–1542. [Google Scholar]
  • [25].——, “On the learning behavior of adaptive networks—part i: Transient analysis,” IEEE Transactions on Information Theory, vol. 61, no. 6, pp. 3487–3517, 2015. [Google Scholar]
  • [26].——, “On the learning behavior of adaptive networks—part ii: Performance analysis,” IEEE Transactions on Information Theory, vol. 61, no. 6, pp. 3518–3548, 2015. [Google Scholar]
  • [27].Robbins H and Monro S, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951. [Google Scholar]
  • [28].Kiefer J, Wolfowitz J et al. , “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952. [Google Scholar]
  • [29].Nemirovski A, Juditsky A, Lan G, and Shapiro A, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on optimization, vol. 19, no. 4, pp. 1574–1609, 2009. [Google Scholar]
  • [30].Srivastava K and Nedic A, “Distributed asynchronous constrained stochastic optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 772–790, 2011. [Google Scholar]
  • [31].Duchi JC, Agarwal A, and Wainwright MJ, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic control, vol. 57, no. 3, pp. 592–606, 2012. [Google Scholar]
  • [32].Bianchi P and Jakubowicz J, “Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization,” IEEE Transactions on Automatic Control, vol. 58, no. 2, pp. 391–405, 2013. [Google Scholar]
  • [33].Towfic ZJ and Sayed AH, “Adaptive penalty-based distributed stochastic convex optimization,” Signal Processing, IEEE Transactions on, vol. 62, no. 15, pp. 3924–3938, 2014. [Google Scholar]
  • [34].Chatzipanagiotis N and Zavlanos MM, “A distributed algorithm for convex constrained optimization under noise,” IEEE Transactions on Automatic Control, vol. 61, no. 9, pp. 2496–2511, 2016. [Google Scholar]
  • [35].Nedić A and Olshevsky A, “Stochastic gradient-push for strongly convex functions on time-varying directed graphs,” IEEE Transactions on Automatic Control, vol. 61, no. 12, pp. 3936–3947, 2016. [Google Scholar]
  • [36].Sayin MO, Vanli ND, Kozat SS, and Başar T, “Stochastic subgradient algorithms for strongly convex optimization over distributed networks,” IEEE Transactions on Network Science and Engineering, vol. 4, no. 4, pp. 248–260, 2017. [Google Scholar]
  • [37].Lan G, Lee S, and Zhou Y, “Communication-efficient algorithms for decentralized and stochastic optimization,” Mathematical Programming, pp. 1–48, 2017. [Google Scholar]
  • [38].Sirb B and Ye X, “Decentralized consensus algorithm with delayed and stochastic gradients,” SIAM Journal on Optimization, vol. 28, no. 2, pp. 1232–1254, 2018. [Google Scholar]
  • [39].Jakovetic D, Bajovic D, Sahu AK, and Kar S, “Convergence rates for distributed stochastic optimization over random networks,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 4238–4245. [Google Scholar]
  • [40].Xin R, Khan UA, and Kar S, “Variance-reduced decentralized stochastic optimization with gradient tracking,” arXiv preprint arXiv:1909.11774, 2019. [Google Scholar]
  • [41].Morral G, Bianchi P, and Fort G, “Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks,” in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 1476–1481. [Google Scholar]
  • [42].——, “Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks,” IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2798–2813, 2017. [Google Scholar]
  • [43].Towfic ZJ, Chen J, and Sayed AH, “Excess-risk of distributed stochastic learners,” IEEE Transactions on Information Theory, vol. 62, no. 10, pp. 5753–5785, 2016. [Google Scholar]
  • [44].Pu S and Garcia A, “A flocking-based approach for distributed stochastic optimization,” Operations Research, vol. 1, pp. 267–281, 2018. [Google Scholar]
  • [45].——, “Swarming for faster convergence in stochastic optimization,” SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2997–3020, 2018. [Google Scholar]
  • [46].Lian X, Zhang C, Zhang H, Hsieh C-J, Zhang W, and Liu J, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 5336–5346.
  • [47].Assran M, Loizou N, Ballas N, and Rabbat M, “Stochastic gradient push for distributed deep learning,” in International Conference on Machine Learning. PMLR, 2019, pp. 344–353. [Google Scholar]
  • [48].Pu S and Nedić A, “Distributed stochastic gradient tracking methods,” Mathematical Programming, vol. 187, no. 1, pp. 409–457, 2021. [Google Scholar]
  • [49].Spiridonoff A, Olshevsky A, and Paschalidis IC, “Robust asynchronous stochastic gradient-push: Asymptotically optimal and network-independent performance for strongly convex functions.” Journal of Machine Learning Research, vol. 21, no. 58, pp. 1–47, 2020. [PMC free article] [PubMed] [Google Scholar]
  • [50].Koloskova A, Stich S, and Jaggi M, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in International Conference on Machine Learning, 2019, pp. 3478–3487. [Google Scholar]
  • [51].Pu S, Olshevsky A, and Paschalidis IC, “Asymptotic network independence in distributed stochastic optimization for machine learning: Examining distributed and centralized stochastic gradient descent,” IEEE signal processing magazine, vol. 37, no. 3, pp. 114–122, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [52].Williams D, Probability with martingales Cambridge university press, 1991. [Google Scholar]
  • [53].Rakhlin A, Shamir O, and Sridharan K, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proceedings of the 29th International Coference on International Conference on Machine Learning. Omnipress, 2012, pp. 1571–1578. [Google Scholar]
  • [54].Olshevsky A, “Linear time average consensus and distributed optimization on fixed graphs,” SIAM Journal on Control and Optimization, vol. 55, no. 6, pp. 3990–4014, 2017. [Google Scholar]
  • [55].Nedić A, Olshevsky A, and Rabbat MG, “Network topology and communication-computation tradeoffs in decentralized optimization,” Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018. [Google Scholar]
