Abstract
This paper is concerned with minimizing the average of n cost functions over a network in which agents may communicate and exchange information with each other. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, in expectation, DSGD asymptotically achieves the optimal network-independent convergence rate of centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach this asymptotic rate. Moreover, we construct a “hard” optimization problem that proves the sharpness of the obtained result. Numerical experiments demonstrate the tightness of the theoretical results.
Index Terms—distributed optimization, convex optimization, stochastic programming, stochastic gradient descent
I. Introduction
We consider the distributed optimization problem where a group of agents collaboratively seek a decision variable $x$ that minimizes the average of n cost functions:
| $\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ | (1) |
Each local cost function is strongly convex, with Lipschitz continuous gradient, and is known by agent i only. The agents communicate and exchange information over a network. Problems in the form of (1) find applications in multi-agent target seeking [1], distributed machine learning [2], [3], [4], [5], [6], [7], [8], and wireless networks [9], [10], [5], among other scenarios.
In order to solve (1), we assume that at each iteration k ≥ 0, the algorithm we study is able to obtain noisy gradient estimates gi(xi(k), ξi(k)), where xi(k) is the input for agent i, that satisfy the following condition.
Assumption 1. For all k ≥ 0, the random vectors ξi(k) are independent across the agents. Denote by $\mathcal{F}_k$ the σ-algebra generated by the randomness of the algorithm up to iteration k. Then,
| $\mathbb{E}\left[g_i(x_i(k), \xi_i(k)) \mid \mathcal{F}_k\right] = \nabla f_i(x_i(k)), \qquad \mathbb{E}\left[\|g_i(x_i(k), \xi_i(k)) - \nabla f_i(x_i(k))\|^2 \mid \mathcal{F}_k\right] \le \sigma^2$ | (2) |
Stochastic gradients appear in many machine learning problems. For example, suppose represents the expected loss function for agent i, where ξi are independent data samples gathered over time, and represents the data distribution. Then for any x and ξi sampled from , is an unbiased estimator of ∇fi(x). For another example, suppose denotes an empirical risk function, where is the local dataset for agent i. In this setting, the gradient estimation of fi(x) can incur noise from various sources such as minibatch random sampling of the local dataset and discretization for reducing communication cost [11].
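As a small illustration of the second example above, the following sketch checks that a single-sample gradient of a least-squares empirical risk is an unbiased estimator of the full gradient, as in Assumption 1. The dataset and loss are hypothetical stand-ins, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical local empirical risk for one agent: least squares on 50 samples.
U = rng.normal(size=(50, 3))                 # feature matrix
v = U @ np.array([1.0, -2.0, 0.5])           # targets from a fixed parameter

def full_gradient(x):
    # Gradient of f_i(x) = (1/m) * sum_j 0.5 * (u_j^T x - v_j)^2.
    return U.T @ (U @ x - v) / len(v)

def stochastic_gradient(x, j):
    # Gradient from a single data point j, sampled uniformly at random.
    return U[j] * (U[j] @ x - v[j])

x = rng.normal(size=3)
# Averaging over all data points recovers the full gradient, i.e., the
# single-sample gradient is unbiased (the expectation condition in (2)).
avg = np.mean([stochastic_gradient(x, j) for j in range(len(v))], axis=0)
print(np.allclose(avg, full_gradient(x)))  # True
```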
Problem (1) has been studied extensively in the literature under various distributed algorithms [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], among which the distributed gradient descent (DGD) method proposed in [13] has drawn the greatest attention. Recently, distributed implementation of stochastic gradient algorithms has received considerable interest. Several works have shown that distributed methods may perform comparably to their centralized counterparts under certain conditions. For example, the work in [24], [25], [26] first showed that, with a sufficiently small constant stepsize, a distributed stochastic gradient method achieves performance comparable to a centralized method in terms of the steady-state mean-square-error.
Despite the aforementioned efforts, it is unclear how long, i.e., how many iterations, it takes for a distributed stochastic gradient method to reach the convergence rate of centralized SGD. The number of required iterations, called the “transient time” of the algorithm, is a key measure of the performance of the distributed implementation. In this work, we perform a non-asymptotic analysis for the distributed stochastic gradient descent (DSGD) method adapted from DGD and the diffusion strategy [1].1 In addition to showing that in expectation, the algorithm asymptotically achieves the optimal convergence rate enjoyed by a centralized scheme, we precisely identify its non-asymptotic convergence rate as a function of characteristics of the objective functions and the network (e.g., the spectral gap (1 − ρw) of the mixing matrix). Furthermore, we characterize the transient time needed for DSGD to achieve the optimal rate of convergence, which behaves as $\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$ assuming certain conditions on the objective functions, stepsize policy and initial solutions. Finally, we construct a “hard” optimization problem for which we show the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, implying the obtained transient time is sharp. These results are new to the best of our knowledge.
A. Related Works
We briefly discuss the related literature on (distributed) stochastic optimization. First of all, our work is related to stochastic approximation (SA) methods dating back to the seminal works [27] and [28]. For a strongly convex objective function f with Lipschitz continuous gradients, it has been shown that the optimal convergence rate for solving the problem is $\mathcal{O}(1/k)$ under a diminishing stepsize policy [29].
Distributed stochastic gradient methods have received much attention in the recent years. For nonsmooth convex objective functions, the work in [30] considered distributed constrained optimization and established asymptotic convergence to the optimal set using two diminishing stepsizes to account for communication noise and subgradient errors, respectively. The paper [31] proposed a distributed dual averaging method which exhibits a convergence rate of under a carefully chosen SA stepsize sequence, where λ2(W) is the second largest singular value of the mixing matrix W. A projected stochastic gradient algorithm was considered in [32] for solving nonconvex optimization problems by combining a local stochastic gradient update and a gossip step. This work proved that consensus is asymptotically achieved and the solutions converge to the set of KKT points with SA stepsizes. In [33], the authors proposed an adaptive diffusion algorithm based on penalty methods and showed that the expected optimization error is bounded by under a constant stepsize α. The work in [34] considered distributed constrained convex optimization under multiple noise terms in both computation and communication stages. By means of an augmented Lagrangian framework, almost sure convergence with a diminishing stepsize policy was established. [35] investigated a subgradient-push method for distributed optimization over time-varying directed graphs. For strongly convex objective functions, the method exhibits an convergence rate. The work in [36] used a time-dependent weighted mixing of stochastic subgradient updates to achieve an convergence rate for minimizing the sum of nonsmooth strongly convex functions. [37] presented a new class of distributed first-order methods for nonsmooth and stochastic optimization which was shown to exhibit an (respectively, ) convergence rate for minimizing the sum of strongly convex functions (respectively, convex functions). 
The work in [38] considered a decentralized algorithm with delayed gradient information which achieves an rate of convergence for general convex functions. In [39], an convergence rate was established for strongly convex costs and random networks. Recently, the work in [40] proposed a variance-reduced decentralized stochastic optimization method with gradient tracking.
Several recent works have shown that distributed methods can perform comparably to centralized algorithms under various conditions. In addition to [24], [25], [26] discussed before, [41], [42] proved that distributed stochastic approximation performs asymptotically as well as centralized schemes by means of a central limit theorem. [43] first showed that a distributed stochastic gradient algorithm asymptotically achieves a convergence rate comparable to a centralized method, but assuming that all the local functions fi have the same minimum. [44], [45] demonstrated the advantage of distributively implementing a stochastic gradient method assuming that sampling times are random and non-negligible. For nonconvex objective functions, [46] proved that decentralized algorithms can achieve a linear speedup similar to a centralized algorithm when k is large enough. This result was generalized to the setting of directed communication networks in [47] for training deep neural networks. The work in [48] considered a distributed stochastic gradient tracking method which performs as well as centralized stochastic gradient descent under a small enough constant stepsize. A recent paper [49] discussed an algorithm that asymptotically performs as well as the best bounds on centralized stochastic gradient descent subject to possible message losses, delays, and asynchrony. In a parallel recent work [50], a similar result was demonstrated with a further compression technique which allowed nodes to save on communication. For more discussion on the topic of achieving asymptotic network independence in distributed stochastic optimization, the readers are referred to a recent survey [51].
B. Main Contribution
We next summarize the main contribution of the paper. First, we begin by performing a non-asymptotic convergence analysis of the distributed stochastic gradient descent (DSGD) method. For strongly convex and smooth objective functions, in expectation, DSGD asymptotically achieves the optimal network-independent convergence rate of centralized stochastic gradient descent (SGD). We explicitly identify the non-asymptotic convergence rate as a function of characteristics of the objective functions and the network. The relevant results are established in Corollary 1 and Theorem 1.
Our main contribution is to characterize the transient time needed for DSGD to approach the asymptotic convergence rate. On the one hand, we show an upper bound of $\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$, where 1 − ρw denotes the spectral gap of the mixing matrix of communicating agents. On the other hand, we construct a “hard” optimization problem for which we show that the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, implying that this upper bound is sharp.
Additionally, we provide numerical experiments that demonstrate the tightness of the theoretical findings. In particular, for both the ring network topology and the square grid network topology, simulations of the on-line ridge regression problem are consistent with the predicted transient times.
C. Notation
Vectors are column vectors unless otherwise specified. Each agent i holds a local copy of the decision vector denoted by , and its value at iteration/time k is written as xi(k). Let
where denotes transpose. Define an aggregate objective function
and let
In addition, we denote
In what follows we write and for short.
The inner product of two vectors a, b is written as ⟨a, b⟩. For two matrices $A, B \in \mathbb{R}^{n \times p}$, let $\langle A, B \rangle := \sum_{i=1}^{n} \langle A_i, B_i \rangle$, where Ai (respectively, Bi) is the i-th row of A (respectively, B). We use $\|\cdot\|$ to denote the 2-norm of vectors and the Frobenius norm of matrices.
A graph consists of a set of vertices (nodes) and a set of edges connecting pairs of vertices. Suppose the agents interact over an undirected graph, i.e., (i, j) is an edge if and only if (j, i) is an edge. Each agent i has a set of neighbors.
Denote the mixing matrix of agents by . Two agents i and j are connected if and only if wij, wji > 0 (wij = wji = 0 otherwise). Formally, we make the following assumption on the communication among agents.
Assumption 2. The graph is undirected and connected (there exists a path between any two nodes). The mixing matrix W is nonnegative and doubly stochastic, i.e., $W\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\intercal W = \mathbf{1}^\intercal$, where $\mathbf{1}$ denotes the vector of all ones.
From Assumption 2, we have the following contraction property of W (see [20]).
Lemma 1. Let Assumption 2 hold, and let ρw denote the spectral norm of the matrix $W - \frac{1}{n}\mathbf{1}\mathbf{1}^\intercal$. Then, ρw < 1 and
for all , where .
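The contraction property can be checked numerically. The sketch below builds a Metropolis mixing matrix on a ring (one standard rule satisfying Assumption 2; the rule itself is not prescribed by Lemma 1) and verifies that it is doubly stochastic with ρw < 1.

```python
import numpy as np

def metropolis_weights(adj):
    # Metropolis rule: w_ij = 1 / (1 + max(d_i, d_j)) for each edge (i, j),
    # with the remaining mass placed on the diagonal. The result is symmetric
    # and doubly stochastic on any undirected graph (Assumption 2).
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j] and i != j:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

n = 8
adj = np.zeros((n, n), dtype=int)
for i in range(n):                           # ring graph
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1

W = metropolis_weights(adj)
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)

# rho_w from Lemma 1: spectral norm of W - (1/n) * 1 1^T.
rho_w = np.linalg.norm(W - np.ones((n, n)) / n, ord=2)
print(rho_w < 1)  # True
```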
The rest of this paper is organized as follows. We present the DSGD algorithm and some preliminary results in Section II. In Section III we prove the sublinear convergence rate of the algorithm. Our main convergence results and a comparison with the centralized stochastic gradient method are in Section IV. Two numerical examples are presented in Section V, and we conclude the paper in Section VI.
II. Distributed Stochastic Gradient Descent
We consider the following DSGD method adapted from DGD and the diffusion strategy [1]: at each step k ≥ 0, every agent i independently performs the update:
| $x_i(k+1) = \sum_{j=1}^{n} w_{ij}\left(x_j(k) - \alpha_k g_j(x_j(k), \xi_j(k))\right)$ | (3) |
where is a sequence of non-increasing stepsizes. The particular choice of the stepsize sequence will be introduced in Section III. The initial vectors xi(0) are arbitrary for all . Since wij = 0 if agent i and agent j are not connected in the network, we can rewrite (3) in the following compact form:
| $\mathbf{x}(k+1) = W\left(\mathbf{x}(k) - \alpha_k G(k)\right)$, where the i-th rows of $\mathbf{x}(k)$ and $G(k)$ are $x_i(k)^\intercal$ and $g_i(x_i(k), \xi_i(k))^\intercal$, respectively | (4) |
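A minimal simulation of the compact DSGD update on hypothetical quadratic costs fi(x) = ½‖x − bi‖² (so the minimizer of the average objective is the mean of the bi); the ring weights, noise level, and 1/k-type stepsize are illustrative choices, not the policy analyzed below.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: n agents with f_i(x) = 0.5 * ||x - b_i||^2, so the
# minimizer of the average objective is mean(b_i); noisy gradients as in (2).
n, p, sigma = 8, 2, 0.1
b = rng.normal(size=(n, p))
x_star = b.mean(axis=0)

W = np.zeros((n, n))                         # lazy weights on a ring
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25

X = np.zeros((n, p))                         # row i is agent i's iterate
for k in range(20000):
    alpha = 2.0 / (k + 20)                   # illustrative 1/k-type stepsize
    G = (X - b) + sigma * rng.normal(size=(n, p))   # noisy local gradients
    X = W @ (X - alpha * G)                  # compact DSGD update

mean_err = np.mean(np.sum((X - x_star) ** 2, axis=1))
print(mean_err)                              # small: all agents near x_star
```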
Throughout the paper, we make the following standing assumption regarding the objective functions fi.2 These assumptions are satisfied for many machine learning problems, such as linear regression, smooth support vector machine (SVM), logistic regression, and softmax regression.
Assumption 3. Each $f_i$ is µ-strongly convex with L-Lipschitz continuous gradients, i.e., for any x, x′,
| $\langle \nabla f_i(x) - \nabla f_i(x'), x - x' \rangle \ge \mu \|x - x'\|^2, \qquad \|\nabla f_i(x) - \nabla f_i(x')\| \le L \|x - x'\|$ | (5) |
Under Assumption 3, Problem (1) has a unique optimal solution , and the following result holds (see [20] Lemma 10).
Lemma 2. For any and α ∈ (0, 2/L), we have
where .
Denote . The following two results are useful for our analysis.
Lemma 3. Under Assumptions 1 and 3, for all k ≥ 0,
where
Proof. By definitions of , and Assumption 1, we have
Notice that from Assumption 3. We have
□
Lemma 4. Under Assumption 3, for all k ≥ 0,
| (6) |
Proof. By definition,
where the last relation follows from Hölder’s inequality. □
A. Preliminary Results
In this section, we present some preliminary results concerning U(k) (the expected optimization error) and V(k) (the expected consensus error). Specifically, we bound the two terms by linear combinations of their values in the last iteration.
For ease of presentation, for all k we denote
In the lemma below, we bound the optimization error U (k +1) by several error terms at iteration k, including the consensus error V (k). It serves as a starting point for the follow-up analysis.
Lemma 5. Suppose Assumptions 1–3 hold. Under Algorithm (4), supposing , we have
| (7) |
Proof. By the definitions of , and relation (4), we have . Hence,
Noting that and from Lemma 3, we obtain
| (8) |
We next bound the first term on the right-hand-side of (8).
where we used the Cauchy-Schwarz inequality. By Lemma 3,
Since , in light of Lemma 2,
Then we have
| (9) |
In light of relation (9), taking full expectation on both sides of relation (8) yields the result. □
The next result is a corollary of Lemma 5 with an additional condition on the stepsize αk. We are able to remove the cross term in relation (7) and obtain a cleaner expression, which facilitates our later analysis.
Lemma 6. Suppose Assumptions 1–3 hold. Under Algorithm (4), supposing , then
| (10) |
Proof. From Lemma 5,
where c > 0 is arbitrary.
Take . Noting that , we have , and . Thus,
□
Since the consensus error term V (k) plays a key role in the statements of Lemma 5 and Lemma 6, we present the following lemma that bounds V (k + 1).
Lemma 7. Suppose Assumptions 1–3 hold. Under Algorithm (4), for all k ≥ 0,
| (11) |
Proof. From relation (4),
we have
Since and ,
and
where the last inequality follows from Assumption 1. Therefore (assuming ρw > 0),
where c > 0 is arbitrary. Letting and noting that by Assumption 3,
we have
Notice that . In light of Lemma 8, taking full expectation on both sides of the above inequality leads to
□
III. Analysis
We are now ready to derive some preliminary convergence results for Algorithm (4). First, we provide a uniform bound on the iterates generated by Algorithm (4) (in expectation) for all k ≥ 0. Then based on the lemmas established in Section II-A, we prove the sublinear convergence rate for Algorithm (4), i.e., and . These results provide the foundation for our main convergence theorems in Section IV.
From now on we consider the following stepsize policy:
| $\alpha_k = \frac{\theta}{\mu(k + K)}$ | (12) |
where constant θ > 1, and
| (13) |
with denoting the ceiling function.
A. Uniform Bound
We first derive a uniform bound on the iterates generated by Algorithm (4) (in expectation) for all k ≥ 0. Such a result is helpful for bounding the error terms on the right hand sides of (7), (10) and (11).
Lemma 8. Suppose Assumptions 1–3 hold. Under Algorithm (4) with stepsize policy (12), for all k ≥ 0, we have
| (14) |
Proof. The following arguments are inspired by those in [35].
First, we bound for all and k ≥ 0. By Assumption 1,
From the strong convexity and Lipschitz continuity of fi, we know that
and
Hence,
Taking full expectation on both sides, it follows that
From the definition of K in (13), for all k ≥ 0. Hence,
| (15) |
Let us define the following set:
| (16) |
which is non-empty and compact. If , we know from inequality (15) that . Otherwise,
Define the last term above as Ri. The previous arguments imply that for all k ≥ 0,
Note that from relation (4),
We have
| (17) |
We further bound Ri as follows. From the definition of ,
Hence,
| (18) |
In light of inequality (18), further noticing that the choice of 0 is arbitrary in the proof of (17), we obtain the uniform bound for in (14). □
The uniform bound provided in Lemma 8 is critical for deriving the sublinear convergence rates of U (k) and V (k), as it holds for all k ≥ 0.
B. Sublinear Rate
With the help of Lemma 6 and Lemma 7 from Section II-A and Lemma 8, we show in Lemma 10 below that Algorithm (4) enjoys the sublinear convergence rate, i.e., and . For the ease of analysis, we define two auxiliary variables:
| (19) |
We first derive uniform upper bounds for U (k) and V (k) respectively for all k ≥ 0 based on Lemma 8. With these bounds, we are able to characterize the constants appearing in the sublinear convergence rates for U (k) and V (k) in Lemma 10 and Lemma 12 respectively.
Lemma 9. Suppose Assumptions 1–3 hold. Under Algorithm (4), we have
| (20) |
Proof. By definitions of U (k), V (k), and Lemma 8, we have
□
Denote an auxiliary counter
| (21) |
Our strategy is to first show that the consensus error of Algorithm (4) decays as based on Lemma 7, since U (k) does not appear explicitly in relation (11).
Lemma 10. Suppose Assumptions 1–3 hold. Let
| (22) |
Under Algorithm (4) with stepsize (12), for all k ≥ K1 − K, we have
where
| (23) |
with
| (24) |
Proof. From Lemma 7 and Lemma 8, for k ≥ 0,
| (25) |
with c1 defined in (24). From the definitions of αk and in (12) and (19) respectively, we know that when k ≥ K,
We now prove the lemma by induction. For k = K1, we know from Lemma 9. Now suppose for some k ≥ K1, then
To show that , it is sufficient to show
or equivalently,
| (26) |
Since , we have
Hence relation (26) is satisfied with . We then have for all k ≥ K1,
Recalling the connection of and V (k), we conclude that for all k ≥ K1 − K. □
To prove the sublinear convergence of U (k), we start with a useful lemma which provides lower and upper bounds for the product of a decreasing sequence. Such products arise in the convergence proof for U (k) and our main convergence results in Section IV.
Lemma 11. For any and 1 < γ ≤ a/2,
Proof. Denote . We first show that . Suppose for some M1 > 0 and k ≥ a. Then,
To see why the last inequality holds, note that . Taking M1 = aγ, we have . The desired relation then holds for all k > a.
Now suppose for some M2 > 0 and k ≥ a. It follows that
where the last inequality follows from (noting that ). Taking , we have . The desired relation then holds for all k > a. □
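The behavior described in Lemma 11 can be sanity-checked numerically: the product of factors (1 − γ/t) over t from a to k tracks (a/k)^γ up to multiplicative constants. The exact constants of the lemma statement are not reproduced here.

```python
import numpy as np

# For 1 < gamma <= a/2, the product of (1 - gamma/t) over t = a, ..., k - 1
# behaves like (a/k)^gamma up to multiplicative constants.
a, gamma, k = 20, 3.0, 5000
prod = np.prod(1.0 - gamma / np.arange(a, k))
ratio = prod / (a / k) ** gamma
print(0.1 < ratio < 10.0)  # True: bounded above and below by constants
```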
In light of Lemma 6 and the other supporting lemmas, we establish the convergence rate of U (k) in the following lemma.
Lemma 12. Suppose Assumptions 1–3 hold. Under Algorithm (4) with stepsize (12), suppose θ > 2. We have
for all k ≥ K1 − K, where
| (27) |
Proof. In light of Lemma 6 and Lemma 8, for all k ≥ 0, we have
Recalling the definitions of and , for all k ≥ K,
Therefore,
From Lemma 11,
In light of Lemma 10, when k ≥ K1, . Hence,
However, we have for any b > a ≥ K1,
where the last inequality comes from the fact that (given that b > K1 ≥ 4θ), and
Hence, for all k ≥ K1,
Recalling Lemma 9 and the definition of yields the desired result. □
Remark 1. Notice that the convergence rate established in Lemma 12 is not asymptotically the same as centralized stochastic gradient descent, since the constant c2 contains information about the initial solutions. In the next section, we will improve the convergence result and show that DSGD indeed performs as well as centralized SGD asymptotically.
IV. Main Results
In this section, we perform a non-asymptotic analysis of network independence for Algorithm (4). Specifically, in Theorem 1, we show that
where the first term is network independent and the second and third (higher-order) terms depends on . Then we compare the result with centralized stochastic gradient descent and show that asymptotically, the two methods have the same convergence rate . In addition, it takes time for Algorithm (4) to reach this asymptotic rate of convergence. Finally, we construct a “hard” optimization problem for which we show the transient time KT is sharp.
Our first step is to simplify the presentation of the convergence results in Lemma 10 and Lemma 12, so that we can utilize them for deriving improved convergence rates conveniently. For this purpose, we first estimate the constants , , c1 and c2 appearing in the two lemmas and derive their dependency on the network size n, the spectral gap , the summation of initial optimization errors , and , where the last term can be seen as a measure of the difference among each agent’s individual cost functions.
Lemma 13. Denote and . Then,
Proof. We first estimate the constant which appears in the definition (23) for . From Lemma 8,
From the definition of c1 in (24),
Noting that , by definition,
From the definition of c2 in (27), we have
□
In light of Lemma 13, the convergence result of V (k) given in Lemma 10 can be easily simplified since is the only constant. Regarding the optimization error U (k), in light of Lemma 8, Lemma 10, Lemma 12 and Lemma 13, we have the following corollary which simplifies the presentation of the convergence result in Lemma 12.
Corollary 1. Suppose Assumptions 1–3 hold. Under Algorithm (4) with stepsize (12) and assuming θ > 2, when k ≥ K1 − K,
where
Proof. From Lemma 12 and Lemma 13, when ,
| (28) |
□
To measure the performance of DSGD, we consider the average optimization error over the agents. In the following theorem, we improve the result of Corollary 1 with further analysis and derive the main convergence result for Algorithm (4).
Theorem 1. Suppose Assumptions 1–3 hold. Under Algorithm (4) with stepsize (12) and assuming θ > 2, when k ≥ K1 − K,
| (29) |
Proof. For k ≥ K1 −K, in light of Lemma 2 and Lemma 5,
where the second inequality follows from the Cauchy-Schwarz inequality (see [52], page 62) and the fact that .
Recalling the definitions of and , when k ≥ K1,
Therefore, by denoting and , we have
From Lemma 11,
Hence, by Corollary 1,
Since
we have
Notice that and . Following a discussion similar to that in the proofs for Lemma 12 and Corollary 1, we have
Noting that
and , in light of the bound on V(k) in Lemma 10 and the estimate of in Lemma 13, we obtain the desired result. □
A. Comparison with Centralized Implementation
We compare the performance of DSGD and centralized stochastic gradient descent (SGD) stated below:
| $x(k+1) = x(k) - \frac{\alpha_k}{n}\sum_{i=1}^{n} g_i(x(k), \xi_i(k))$ | (30) |
where the noisy gradients gi(x(k), ξi(k)) satisfy Assumption 1.
First, we derive the convergence rate for SGD which matches the optimal rate for such stochastic gradient methods (see [29], [53]). Our result relies on an analysis different from that in the literature, which considered a compact feasible set and uniformly bounded stochastic gradients in expectation.
Theorem 2. Under the centralized stochastic gradient descent method (30), suppose . We have
Proof. Noting that αk ≤ 1/L when k ≥ K2, we have from Lemma 3 that
| (31) |
It can be shown first that for k ≥ K2, where .3 Denote . Then from relation (31), when k ≥ K2,
From Lemma 11,
□
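The 1/(nk)-type behavior of the centralized recursion can be observed in a toy run: averaging n noisy gradients per step lowers the error floor by roughly a factor of n. The quadratic objective, stepsize, and horizon below are illustrative assumptions, not the setting of Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(4)

def final_error(n, iters=2000, sigma=1.0, runs=500):
    # Centralized SGD on f(x) = 0.5 * x^2 (mu = L = 1), averaging n noisy
    # gradients per step; returns E||x(k) - x*||^2 estimated over `runs`.
    x = np.zeros(runs)
    for k in range(iters):
        alpha = 2.0 / (k + 10)               # illustrative stepsize
        g_bar = x + sigma * rng.normal(size=runs) / np.sqrt(n)
        x = x - alpha * g_bar
    return np.mean(x ** 2)

# Averaging n gradients reduces the error floor by roughly a factor of n,
# matching the 1/(nk)-type rate.
e1, e25 = final_error(1), final_error(25)
print(e1 / e25)                              # roughly 25
```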
Comparing the results of Theorem 1 and Theorem 2, we can see that asymptotically, DSGD and SGD have the same convergence rate $\mathcal{O}\left(\frac{1}{nk}\right)$. The next corollary identifies the time needed for DSGD to achieve this rate.
Corollary 2 (Transient Time). Suppose Assumptions 1–3 hold. Assume in addition that and . It takes $K_T = \mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$ time for Algorithm (4) to reach the asymptotic rate of convergence, i.e., when k ≥ KT, we have .
Proof. From (29),
Let KT be such that
We then obtain that
□
Remark 2. By assuming the additional conditions and , motivated by the observation that each of these expressions is the sum of n terms, we obtain a cleaner expression for the transient time. In general, we would obtain .
Remark 3. For general connected networks such as line graphs, if we adopt the Lazy Metropolis rule for choosing the weights [wij] (see [54]), then $\frac{1}{1-\rho_w} = \mathcal{O}(n^2)$, and hence $K_T = \mathcal{O}(n^5)$. The transient time can be improved for networks with special structures. For example, $\frac{1}{1-\rho_w}$ is constant with high probability for an Erdős–Rényi random graph, and consequently $K_T = \mathcal{O}(n)$ on such a graph.
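The spectral-gap scaling invoked in Remark 3 can be checked directly. The sketch below uses lazy Metropolis weights on a ring (which behaves like the line graph in this respect): quadrupling the number of nodes shrinks 1 − ρw by roughly a factor of 16, consistent with 1/(1 − ρw) = O(n²).

```python
import numpy as np

def lazy_metropolis_ring(n):
    # Lazy Metropolis weights on a ring: each neighbor gets
    # 1 / (2 * max(d_i, d_j)) = 1/4, and the rest stays on the diagonal.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25
        W[i, i] = 0.5
    return W

def spectral_gap(W):
    n = W.shape[0]
    return 1.0 - np.linalg.norm(W - np.ones((n, n)) / n, ord=2)

g8 = spectral_gap(lazy_metropolis_ring(8))
g32 = spectral_gap(lazy_metropolis_ring(32))
print(g8 / g32)                              # close to (32/8)^2 = 16
```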
The next theorem states that the transient time for DSGD to reach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$; that is, under Assumptions 1–3 and assuming and , there exists an optimization problem whose transient time under DSGD is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$. This implies the result in Corollary 2 is sharp and cannot be improved in general.
Theorem 3. Suppose Assumptions 1–3 hold. Assume in addition that and . Then there exists a ρ0 ∈ (0, 1) such that if ρw ≥ ρ0, then the time needed for DSGD to reach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$.
Proof. We construct a “hard” optimization problem to prove the claimed result, inspired by [31]. Consider quadratic objective functions , where x, . The optimal solution to Problem (1) is given by . The DSGD algorithm implements:
| (32) |
where , and n(k) denotes the vector of gradient noise terms. From (12), and since µ = L = 1, we use the stepsize . We rewrite (32) as
It follows that
By induction, we have for all k > 0,
| (33) |
Assume that: (i) the matrix W is symmetric; (ii) Wx* = ρwx*, i.e., x* is an eigenvector of W w.r.t. eigenvalue ρw (hence ); (iii); (iv) x(0) = x*.4 Then , and from relation (33) it follows
| (34) |
where ϵ(k) captures the random perturbation caused by gradient noise that has mean zero. Therefore,
Recalling the definition and , and noticing that , we have
where we invoked Lemma 11 for the second inequality. Then,
| (35) |
Note that when ,
From (35),
where the equality is obtained from the Taylor expansion of ln ρw when ρw → 1. Since
setting this to be at most , we obtain that the transient time for DSGD to reach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, based on an argument similar to that of Corollary 2. □
V. Numerical Examples
In this section, we provide two numerical examples to verify and complement our theoretical findings.
A. Ridge Regression
Consider the on-line ridge regression problem, i.e.,
| (36) |
where ρ > 0 is a penalty parameter. Each agent i continuously collects data samples of the form (ui, vi), where the ui represent features and the vi are the observed outputs. Assume each ui ∈ [−0.5, 0.5]p is uniformly distributed, and vi is drawn according to , where are predefined parameters evenly located in [0, 10]p, and εi are independent Gaussian random variables (noise) with mean 0 and variance 0.01. Given a pair (ui, vi), agent i can compute an (unbiased) estimate of the gradient of its local objective. Problem (36) has a unique solution x* given by
| (37) |
Suppose p = 10 and ρ = 1. We compare the performance of DSGD (3) and the centralized implementation (30) for solving problem (36) with the same stepsize policy αk = 20/(k +20), ∀k, and the same initial solutions: xi(0) = 0, ∀i, (DSGD) and x(0) = 0 (SGD). It can be seen from (37) and the definition of that . Moreover, . Therefore, we have .
In Fig. 1, we provide an illustrative example that compares the performance of DSGD and SGD, assuming n = 25. For DSGD, we consider two different network topologies: the ring network topology shown in Fig. 2(a) and the square grid network topology shown in Fig. 3(a). For both network topologies, we use Metropolis weights for constructing the mixing matrix W (see [55]). It can be seen that DSGD performs asymptotically as well as SGD, while the time it takes for DSGD to catch up with SGD depends on the network topology. For grid networks, which are better connected than rings, the corresponding transient time is shorter.
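A scaled-down sketch of this experiment (smaller n and p, and a shorter illustrative stepsize than the αk = 20/(k + 20) used in the text; the loss fi(x) = E[(ui⊤x − vi)²] + ρ‖x‖² and the resulting closed-form x* are our assumed reading of (36)–(37)):

```python
import numpy as np

rng = np.random.default_rng(2)

n, p, rho = 16, 3, 1.0
# Agents' parameters evenly located in [0, 10]^p (all coordinates equal here).
x_tilde = np.linspace(0.0, 10.0, n)[:, None] * np.ones((1, p))
# Closed-form minimizer under the assumed loss
# f_i(x) = E[(u_i^T x - v_i)^2] + rho * ||x||^2 with u_i ~ Unif[-0.5, 0.5]^p,
# using E[u u^T] = I/12.
x_star = x_tilde.mean(axis=0) / (1.0 + 12.0 * rho)

def grad(x, i):
    # Unbiased gradient estimate from one fresh sample (u_i, v_i).
    u = rng.uniform(-0.5, 0.5, size=p)
    v = u @ x_tilde[i] + rng.normal(0.0, 0.1)
    return 2.0 * u * (u @ x - v) + 2.0 * rho * x

W = np.zeros((n, n))                         # lazy weights on a ring
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25

X = np.zeros((n, p))
for k in range(20000):
    alpha = 1.0 / (k + 10)                   # illustrative stepsize
    G = np.stack([grad(X[i], i) for i in range(n)])
    X = W @ (X - alpha * G)

print(np.mean(np.sum((X - x_star) ** 2, axis=1)))   # small average error
```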
Fig. 1.

The performance comparison between DSGD and SGD for online ridge regression (n = 25). The results are averaged over 200 Monte Carlo simulations.
Fig. 2.

Comparison of the transient times for DSGD and as a function of the network size n for the ring network topology. The expected errors are approximated by averaging over 200 simulation results.
To further verify the conclusions of Corollary 2 and Theorem 3, we define the transient time for DSGD as inf . For DSGD, we first assume a ring network topology and plot the transient times for DSGD and as a function of the network size n in Fig. 2(b). We then consider a square grid network topology as shown in Fig. 3(a) and plot the transient times for DSGD and in Fig. 3(b). It can be seen that in each of Fig. 2(b) and Fig. 3(b), the two curves are close to each other. This verifies the sharpness of Corollary 2.
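A practical way to extract a transient time from two empirical error curves is sketched below on synthetic curves; the stopping rule (DSGD error within a constant factor of the SGD error from some iteration onward) is a proxy for the inf-based definition in the text, not that definition itself.

```python
import numpy as np

def transient_time(dsgd_err, sgd_err, factor=2.0):
    # First index k such that dsgd_err[j] <= factor * sgd_err[j] for every
    # j >= k; returns None if the tail condition never holds.
    ok = dsgd_err <= factor * sgd_err
    tail_ok = np.logical_and.accumulate(ok[::-1])[::-1]
    idx = int(np.argmax(tail_ok))
    return idx if tail_ok[idx] else None

# Synthetic curves: SGD decays as 1/k; DSGD carries an extra higher-order term.
k = np.arange(1, 1001, dtype=float)
sgd = 1.0 / k
dsgd = 1.0 / k + 5.5 / k ** 1.5

# The extra term is within a factor 2 of the SGD error once 5.5/k^1.5 <= 1/k,
# i.e., k >= 5.5^2 = 30.25.
print(k[transient_time(dsgd, sgd)])          # 31.0
```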
Fig. 3.

Comparison of the transient times for DSGD and as a function of the network size n for the square grid network topology (n = 4, 9, 16, 25, 36, 49, 64, 81, 100). The expected errors are approximated by averaging over 200 simulation results.
B. Logistic Regression
Consider the problem of classification on the MNIST dataset of handwritten digits (http://yann.lecun.com/exdb/mnist/). In particular, we classify digits 1 and 2 using logistic regression.5 There are 12700 data points in total where each data point is a pair (u, v) with being the image input and v ∈ {0, 1} being the label.6
Suppose each agent possesses a distinct local dataset that is randomly taken from the database. To apply logistic regression for classification, we solve the following optimization problem based on all the agents’ local datasets:
| (38) |
where
where λ is the regularization parameter.7 Given any solution x, agent i is able to compute an unbiased estimate of using one (or a minibatch of) randomly chosen data point (ui, vi) from , that is,
In the experiments, suppose each local dataset contains 50 data points, and λ = 1. At each iteration of the DSGD algorithm, agent i computes a stochastic gradient of fi(xi(k)) with one randomly chosen data point from . We compare the performance of DSGD (3) and centralized SGD (30) for solving problem (38) with the same stepsize policy αk = 6/(k + 20), ∀k, and the same initial solutions: xi(0) = 0, ∀i, (DSGD) and x(0) = 0 (SGD). It can be numerically verified that and .
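The stochastic gradient used by each agent can be sketched as follows; since the exact form of the regularized logistic loss in (38) is elided above, the {0, 1}-label cross-entropy with an λ‖x‖² term below is an assumption, verified here against a finite-difference check.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, u, v, lam):
    # Assumed regularized logistic loss for one sample, with labels v in {0, 1}.
    z = u @ x
    return np.log1p(np.exp(z)) - v * z + lam * (x @ x)

def stochastic_grad(x, u, v, lam):
    # Unbiased gradient estimate of the local objective from one sample (u, v).
    return u * (sigmoid(u @ x) - v) + 2.0 * lam * x

# Finite-difference check of the gradient formula.
p, lam = 5, 1.0
x, u, v = rng.normal(size=p), rng.normal(size=p), 1.0
g = stochastic_grad(x, u, v, lam)
eps = 1e-6
fd = np.array([(loss(x + eps * e, u, v, lam) - loss(x - eps * e, u, v, lam))
               / (2 * eps) for e in np.eye(p)])
print(np.allclose(g, fd, atol=1e-4))         # True
```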
The transient time for DSGD is defined in the same way as in the ridge regression example. In Fig. 4 and Fig. 5, we plot the transient times for DSGD as a function of the network size n for ring and grid networks, respectively. We find that the curves are close to , rather than a multiple of , implying that the experimental results are better than the theoretically derived worst-case performance given in Corollary 2. Hence in practice, the performance of the DSGD algorithm depends on the specific problem instances and can be better than the worst-case situation in terms of transient times.
Fig. 4.

Comparison of the transient times for DSGD and as a function of the network size n for the ring network topology. The expected errors are approximated by averaging over 200 simulation results.
Fig. 5.

Comparison of the transient times for DSGD and as a function of the network size n for the grid network topology (n = 4, 9, 16, 25, 36, 49, 64, 81, 100). The expected errors are approximated by averaging over 200 simulation results.
VI. Conclusions
This paper is devoted to the non-asymptotic analysis of network independence for the distributed stochastic gradient descent (DSGD) method. We show that in expectation, the algorithm asymptotically achieves the optimal network-independent convergence rate of centralized SGD, and we identify the non-asymptotic convergence rate as a function of characteristics of the objective functions and the network. In addition, we compute the time needed for DSGD to reach its asymptotic rate of convergence and prove the sharpness of the obtained result. Future work will consider more general problems such as nonconvex objectives and constrained optimization. It will also be of interest to explore the transient times of asynchronous distributed stochastic gradient algorithms, which enjoy greater flexibility and less communication overhead.
Acknowledgments
This work was partially supported by the NSF under grants IIS-1914792, DMS-1664644, CNS-1645681, and ECCS-1933027, by the ONR under MURI grant N00014-19-1-2571, by the NIH under grants R01 GM135930 and UL54 TR004130, by the DOE under grants DE-AR-0001282 and DE-EE0009696, by the Boston University Kilachand Fund for Integrated Life Science and Engineering, by the Shenzhen Research Institute of Big Data (SRIBD) under grant J00120190011, and by the NSFC under grant 62003287.
Biographies

Shi Pu is currently an assistant professor in the School of Data Science, The Chinese University of Hong Kong, Shenzhen, China. He is also affiliated with Shenzhen Research Institute of Big Data. He received a B.S. Degree from Peking University, in 2012, and a Ph.D. Degree in Systems Engineering from the University of Virginia, in 2016. He was a postdoctoral associate at the University of Florida, from 2016 to 2017, a postdoctoral scholar at Arizona State University, from 2017 to 2018, and a postdoctoral associate at Boston University, from 2018 to 2019. His research interests include distributed optimization, network science, machine learning, and game theory.

Alex Olshevsky received the B.S. degrees in applied mathematics and electrical engineering from Georgia Tech and the Ph.D. degree in EECS from MIT. He is currently an Associate Professor in the ECE department at Boston University. Dr. Olshevsky is a recipient of the NSF CAREER Award, the AFOSR Young Investigator Award, the INFORMS Prize for the best paper on the interface of operations research and computer science, a SIAM Award for the annual paper from the SIAM Journal on Control and Optimization chosen for reprinting in SIAM Review, and an IMIA award for the best paper on clinical informatics.

Ioannis Ch. Paschalidis (M’96–SM’06–F’14) received a Ph.D. in EECS from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 1996. He is a Professor at Boston University, Boston, MA and the Director of the Center for Information and Systems Engineering. His research interests lie in the fields of systems and control, optimization, stochastic systems, machine learning, and computational biology and medicine. He is a recipient of the NSF CAREER award, several best paper and best algorithmic performance awards, and a 2014 IBM/IEEE Smarter Planet Challenge Award. He was an invited participant at the 2002 Frontiers of Engineering Symposium, organized by the U.S. National Academy of Engineering and the 2014 U.S. National Academies Keck Futures Initiative (NAFKI) Conference. During 2013–2019 he was the founding Editor-in-Chief of the IEEE Transactions on Control of Network Systems.
Footnotes
Note that in [1] this method was called “Adapt-then-Combine”.
The assumption can be generalized to the case where the agents have different µ and L.
The argument here is similar to that in the proof for Lemma 12.
Assumptions (iii) and (iv) correspond to the conditions and assumed in the main results such as Theorem 1 and Corollary 2.
The problem can be extended to classifying all 10 handwritten digits with multinomial logistic regression.
Digit 1 is represented by label 0 and digit 2 is represented by label 1.
The obtained optimal solution x* of problem (38) can then be used for predicting the label for any image input u through the decision function .
Contributor Information
Shi Pu, School of Data Science, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China.
Alex Olshevsky, Department of Electrical and Computer Engineering and the Division of Systems Engineering, Boston University, Boston, MA.
Ioannis Ch. Paschalidis, Department of Electrical and Computer Engineering and the Division of Systems Engineering, Boston University, Boston, MA.
References
- [1]. Chen J and Sayed AH, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, 2012.
- [2]. Forrester AI, Sóbester A, and Keane AJ, “Multi-fidelity optimization via surrogate modelling,” Proceedings of the Royal Society of London A, vol. 463, no. 2088, pp. 3251–3269, 2007.
- [3]. Nedić A, Olshevsky A, and Uribe CA, “Fast convergence rates for distributed non-Bayesian learning,” IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 5538–5553, 2017.
- [4]. Cohen K, Nedić A, and Srikant R, “On projected stochastic gradient descent algorithm with weighted averaging for least squares regression,” IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 5974–5981, 2017.
- [5]. Baingana B, Mateos G, and Giannakis GB, “Proximal-gradient algorithms for tracking cascades over social networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 4, pp. 563–575, 2014.
- [6]. Ying B, Yuan K, and Sayed AH, “Supervised learning under distributed features,” IEEE Transactions on Signal Processing, vol. 67, no. 4, pp. 977–992, 2018.
- [7]. Alghunaim SA and Sayed AH, “Distributed coupled multi-agent stochastic optimization,” IEEE Transactions on Automatic Control, 2019.
- [8]. Brisimi TS, Chen R, Mela T, Olshevsky A, Paschalidis IC, and Shi W, “Federated learning of predictive models from federated electronic health records,” International Journal of Medical Informatics, vol. 112, pp. 59–67, 2018.
- [9]. Cohen K, Nedić A, and Srikant R, “Distributed learning algorithms for spectrum sharing in spatial random access wireless networks,” IEEE Transactions on Automatic Control, vol. 62, no. 6, pp. 2854–2869, 2017.
- [10]. Mateos G and Giannakis GB, “Distributed recursive least-squares: Stability and performance analysis,” IEEE Transactions on Signal Processing, vol. 60, no. 7, pp. 3740–3754, 2012.
- [11]. Reisizadeh A, Mokhtari A, Hassani H, and Pedarsani R, “An exact quantized decentralized gradient descent algorithm,” IEEE Transactions on Signal Processing, vol. 67, no. 19, pp. 4934–4947, 2019.
- [12]. Tsitsiklis J, Bertsekas D, and Athans M, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.
- [13]. Nedić A and Ozdaglar A, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [14]. Nedić A, Ozdaglar A, and Parrilo PA, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
- [15]. Lobel I, Ozdaglar A, and Feijer D, “Distributed multi-agent optimization with state-dependent communication,” Mathematical Programming, vol. 129, no. 2, pp. 255–284, 2011.
- [16]. Jakovetić D, Xavier J, and Moura JM, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
- [17]. Kia SS, Cortés J, and Martínez S, “Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication,” Automatica, vol. 55, pp. 254–264, 2015.
- [18]. Shi W, Ling Q, Wu G, and Yin W, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [19]. Di Lorenzo P and Scutari G, “NEXT: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
- [20]. Qu G and Li N, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, 2017.
- [21]. Nedić A, Olshevsky A, and Shi W, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
- [22]. Xu J, Zhu S, Soh YC, and Xie L, “Convergence of asynchronous distributed gradient methods over stochastic networks,” IEEE Transactions on Automatic Control, vol. 63, no. 2, pp. 434–448, 2017.
- [23]. Pu S, Shi W, Xu J, and Nedić A, “Push-pull gradient methods for distributed optimization in networks,” IEEE Transactions on Automatic Control, 2020.
- [24]. Chen J and Sayed AH, “On the limiting behavior of distributed optimization strategies,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2012, pp. 1535–1542.
- [25]. ——, “On the learning behavior of adaptive networks—Part I: Transient analysis,” IEEE Transactions on Information Theory, vol. 61, no. 6, pp. 3487–3517, 2015.
- [26]. ——, “On the learning behavior of adaptive networks—Part II: Performance analysis,” IEEE Transactions on Information Theory, vol. 61, no. 6, pp. 3518–3548, 2015.
- [27]. Robbins H and Monro S, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
- [28]. Kiefer J and Wolfowitz J, “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952.
- [29]. Nemirovski A, Juditsky A, Lan G, and Shapiro A, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
- [30]. Srivastava K and Nedić A, “Distributed asynchronous constrained stochastic optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 772–790, 2011.
- [31]. Duchi JC, Agarwal A, and Wainwright MJ, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
- [32]. Bianchi P and Jakubowicz J, “Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization,” IEEE Transactions on Automatic Control, vol. 58, no. 2, pp. 391–405, 2013.
- [33]. Towfic ZJ and Sayed AH, “Adaptive penalty-based distributed stochastic convex optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 15, pp. 3924–3938, 2014.
- [34]. Chatzipanagiotis N and Zavlanos MM, “A distributed algorithm for convex constrained optimization under noise,” IEEE Transactions on Automatic Control, vol. 61, no. 9, pp. 2496–2511, 2016.
- [35]. Nedić A and Olshevsky A, “Stochastic gradient-push for strongly convex functions on time-varying directed graphs,” IEEE Transactions on Automatic Control, vol. 61, no. 12, pp. 3936–3947, 2016.
- [36]. Sayin MO, Vanli ND, Kozat SS, and Başar T, “Stochastic subgradient algorithms for strongly convex optimization over distributed networks,” IEEE Transactions on Network Science and Engineering, vol. 4, no. 4, pp. 248–260, 2017.
- [37]. Lan G, Lee S, and Zhou Y, “Communication-efficient algorithms for decentralized and stochastic optimization,” Mathematical Programming, pp. 1–48, 2017.
- [38]. Sirb B and Ye X, “Decentralized consensus algorithm with delayed and stochastic gradients,” SIAM Journal on Optimization, vol. 28, no. 2, pp. 1232–1254, 2018.
- [39]. Jakovetić D, Bajović D, Sahu AK, and Kar S, “Convergence rates for distributed stochastic optimization over random networks,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 4238–4245.
- [40]. Xin R, Khan UA, and Kar S, “Variance-reduced decentralized stochastic optimization with gradient tracking,” arXiv preprint arXiv:1909.11774, 2019.
- [41]. Morral G, Bianchi P, and Fort G, “Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks,” in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 1476–1481.
- [42]. ——, “Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks,” IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2798–2813, 2017.
- [43]. Towfic ZJ, Chen J, and Sayed AH, “Excess-risk of distributed stochastic learners,” IEEE Transactions on Information Theory, vol. 62, no. 10, pp. 5753–5785, 2016.
- [44]. Pu S and Garcia A, “A flocking-based approach for distributed stochastic optimization,” Operations Research, vol. 1, pp. 267–281, 2018.
- [45]. ——, “Swarming for faster convergence in stochastic optimization,” SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2997–3020, 2018.
- [46]. Lian X, Zhang C, Zhang H, Hsieh C-J, Zhang W, and Liu J, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 5336–5346.
- [47]. Assran M, Loizou N, Ballas N, and Rabbat M, “Stochastic gradient push for distributed deep learning,” in International Conference on Machine Learning. PMLR, 2019, pp. 344–353.
- [48]. Pu S and Nedić A, “Distributed stochastic gradient tracking methods,” Mathematical Programming, vol. 187, no. 1, pp. 409–457, 2021.
- [49]. Spiridonoff A, Olshevsky A, and Paschalidis IC, “Robust asynchronous stochastic gradient-push: Asymptotically optimal and network-independent performance for strongly convex functions,” Journal of Machine Learning Research, vol. 21, no. 58, pp. 1–47, 2020.
- [50]. Koloskova A, Stich S, and Jaggi M, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in International Conference on Machine Learning, 2019, pp. 3478–3487.
- [51]. Pu S, Olshevsky A, and Paschalidis IC, “Asymptotic network independence in distributed stochastic optimization for machine learning: Examining distributed and centralized stochastic gradient descent,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 114–122, 2020.
- [52]. Williams D, Probability with Martingales. Cambridge University Press, 1991.
- [53]. Rakhlin A, Shamir O, and Sridharan K, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012, pp. 1571–1578.
- [54]. Olshevsky A, “Linear time average consensus and distributed optimization on fixed graphs,” SIAM Journal on Control and Optimization, vol. 55, no. 6, pp. 3990–4014, 2017.
- [55]. Nedić A, Olshevsky A, and Rabbat MG, “Network topology and communication-computation tradeoffs in decentralized optimization,” Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.
