Hybrid-DCA: A double asynchronous approach for stochastic dual coordinate ascent

Soumitra Pal; Tingyang Xu; Tianbao Yang; Sanguthevar Rajasekaran; Jinbo Bi

doi:10.1016/j.jpdc.2020.04.002

Author manuscript; available in PMC: 2020 Sep 1.

Published in final edited form as: J Parallel Distrib Comput. 2020 Apr 13;143:47–66. doi: 10.1016/j.jpdc.2020.04.002

PMCID: PMC7375401 NIHMSID: NIHMS1596845 PMID: 32699464

Abstract

In prior works, stochastic dual coordinate ascent (SDCA) has been parallelized in a multi-core environment where the cores communicate through shared memory, or in a multi-processor distributed memory environment where the processors communicate through message passing. In this paper, we propose a hybrid SDCA framework for multi-core clusters, the most common high performance computing environment that consists of multiple nodes each having multiple cores and its own shared memory. We distribute data across nodes where each node solves a local problem in an asynchronous parallel fashion on its cores, and then the local updates are aggregated via an asynchronous across-node update scheme. The proposed double asynchronous method converges to a global solution for L-Lipschitz continuous loss functions, and at a linear convergence rate if a smooth convex loss function is used. Extensive empirical comparison has shown that our algorithm scales better than the best known shared-memory methods and runs faster than previous distributed-memory methods. Big datasets, such as one of 280 GB from the LIBSVM repository, cannot be accommodated on a single node and hence cannot be solved by a parallel algorithm. For such a dataset, our hybrid algorithm takes less than 30 seconds to achieve a duality gap of 10⁻⁵ on 16 nodes each using 12 cores, which is significantly faster than the best known distributed algorithms, such as CoCoA+, that take more than 160 seconds on 16 nodes.

Keywords: Dual coordinate descent, Distributed computing, Optimization

1. Introduction

The immense growth of data has made it important to efficiently solve large scale machine learning problems. It is necessary to take advantage of modern high performance computing (HPC) environments such as multi-core settings where the cores communicate through shared memory, or multi-processor distributed memory settings where the processors communicate by passing messages. In particular, a large class of supervised learning formulations, including support vector machines (SVMs), logistic regression, ridge regression and many others, solve the following generic regularized risk minimization (RRM) problem: given a set of instance-label pairs of data points (x_i,y_i), i = 1, …, n,

\min_{w \in ℝ^{d}} P (w) : = \frac{1}{n} \sum_{i = 1}^{n} ϕ (x_{i}^{⊺} w; y_{i}) + \frac{λ}{2} g (w),

(1)

where $y_{i} \in R$ is the label for the data point $x_{i} \in R^{d}$ , $w \in R^{d}$ is the linear predictor to be optimized, ϕ is a loss function that is convex with respect to its first argument, λ is a regularization parameter that balances between the loss and a regularizer g(w), which for instance can take the squared ℓ₂-norm ${∥ w ∥}_{2}^{2}$ .

Many efficient sequential algorithms have been developed in the past decades to solve (1), e.g., stochastic gradient descent (SGD) [25] and the alternating direction method of multipliers (ADMM) [2]. In particular, the (stochastic) dual coordinate ascent (DCA) algorithm [18] has been one of the most widely used algorithms for solving (1). It efficiently optimizes the following dual formulation:

\max_{α \in ℝ^{n}} D (α) : = - \frac{1}{n} \sum_{i = 1}^{n} ϕ^{*} (- α_{i}) - \frac{λ}{2} g^{*} (\frac{1}{λ n} X α),

(2)

where $X = [x_{1}, x_{2}, \dots, x_{n}] \in R^{d \times n}$ , ϕ*(u) and g*(v) are the convex conjugates of ϕ(z;y) (or in short ϕ(z) where the scalar z = x^Tw) and g(w), respectively. The conjugate of the loss function ϕ(z) is defined as ϕ*(u) = max_z(zu − ϕ(z)). Let ∇g*(v) be the gradient of g* with respect to v where $v (α) = \frac{1}{λ n} X α$ . We have

w (α) = \nabla g^{*} (v) .

(3)

It is known from duality theory that if α* is an optimal dual solution, then the vector w* = w(α*) is an optimal primal solution and P(w*) = D(α*). The dual objective has a separate dual variable α_i associated with each training data point x_i. The stochastic DCA updates dual variables, one at a time, while maintaining the primal variables by calculating (3) from the dual variables.
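To make the primal-dual bookkeeping concrete, the following is a minimal sequential sketch, not the authors' implementation. It assumes the squared loss ϕ(z; y) = ½(z − y)² and the regularizer (λ/2)‖w‖², for which w(α) = Xα/(λn) and the coordinate maximizer has a closed form. The function names `sdca_ridge`, `primal`, and `dual` are hypothetical.

```python
import random

def sdca_ridge(X, y, lam, epochs=50, seed=0):
    """Sequential SDCA sketch for phi(z; y) = 0.5*(z - y)**2 and (lam/2)*||w||^2.

    X is a list of n feature lists (one per data point). Returns (w, alpha).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d                          # maintained as w = X alpha / (lam * n)
    sq_norm = [sum(v * v for v in x) for x in X]
    for _ in range(epochs * n):
        i = rng.randrange(n)               # pick one dual coordinate at random
        margin = sum(wj * xj for wj, xj in zip(w, X[i]))
        # Closed-form maximizer of the dual along coordinate i (squared loss only).
        step = (y[i] - margin - alpha[i]) / (1.0 + sq_norm[i] / (lam * n))
        alpha[i] += step
        for j in range(d):                 # keep the primal vector in sync
            w[j] += step * X[i][j] / (lam * n)
    return w, alpha

def primal(X, y, lam, w):
    n = len(X)
    loss = sum(0.5 * (sum(wj * xj for wj, xj in zip(w, x)) - yi) ** 2
               for x, yi in zip(X, y))
    return loss / n + 0.5 * lam * sum(wj * wj for wj in w)

def dual(X, y, lam, alpha):
    n, d = len(X), len(X[0])
    v = [sum(alpha[i] * X[i][j] for i in range(n)) / (lam * n) for j in range(d)]
    conj = sum(a * yi - 0.5 * a * a for a, yi in zip(alpha, y))  # -phi*(-alpha_i)
    return conj / n - 0.5 * lam * sum(vj * vj for vj in v)
```

The difference `primal(X, y, lam, w) - dual(X, y, lam, alpha)` is the duality gap; it is nonnegative and shrinks toward zero as the dual variables approach an optimum.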

Recently, many efforts have been undertaken to solve Problem (1) in a distributed or parallel framework. It has been shown that distributed DCA algorithms have comparable and sometimes even better convergence than SGD-based or ADMM-based distributed algorithms [23]. The distributed DCA algorithms can be grouped into two sets. The first set contains synchronous algorithms in which a random dual variable is updated by each processor and the primal variables are synchronized across the processors in every iteration [8,11,23]. This approach incurs a large communication overhead. The second set of algorithms avoids communication overhead by exploiting the shared memory in a multi-core setting [7] where the primal variables are stored in a primary memory shared across all the processors. Further speedups have been obtained by using (asynchronous) atomic memory operations instead of costly locks for shared memory updates [7,16]. Nevertheless, this approach is difficult to scale up for big datasets that cannot be fully accommodated in the shared memory. This leads to a challenging question: how do we scale up the asynchronous shared memory approach for big data while maintaining the speed up?

We address this challenge by proposing and implementing a hybrid strategy. Modern HPC platforms can be viewed as a collection of K nodes interconnected through a network as shown in Fig. 1(a). Each node contains a memory shared among R processing cores. Our strategy exploits this architecture by equally distributing the data across the local shared memory of the K nodes. Each of the R cores within a node runs a computing thread that asynchronously updates a random dual variable from those associated with the data allocated to the node. Each node also runs a communicating thread. One of the communicating threads is designated as a master and the rest are workers. After every round of H local iterations in each computing thread, each worker thread sends the local update to the master. After accumulating the local updates from S of the K workers, the master broadcasts the global update to the contributing workers. However, to keep a slower worker from falling too far behind, the master ensures that in every Γ consecutive global updates there is at least one local update from each worker. Fig. 1(b) shows how our scheme is a generalization of the existing approaches: for K = 1, our setup coincides with the shared memory multi-core setting [7] and for R = 1, S = K our setup coincides with the synchronous algorithms in the distributed memory setting [8,11,23]. With a proper adjustment of the parameters H, S, Γ our strategy could balance the computation time of the first setting with the communication time of the second one, while ensuring scalability in big data applications.

Fig. 1. — (a) A simplified view of the modern HPC system and (b) Algorithms on this architecture.

Thus, our contributions are as follows: (1) we propose and analyze a hybrid asynchronous shared memory and asynchronous distributed memory implementation (Hybrid-DCA) of the widely used DCA algorithm to solve (1); (2) we prove a strong guarantee of convergence for L-Lipschitz continuous loss functions, and further linear convergence when a smooth convex loss function is used; and (3) the experimental results using our light-weight OpenMP+MPI implementation show that our algorithms are much faster than existing distributed memory algorithms [8,11], and easily scale up with the volume of data in comparison with the shared memory based algorithms [7], whose shared memory size is limited.

2. Related work

Sequential Algorithms.

SGD is the oldest and simplest method for solving problem (1). Though SGD is easy to implement and converges to modest accuracy quickly, it requires a long tail of iterations to reach ‘good’ solutions and also requires adjusting a step-size parameter. On the other hand, SDCA methods are free of learning-rate parameters and converge faster near the end [14,15]. A modified SGD has also been proposed with faster convergence by switching to SDCA after quickly reaching a modest solution [18]. Recently, ‘variance reduced’ modifications to the original SGD have also attracted attention. These modifications estimate stochastic gradients with corrections to reduce the estimation variance. Mini-batch algorithms have also been proposed to update several dual variables (data points) in a batch rather than a single data point per iteration [22]. Mini-batch versions of both SGD and SDCA converge more slowly as the batch size increases [17,19]. These sequential algorithms become ineffective as the datasets get bigger.

Distributed Algorithms.

In the early single communication scheme [5,12,13], a dataset is ‘decomposed’ into smaller parts that can be solved independently. The final solution is reached by ‘accumulating’ the partial solutions using a single round of communications. This method has limited utility because most datasets cannot be decomposed in such a way. Using the primal-dual relationship (3), fully distributed DCA algorithms were later developed in which each processor updates a separate α_i, which is then used to update w(α), and w is synchronized across all processors (e.g., CoCoA [8]). To trade off communication against computation, a processor can solve its subproblem with H dual updates before synchronizing the primal variable (e.g., CoCoA+ [11], DisDCA [23]). In [11,23], a more general framework is proposed in which the subproblem can be solved using not only SDCA but also any other sequential solver that can guarantee a Θ-approximation of the local solution at a processor for some Θ ∈ (0, 1]. Nevertheless, the synchronized update to the primal variables has the inherent drawback that the overall algorithm runs at the speed of the slowest processor even when there are fast processors [1].

Parallel Algorithms.

Multi-core shared memory systems have also been exploited, where the primal variables are maintained in a shared memory, removing the communication cost. However, updates to shared memory require synchronization primitives, such as locks, which again slow down computation. Recent methods [7,10] avoid locks by exploiting (asynchronous) atomic memory updates in modern memory systems. There is even a wild version in [7] that arbitrarily takes one of the simultaneous updates. Though the shared memory algorithms are faster than the distributed versions, they have the inherent drawback of not being scalable, as there can be only a few cores on a processor board.

Other Distributed Methods for RRM.

Besides distributed DCA methods, there are several recent distributed versions of other algorithms with faster convergence, including distributed Newton-type methods (DISCO [28], DANE [20]) and distributed stochastic variance reduced gradient method (DSVRG [9]). It has been shown that they can achieve the same accurate solution using fewer rounds of communication, however, with additional computational overhead. In particular, DISCO and DANE need to solve a linear system in each round, which could be very expensive for higher dimensions. DSVRG requires each machine to load and store a second subset of the data sampled from the original training data, which also increases its running time.

ADMM [2] and quasi-Newton methods such as L-BFGS also have distributed solutions. These methods have low communication cost; however, their inherent drawback of computing the full batch gradient leaves no room for a computation versus communication trade-off. In the context of consensus optimization, [26] gives an asynchronous distributed ADMM algorithm, but it does not directly apply to solving (1).

To the best of our knowledge, this paper is the first to propose, implement and analyze a hybrid approach exploiting modern HPC architecture. Our approach is the amalgamation of three different ideas – (1) CoCoA+/DisDCA distributed framework, (2) asynchronous multi-core shared-memory solver [7] and (3) asynchronous distributed approach [27] – taking the best of each of them. In a sense ours is the first algorithm which asynchronously uses updates which themselves have been computed using asynchronous methods.

3. The Proposed Algorithm

At the core of our algorithm, the data are distributed across K nodes. Each node, called a worker, repeatedly solves a perturbed dual formulation on its data partition and sends the local update to one of the workers, which is additionally designated as the master. The master merges the local updates and sends the accumulated global update back to the workers, which solve the subproblem once again unless global convergence has been reached. Let $I_{k} \subseteq \{1, 2, \dots, n\}$ , k = 1, …, K denote the indices of the data and the dual variables residing on node k and $n_{k} = ∣ I_{k} ∣$ . For any $α \in R^{n}$ let α_[k] denote the vector in $R^{n}$ whose ith component is (α_[k])_i = α_i if $i \in I_{k}$ and 0 otherwise, so that α = ∑_k α_[k]. Let $X_{[k]} \in R^{d \times n}$ denote the matrix that keeps the columns of $X \in R^{d \times n}$ indexed by $I_{k}$ and has zeros in all other columns, so that X = ∑_k X_[k].
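The zero-padded block notation can be illustrated with a small sketch. The helper name `split_dual` and the toy values are hypothetical, and indices are 0-based here, whereas the text counts from 1.

```python
def split_dual(alpha, index_sets):
    """Return the zero-padded vectors alpha_[k], one per node."""
    n = len(alpha)
    parts = []
    for I_k in index_sets:
        members = set(I_k)
        # Keep alpha_i on the coordinates owned by node k, zero elsewhere.
        parts.append([alpha[i] if i in members else 0.0 for i in range(n)])
    return parts

alpha = [0.5, -1.0, 2.0, 0.25, 3.0, -0.75]
index_sets = [[0, 1], [2, 3], [4, 5]]        # I_1, I_2, I_3 for K = 3 nodes
parts = split_dual(alpha, index_sets)
total = [sum(col) for col in zip(*parts)]    # summing the blocks recovers alpha
```

Each `parts[k]` has full length n, which is why the blocks can simply be summed to recover α; the same convention applies to the column blocks X_[k].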

Ideally, the dual problem solved by node k is (2) with X, α replaced by X_[k], α_[k], respectively, and hence is independent of other nodes. However, following the efficient practical implementation in [11,23], we let the workers communicate among them a vector $v \in R^{d}$ , an estimate of $w (α) = \frac{1}{λ n} X α$ that summarizes the last known global solution α. Also following [11,23] for faster convergence, each worker in our algorithm solves the following perturbed local dual problem, which we henceforth call the subproblem:

\max_{δ_{[k]} \in ℝ^{n}} D_{k} (δ_{[k]}; v, α_{[k]}) : = - \frac{1}{n_{k}} \sum_{i \in I_{k}} ϕ^{*} (- α_{i} - δ_{i}) - \frac{λ}{2 S} g^{*} (v) - 〈 \frac{1}{n} X_{[k]}^{⊺} \nabla g^{*} (v), δ_{[k]} 〉 - \frac{λ σ}{2} {‖ \frac{1}{λ n} X_{[k]} δ_{[k]} ‖}^{2}

(4)

where δ_[k] denotes the local (incremental) update to the dual variable α_[k], the bounded barrier parameter S denotes the number of workers whose updates are merged by the master in a global iteration, and the scaling parameter σ measures the difficulty of solving the given data partition (see [11,23]) and must be chosen such that

σ \geq σ_{\min} : = ν \max_{α \in ℝ^{n}} \frac{{‖ X α ‖}^{2}}{\sum_{k = 1}^{K} {‖ X_{[k]} α_{[k]} ‖}^{2}}

(5)

where the aggregation parameter $ν \in [\frac{1}{S}, 1]$ is the weight given by the master to each of the local updates from the contributing workers while computing the global update. Unlike the synchronous all-reduce approach in [11], our asynchronous method merges the local updates from only S out of K nodes in each global update, and the second term in the objective of our subproblem (4) has denominator S instead of K. By Lemma 3.2 in [11], σ ≔ νS is a safe choice that satisfies condition (5).
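The ratio inside condition (5) can be probed numerically. The sketch below is an illustration under stated assumptions, not part of the paper's analysis: it evaluates the ratio for a toy partition and relies on the Cauchy–Schwarz fact that the squared norm of a sum of m vectors is at most m times the sum of their squared norms. All K blocks are summed here, so the bound checked is K. The function name `ratio` is hypothetical.

```python
import random

def ratio(X_cols, alpha, index_sets):
    """||X alpha||^2 / sum_k ||X_[k] alpha_[k]||^2, with X_cols[i] = x_i."""
    d = len(X_cols[0])
    # One d-dimensional vector X_[k] alpha_[k] per block.
    block_vecs = [[sum(alpha[i] * X_cols[i][j] for i in I_k) for j in range(d)]
                  for I_k in index_sets]
    total = [sum(col) for col in zip(*block_vecs)]          # equals X alpha
    num = sum(t * t for t in total)
    den = sum(sum(c * c for c in vec) for vec in block_vecs)
    return num / den

rng = random.Random(0)
X_cols = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
index_sets = [[0, 1], [2, 3], [4, 5]]                        # K = 3 blocks
worst = max(ratio(X_cols, [rng.gauss(0, 1) for _ in range(6)], index_sets)
            for _ in range(200))
```

The bound is attained when all blocks contribute the same vector, for example three identical data points with equal dual values, so a value of the order of the number of merged blocks cannot be improved in general.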

3.1. Asynchronous updates by cores in a worker node

In each communication round, each worker k solves its subproblem using a parallel asynchronous DCA method [7] on the R cores. Let the data partition $I_{k}$ stored in the shared memory be logically divided into R subparts where subpart $I_{k, r} \subseteq I_{k}$ , r = 1, …, R, is exclusively used by core r. In each of the H iterations, core r chooses a random coordinate $i \in I_{k, r}$ and updates δ_[k] in the ith unit direction by a step size ε computed using a single variable optimization problem:

ε = \underset{ε \in ℝ}{\arg \max} D_{k} (ε e_{i}; v, α_{[k]} + δ_{[k]})

(6)

which has a closed form solution for SVM problems [4], and a solution using an iterative solver for logistic regression problems [24]. The local updates to v are also maintained appropriately. Because the cores draw their random coordinates from the disjoint subparts $I_{k, r}$ , the corresponding updates to δ_[k] are independent of each other. However, there might be conflicts in the updates to v if the corresponding columns (the examples allocated to different cores) in X have nonzero values in the same row (resulting in different updates at the same element position of v). We use lock-free atomic memory updates to handle such conflicts. When all cores complete H iterations, worker k sends the accumulated update Δv from the current round to the master, waits until it receives the globally updated v from the master, and repeats for another round unless the master indicates termination.

Algorithm 1:

Hybrid-DCA: Worker k

Algorithm 2:

Hybrid-DCA: Master

3.2. Merging updates from workers by master

If the master had to wait for the updates from all the workers, it could compute the global updates only after the slowest worker finished. To avoid this problem, we use a bounded barrier: in each round t, the master waits for updates from only a subset $P_{S}$ of workers of size S ≤ K, and sends them back the global update $v^{(t + 1)} = v^{(t)} + ν \sum_{k \in P_{S}^{(t)}} Δ v_{k}^{(t)}$ . However, due to this relaxation, there might be some slow workers with out-of-date v. When updates from such workers are merged by the master, they may degrade the quality of the global solution and hence may cause slow convergence or even divergence. We ensure sufficient freshness of the updates using a bounded delay: the master makes sure that no worker has a stale update older than Γ rounds. This asynchronous approach has two benefits: (1) the overall progress is no longer bottlenecked by the slowest processor, and (2) the total number of communications is reduced. On the flip side, convergence may slow down for very small S or very large Γ.
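The bounded barrier and bounded delay can be sketched as a selection rule on the master. This is a simplified illustration, not the authors' Algorithm 2. It assumes a fixed arrival order of worker updates, treats a worker as stale in round t when its last merged round is more than Γ rounds old, and fills the remaining slots with the earliest arrivals. The names `choose_contributors`, `merge`, and `run_rounds` are hypothetical.

```python
def choose_contributors(arrival_order, last_merged, t, S, Gamma):
    """Pick the workers whose updates are merged in global round t."""
    # Bounded delay: workers that are too stale must be waited for.
    stale = [k for k in arrival_order if t - last_merged[k] > Gamma]
    chosen = list(stale)
    # Bounded barrier: fill up to S contributors with the earliest arrivals.
    for k in arrival_order:
        if len(chosen) >= S:
            break
        if k not in chosen:
            chosen.append(k)
    return chosen

def merge(v, deltas, chosen, nu):
    """Global update v + nu * (sum of Delta v_k over the chosen workers)."""
    return [vj + nu * sum(deltas[k][j] for k in chosen) for j, vj in enumerate(v)]

def run_rounds(arrival_order, rounds, S, Gamma):
    last_merged = {k: 0 for k in arrival_order}   # 0 means "never merged yet"
    history = []
    for t in range(1, rounds + 1):
        chosen = choose_contributors(arrival_order, last_merged, t, S, Gamma)
        for k in chosen:
            last_merged[k] = t
        history.append(sorted(chosen))
    return history

history = run_rounds([2, 3, 1], rounds=3, S=2, Gamma=2)
```

With workers 2 and 3 always arriving before worker 1, S = 2 and Γ = 2, the first two rounds merge workers 2 and 3, and the third round is forced to include worker 1.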

Example.

Fig. 2 shows a possible sequence of important events in a run of our algorithm on a dataset having n = 12 data points in d = 3 dimensions using K = 3 nodes each having R = 2 cores such that each core works with only $∣ I_{k, r} ∣ = 2$ data points. The activities involved in solving the subproblem with H = 1 local iteration in a round are shown in a rectangular box. For the first subproblem, core 1 and core 2 in worker 1 randomly select dual coordinates such that the corresponding data points have nonzero entries in the dimensions {1, 3} and {1, 2, 3}, respectively. Each core first reads the entries of v corresponding to these nonzero data dimensions, then computes the updates [0.1, 0, 0.7] and [0.15, 0.5, 0.4], respectively, and finally applies these updates to v (where v₁ is first updated to 0.1 + 0.15 = 0.25 by both cores, then v₂ is updated to 0.5 by core 2 while v₃ is updated to 0.7 by core 1, and then v₃ is augmented by 0.4 by core 2 to reach 0.7 + 0.4 = 1.1). The atomic memory updates ensure that all the conflicting writes to v, such as v₁ in the first write-cycle, happen completely. At the end of H local iterations by each core, worker 1 sends Δv = [0.25, 0.5, 1.1] to the master, whose role is taken by one of the 3 nodes but which is shown separately in the figure. By this time, the faster workers 2 and 3 have already completed 3 rounds. As S = 2, the master takes the first 2 updates from $P_{S}^{(1)} = P_{S}^{(2)} = \{2, 3\}$ and computes the global updates using ν = 1. However, as Γ = 2, the master holds back the third updates from workers 2 and 3 until the first update from worker 1 reaches the master. The subsequent events in the run are omitted in the figure.
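The arithmetic of the conflicting writes in this example can be reproduced with a small emulation. Python has no atomic floating-point add, so the sketch below stands in one lock per coordinate for each atomic write; this is an assumption of the illustration and not the lock-free mechanism used in the paper. The function name `apply_updates` is hypothetical, and coordinates are 0-based.

```python
import threading

def apply_updates(v, updates):
    """Apply each core's sparse update {coordinate: increment} to the shared v."""
    # One lock per coordinate emulates an atomic add on that element of v.
    locks = [threading.Lock() for _ in v]

    def core(update):
        for j, inc in update.items():
            with locks[j]:
                v[j] += inc

    threads = [threading.Thread(target=core, args=(u,)) for u in updates]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return v

# Core 1 touches dimensions {1, 3}; core 2 touches dimensions {1, 2, 3}.
delta_v = apply_updates([0.0, 0.0, 0.0],
                        [{0: 0.1, 2: 0.7}, {0: 0.15, 1: 0.5, 2: 0.4}])
```

Whatever the interleaving of the two threads, every increment lands completely, so `delta_v` matches the Δv = [0.25, 0.5, 1.1] sent by worker 1 up to floating-point rounding.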

3.3. Communication cost analysis

In each communication round, the algorithms based on synchronous updates on all K nodes require 2K transmissions, each consisting of all values of v or Δv. Half of these transmissions are from the workers to the master and the rest are from the master to the workers. In contrast, our asynchronous update scheme requires only 2S transmissions in each round.
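As a quick illustration of this count, the helper below tallies transmissions and transmitted values per round; the function names are hypothetical and the numbers reuse the toy setting K = 3, S = 2, d = 3 from the example above.

```python
def transmissions_per_round(contributors):
    """Each contributing worker sends one vector and receives one back."""
    return 2 * contributors

def values_per_round(contributors, d):
    """Every transmission carries all d values of v or Delta v."""
    return transmissions_per_round(contributors) * d

sync_cost = values_per_round(3, 3)     # synchronous: all K = 3 workers
async_cost = values_per_round(2, 3)    # bounded barrier: only S = 2 workers
```

The saving per round is proportional to K − S, which is why a smaller S reduces communication at the price of possibly slower convergence.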

4. Convergence Analysis

In this section we prove the convergence of the global solution computed by our hybrid algorithm. We give the proof for the regularizer g(w) = ‖w‖² as an example; the proof can be similarly extended to other regularizers g(w). For this special case, g*(v) = ‖v‖², ∇g*(v) = 2v, and the simplified dual formulation is the following

\max_{α \in ℝ^{n}} D (α) : = - \frac{1}{n} \sum_{i = 1}^{n} ϕ^{*} (- α_{i}) - \frac{λ}{2} {‖ \frac{1}{λ n} X α ‖}^{2},

(7)

and the corresponding subproblem formulation is the following

\max_{δ_{[k]} \in ℝ^{n}} D_{k} (δ_{[k]}; v, α_{[k]}) : = - \frac{1}{n_{k}} \sum_{i \in I_{k}} ϕ^{*} (- α_{i} - δ_{i}) - \frac{λ}{2 S} {‖ v ‖}^{2} - 〈 \frac{1}{2 n} X_{[k]}^{⊺} v, δ_{[k]} 〉 - \frac{λ σ}{2} {‖ \frac{1}{λ n} X_{[k]} δ_{[k]} ‖}^{2} .

(8)

The analysis is divided into two parts. First we show that the solution of the subproblem computed by each node locally is indeed not far from the optimum of the subproblem. Using this result on the subproblem, we next show the convergence of the global solution. Though our proofs for the two parts are based on the works [7] and [11], respectively, we need to make significant adjustments in the proofs due to our modified framework handling two cascaded levels of asynchronous updates.

In our analysis we focus on all the events that are important for the local updates that are merged by the master in global update t. Fig. 3 shows an example where the master merges local updates from S = 2 workers.

4.1. Near optimality of the solution to the local subproblem

In this section we prove that the solution returned by the parallel asynchronous stochastic DCA solver used by each worker k in Algorithm 1 is not far from the optimal solution for the subproblem (4).

Definition 1 (Θ-approximate). For given v, α_[k], a solution δ_[k] to the subproblem (4) is said to be Θ-approximate, Θ ∈ [0, 1), if

E [D_{k} (δ_{[k]}^{*}; v, α_{[k]}) - D_{k} (δ_{[k]}; v, α_{[k]})] \leq Θ (D_{k} (δ_{[k]}^{*}; v, α_{[k]}) - D_{k} (0; v, α_{[k]}))

(9)

where $δ_{[k]}^{*}$ is an optimum solution to (4).
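Definition 1 can be read as a relative suboptimality: Θ = 0 means the subproblem is solved exactly and Θ = 1 means no progress over the zero update. The sketch below computes this ratio for a deterministic toy objective, ignoring the expectation; the function name `theta_of` and the one-dimensional concave objective are hypothetical.

```python
def theta_of(obj, delta, delta_star, zero):
    """Smallest Theta for which `delta` satisfies (9), without the expectation."""
    best = obj(delta_star)
    return (best - obj(delta)) / (best - obj(zero))

# Toy concave "subproblem" with maximizer delta* = 2.
obj = lambda d: -(d - 2.0) ** 2

theta_half_way = theta_of(obj, 1.0, 2.0, 0.0)   # partial progress
theta_exact = theta_of(obj, 2.0, 2.0, 0.0)      # exact solution
theta_none = theta_of(obj, 0.0, 2.0, 0.0)       # no progress at all
```

Moving halfway toward the maximizer already removes three quarters of the initial suboptimality in this quadratic example, which gives Θ = 0.25.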

Though our proof is based on the results in [7], the main challenge here is to tackle the following two modifications in our approach: (1) the solver here solves only a part of the dual problem and (2) the subproblem is now perturbed (Section 3). While the first modification is simply handled by considering the updates by the cores in worker k only, the second modification needs changes in each step of the proof in [7]. We give below the complete details of each of the steps.

Worker k solves subproblem (4) by applying a total of R × H updates: each of its R cores makes H updates. To show that the solution is Θ-approximate, we need to show that sufficient progress is made between any two successive updates. However, each core makes multiple atomic memory writes in an update and the updates made by different cores are interleaved, so it is difficult to demarcate two successive updates. Nevertheless, following the order in which the cores select a data point $\in I_{k}$ as in step 6 of Algorithm 1, we assign a node-level counter j to each of the R × H updates in node k and let the index i(j) denote the data point selected for update j ∈ {1, …, RH}. Fig. 3 shows an example of local updates i(j).

In each update, a worker core computes the step size ε using line 6 of Algorithm 1 and then applies ε along the i(j)th axis of α_[k]. However, a few subtle points arise from the atomic updates. Firstly, when a core computes ε it starts with a v, but by the time it reads a coordinate of v some other core might have already modified some other coordinates. So the effective v that a core uses to compute the increment ε might not be the actual v in memory, and in fact it might never exist in memory at any time. Let $\bar{v}$ denote the effective v that a core uses to compute ε, and let $\hat{v}$ denote the actual v value in memory. Fig. 4 helps readers connect the different notation and updates used in the proofs in this section.

Fig. 4. — Relationship among different approximations of α.

For all $i \in I_{k}$ , we have the following definitions:

h_{i} (u) : = \frac{ϕ_{i}^{*} (- u)}{n {‖ x_{i} ‖}^{2}} + \frac{λ}{2} (\frac{1}{S} - \frac{1}{σ}) \frac{{‖ w ‖}^{2}}{{‖ x_{i} ‖}^{2}}

{prox}_{i} (s) : = \underset{u}{\arg \min} \frac{1}{2} {(u - s)}^{2} + h_{i} (u)

\begin{aligned}
T_{i} (w, s) : = & \underset{u}{\arg \max} - \frac{1}{σ} \frac{λ}{2} {‖ w ‖}^{2} - \frac{1}{n} w^{⊺} x_{i} (u - s) - \frac{λ}{2} σ {‖ \frac{1}{λ n} x_{i} (u - s) ‖}^{2} - \frac{1}{n} ϕ_{i}^{*} (- u) - \frac{λ}{2} (\frac{1}{S} {‖ w ‖}^{2} - \frac{1}{σ} {‖ w ‖}^{2}) \\
= & \underset{u}{\arg \max} - \frac{λ}{2} {‖ \frac{w}{\sqrt{σ}} + \frac{\sqrt{σ}}{λ n} (u - s) x_{i} ‖}^{2} - {‖ x_{i} ‖}^{2} h_{i} (u) \\
= & \underset{u}{\arg \min} \frac{1}{2} {(u - (s - \frac{λ n w^{⊺} x_{i}}{σ {‖ x_{i} ‖}^{2}}))}^{2} + h_{i} (u),
\end{aligned}

where $w \in R^{d}$ denotes any fixed vector, $s \in R$ , and prox_i(s) denotes the proximal operator of h_i. The operator T_i is connected to the proximal operator by $T_{i} (w, s) = {prox}_{i} (s - \frac{λ n w^{⊺} x_{i}}{σ {∥ x_{i} ∥}^{2}})$ . Here both h_i(u) and T_i(w,s) were revised from [7] to match the subproblem (4).

Let $\bar{X}$ denote the normalized data matrix at the k-th local atomic solver (omitting the subscript _[k]), where each row is ${\bar{x}}_{i}^{⊺} = x_{i}^{⊺} ∕ ∥ x_{i} ∥$ , $i \in I_{k}$ . Define $M_{[k], i} = \max_{D \subseteq [d]} ∥ \sum_{t \in D} {\bar{X}}_{(:, t)} X_{(i, t)} ∥$ and M = max_k max_i M_[k],i over all the local atomic solvers, where [d] is the set of all the feature indices and ${\bar{X}}_{(:, t)}$ is the t-th column of $\bar{X}$ . Moreover, R_min is defined as the minimum squared norm over the data points of the global data matrix, i.e. R_min = min_{i=1,…,n} ‖x_i‖². We then define the following.

Definition 2 (Local atomic dual variables). Here we omit _[k] in the proofs of the local atomic solver.

\begin{matrix} β_{c}^{l + 1} = \begin{cases} T_{c} ({\hat{w}}^{l}, β_{c}^{l}) & \text{if } c = i (l), \\ β_{c}^{l} & \text{if } c \neq i (l), \end{cases} & ε^{l} = β_{i (l)}^{l + 1} - β_{i (l)}^{l}, \\ {\tilde{β}}^{l + 1} = T ({\hat{w}}^{l}, β^{l}), & {\bar{β}}^{l + 1} = T ({\bar{w}}^{l}, β^{l}), \end{matrix}

where $β^{l} = α_{[k]} + ν δ_{[k]}^{l}$ denotes the l-th iterate of the sequence generated by the k-th local atomic solver, ${\hat{w}}^{l}$ denotes the actual values of w maintained at update l in the local atomic solver; i(l) indicates the index selected at the l-th update; and ${\bar{w}}^{l}$ refers to the “accurate” w if all cores are synchronously updated at iteration l. Note that ${\tilde{β}}_{i (l)}^{l + 1} = β_{i (l)}^{l + 1}$ and ${\tilde{β}}^{l + 1} = prox (β^{l} - \frac{λ n}{σ} \bar{X} {\hat{w}}^{l})$ . Since v and α_[k] do not change while the local subproblem D_k(δ_[k]; v, α_[k]) is being solved, we denote by Q^σ(β^l) the objective value of the dual subproblem at the l-th update and omit the subscripts _[k].

Assumption 1 (Lipschitz Continuous). The global problem objective (2) is L_max-Lipschitz continuous and therefore the objectives of its local subproblems (4) are at most L_max-Lipschitz continuous.

The following propositions are cited from [7], and we use these results in our proof.

Proposition 1 (Expectation of Dual Variables).

E_{i (l)} ({‖ β^{l + 1} - β^{l} ‖}^{2}) = \frac{1}{n} {‖ {\tilde{β}}^{l + 1} - β^{l} ‖}^{2},

(10)

Proposition 2 (Boundary of Asynchronous Variables).

‖ \bar{X} {\bar{w}}^{l} - \bar{X} {\hat{w}}^{l} ‖ \leq \frac{1}{λ n} M \sum_{c = l - γ}^{l} | ε^{c} |,

(11)

Proposition 3.

| T_{i} (w_{1}, s_{1}) - T_{i} (w_{2}, s_{2}) | \leq | s_{1} - s_{2} + \frac{{(w_{1} - w_{2})}^{⊺} x_{i}}{{‖ x_{i} ‖}^{2}} |,

(12)

Proposition 4. Let M ≥ 1, $q = \frac{6 (γ + 1) e M}{\sqrt{n}}$ , ρ = (1 + q)², and $θ = \sum_{t = 1}^{γ} ρ^{t ∕ 2}$ . If q(γ + 1) ≤ 1 and σ ≥ 1, then $ρ^{(γ + 1) ∕ 2} \leq e$ , and

ρ^{- 1} \leq 1 - \frac{4}{\sqrt{n}} - \frac{4 M + 4 M θ}{\sqrt{n}} \leq 1 - \frac{4}{\sqrt{n}} - \frac{4 M + 4 M θ}{σ \sqrt{n}},

(13)
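The quantities in Proposition 4 are easy to evaluate for concrete values. The sketch below is a numerical sanity check for one hypothetical setting (n = 10⁶, M = 1, γ = 6, σ = 2), not a proof; it assumes θ sums the powers ρ^{t/2} for t = 1, …, γ. The first claim follows from (1 + q)^{γ+1} ≤ exp(q(γ + 1)) ≤ e whenever q(γ + 1) ≤ 1.

```python
import math

def prop4_quantities(n, M, gamma):
    """Return (q, rho, theta) as defined in Proposition 4."""
    q = 6 * (gamma + 1) * math.e * M / math.sqrt(n)
    rho = (1 + q) ** 2
    theta = sum(rho ** (t / 2) for t in range(1, gamma + 1))
    return q, rho, theta

n, M, gamma, sigma = 10**6, 1.0, 6, 2.0      # hypothetical values for illustration
q, rho, theta = prop4_quantities(n, M, gamma)

lhs = 1 / rho
mid = 1 - 4 / math.sqrt(n) - (4 * M + 4 * M * theta) / math.sqrt(n)
rhs = 1 - 4 / math.sqrt(n) - (4 * M + 4 * M * theta) / (sigma * math.sqrt(n))

precondition_ok = q * (gamma + 1) <= 1
power_bound_ok = rho ** ((gamma + 1) / 2) <= math.e
chain_ok = lhs <= mid <= rhs
```

The second inequality in (13) only uses σ ≥ 1: dividing the subtracted term by σ can only make it smaller.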

Proposition 5 (Properties of Dual Concave Function). For all l > 0, we have

Q^{σ} (β^{l}) \leq Q^{σ} ({\bar{β}}^{l + 1}) - \frac{σ {‖ x_{i (l)} ‖}^{2}}{2} {‖ β^{l} - {\bar{β}}^{l + 1} ‖}^{2},

(14)

Q^{σ} (β^{l}) \geq Q^{σ} ({\bar{β}}^{l + 1}) - \frac{L_{\max}}{2} {‖ β^{l} - {\bar{β}}^{l + 1} ‖}^{2}

(15)

Proof. Two properties of dual concave function are stated as follows.

Strong convexity of Q^σ(β^l): because all conjugate functions are convex, Q^σ(β^l) is σ‖x_i(l)‖²-strongly convex.
Lipschitz continuous gradient of Q^σ(β^l): this follows from Assumption 1. □

Because of the atomic updates, the step size computation may not include all the latest updates, but we assume all the updates before the (l − γ)-th update have already been written into v.

Assumption 2 (Bounded Delay of Local Updates, γ).

{(γ + 1)}^{2} \leq \frac{\sqrt{n_{k}}}{6 e M}, \quad \text{where } e \text{ is Euler's number.}

(16)

This assumption restricts the maximum allowed local delay γ by M and n_k.
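To get a feel for how restrictive this bound is, the sketch below computes the largest integer delay γ compatible with (16) for given n_k and M. The function name `max_local_delay` is hypothetical, and returning `None` when even γ = 0 violates the bound is a convention of this illustration.

```python
import math

def max_local_delay(n_k, M):
    """Largest integer gamma >= 0 with (gamma + 1)**2 <= sqrt(n_k) / (6 e M)."""
    bound = math.sqrt(n_k) / (6 * math.e * M)
    # (gamma + 1)**2 is an integer, so compare against the floor of the bound.
    gamma = math.isqrt(int(bound)) - 1
    return gamma if gamma >= 0 else None

gamma_large = max_local_delay(10**6, 1.0)   # a partition with a million points
gamma_small = max_local_delay(100, 1.0)     # a partition that is too small
```

Because the bound grows only with the fourth root of n_k, the admissible delay increases slowly with the partition size and shrinks as M grows.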

Lemma 6. Let Assumption 2 and Definition 2 hold, and let $ρ = {(1 + \frac{6 (γ + 1) e M}{\sqrt{n_{k}}})}^{2}$ . Then the local subproblem satisfies:

E [{‖ β_{[k]}^{l - 1} - {\tilde{β}}_{[k]}^{l} ‖}^{2}] \leq ρ E [{‖ β_{[k]}^{l} - {\tilde{β}}_{[k]}^{l + 1} ‖}^{2}] .

(17)

Note that l ≠ h: l indexes the lth update to w in a local solver, not the hth iteration of a single core.

Proof. We omit the subscript _[k] in the notations, which specifies the kth data partition, in the proof. We prove Eq. (17) by induction. As shown in [7], we have

{‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2} - {‖ β^{l} - {\tilde{β}}^{l + 1} ‖}^{2} \leq 2 ‖ β^{l - 1} - {\tilde{β}}^{l} ‖ ‖ β^{l} - {\tilde{β}}^{l + 1} - β^{l - 1} + {\tilde{β}}^{l} ‖ .

(18)

The second factor in the r.h.s of Eq (18) is bounded as follows with the revisions:

\begin{aligned}
‖ β^{l} - {\tilde{β}}^{l + 1} - β^{l - 1} + {\tilde{β}}^{l} ‖ & \leq ‖ β^{l} - β^{l - 1} ‖ + ‖ prox (β^{l} - \frac{λ n}{σ} \bar{X} {\hat{w}}^{l}) - prox (β^{l - 1} - \frac{λ n}{σ} \bar{X} {\hat{w}}^{l - 1}) ‖ \\
& \leq ‖ β^{l} - β^{l - 1} ‖ + ‖ (β^{l} - \frac{λ n}{σ} \bar{X} {\hat{w}}^{l}) - (β^{l - 1} - \frac{λ n}{σ} \bar{X} {\hat{w}}^{l - 1}) ‖ \\
& \leq 2 ‖ β^{l} - β^{l - 1} ‖ + \frac{λ n}{σ} ‖ \bar{X} {\hat{w}}^{l} - \bar{X} {\hat{w}}^{l - 1} ‖ \\
& = 2 ‖ β^{l} - β^{l - 1} ‖ + \frac{λ n}{σ} ‖ \bar{X} {\hat{w}}^{l} - \bar{X} {\bar{w}}^{l} + \bar{X} {\bar{w}}^{l} - \bar{X} {\bar{w}}^{l - 1} + \bar{X} {\bar{w}}^{l - 1} - \bar{X} {\hat{w}}^{l - 1} ‖ \\
& \leq 2 ‖ β^{l} - β^{l - 1} ‖ + \frac{λ n}{σ} (‖ \bar{X} {\bar{w}}^{l} - \bar{X} {\bar{w}}^{l - 1} ‖ + ‖ \bar{X} {\hat{w}}^{l} - \bar{X} {\bar{w}}^{l} ‖ + ‖ \bar{X} {\bar{w}}^{l - 1} - \bar{X} {\hat{w}}^{l - 1} ‖) \\
& \leq (2 + 2 \frac{λ n}{σ} \frac{M}{λ n}) ‖ β^{l} - β^{l - 1} ‖ + 2 \frac{λ n}{σ} \frac{M}{λ n} \sum_{c = l - γ - 1}^{l - 2} | ε^{c} | \quad \text{(Proposition 2)}
\end{aligned}

(19)

\leq (2 + 2 \frac{M}{σ}) ‖ β^{l} - β^{l - 1} ‖ + 2 \frac{M}{σ} \sum_{c = l - γ - 1}^{l - 2} | ε^{c} |

(20)

Now we start the induction. Although some steps may be the same as the steps in [7], we still keep them here to make the proof self-contained.

Induction Hypothesis.

We prove the following equivalent statement. For all l,

E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) \leq ρ E ({‖ β^{l} - {\tilde{β}}^{l + 1} ‖}^{2}),

Induction Basis.

When l = 1,

E ({‖ β^{0} - {\tilde{β}}^{1} ‖}^{2}) - E ({‖ β^{1} - {\tilde{β}}^{2} ‖}^{2}) \leq 2 E (‖ β^{0} - {\tilde{β}}^{1} ‖ ‖ β^{1} - {\tilde{β}}^{2} - β^{0} + {\tilde{β}}^{1} ‖) \leq (4 + 4 \frac{M}{σ}) E (‖ β^{0} - {\tilde{β}}^{1} ‖ ‖ β^{0} - β^{1} ‖) .

We use Proposition 1 together with the AM–GM inequality, which states that for any b₁, b₂ > 0 and any c > 0,

b_{1} b_{2} \leq \frac{1}{2} (c b_{1}^{2} + c^{- 1} b_{2}^{2})

(21)
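For completeness, Eq. (21) follows from expanding a single non-negative square. The short derivation below is a standard argument and is not taken from the paper; the remark about the choice of c is read off from the step that follows.

```latex
0 \le \left( \sqrt{c}\, b_1 - \frac{b_2}{\sqrt{c}} \right)^{2}
  = c\, b_1^{2} - 2 b_1 b_2 + c^{-1} b_2^{2}
\quad\Longrightarrow\quad
b_1 b_2 \le \tfrac{1}{2} \left( c\, b_1^{2} + c^{-1} b_2^{2} \right).
% In the induction basis the inequality is applied with
% b_1 = \| \beta^{0} - \beta^{1} \|, \; b_2 = \| \beta^{0} - \tilde{\beta}^{1} \|, \; c = \sqrt{n}.
```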

Therefore, we have

E (‖ β^{0} - {\tilde{β}}^{1} ‖ ‖ β^{0} - β^{1} ‖) \leq \frac{1}{2} E (\sqrt{n} {‖ β^{0} - β^{1} ‖}^{2} + \frac{1}{\sqrt{n}} {‖ β^{0} - {\tilde{β}}^{1} ‖}^{2}) = \frac{1}{2} E (\frac{1}{\sqrt{n}} {‖ β^{0} - {\tilde{β}}^{1} ‖}^{2} + \frac{1}{\sqrt{n}} {‖ β^{0} - {\tilde{β}}^{1} ‖}^{2}) (Proposition 1) = \frac{1}{\sqrt{n}} E ({‖ β^{0} - {\tilde{β}}^{1} ‖}^{2})

Therefore,

E ({‖ β^{0} - {\tilde{β}}^{1} ‖}^{2}) - E ({‖ β^{1} - {\tilde{β}}^{2} ‖}^{2}) \leq (\frac{4}{\sqrt{n}} + \frac{4 M}{σ \sqrt{n}}) E ({‖ β^{0} - {\tilde{β}}^{1} ‖}^{2}),

which implies

E ({‖ β^{0} - {\tilde{β}}^{1} ‖}^{2}) \leq {(1 - \frac{4}{\sqrt{n}} - \frac{4 M}{σ \sqrt{n}})}^{- 1} E ({‖ β^{1} - {\tilde{β}}^{2} ‖}^{2}) \leq ρ E ({‖ β^{1} - {\tilde{β}}^{2} ‖}^{2}),

where the last inequality is based on Proposition 4 and the fact θM ≥ 1.

Induction Step.

By the induction hypothesis, we assume

E ({‖ β^{h - 1} - {\tilde{β}}^{h} ‖}^{2}) \leq ρ E ({‖ β^{h} - {\tilde{β}}^{h + 1} ‖}^{2}) \forall h \leq l - 1.

(22)

To show

E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) \leq ρ E ({‖ β^{l} - {\tilde{β}}^{l + 1} ‖}^{2}),

we first show that for all h < l,

E (‖ β^{h} - β^{h + 1} ‖ ‖ β^{l - 1} - {\tilde{β}}^{l} ‖) \leq \frac{1}{2} E (\sqrt{n} ρ^{(h + 1 - l) / 2} {‖ β^{h} - β^{h + 1} ‖}^{2} + \frac{1}{\sqrt{n}} ρ^{(l - 1 - h) / 2} {‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) (Eq. (21)) = \frac{1}{2} E (\sqrt{n} ρ^{(h + 1 - l) / 2} E ({‖ β^{h} - β^{h + 1} ‖}^{2}) + \frac{1}{\sqrt{n}} ρ^{(l - 1 - h) / 2} {‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) = \frac{1}{2} E (\frac{1}{\sqrt{n}} ρ^{(h + 1 - l) / 2} {‖ β^{h} - {\tilde{β}}^{h + 1} ‖}^{2} + \frac{1}{\sqrt{n}} ρ^{(l - 1 - h) / 2} {‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) (Proposition 1) \leq \frac{1}{2} E (\frac{1}{\sqrt{n}} ρ^{(h + 1 - l) / 2} ρ^{l - h - 1} {‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2} + \frac{1}{\sqrt{n}} ρ^{(l - 1 - h) / 2} {‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) (Eq. (22)) \leq \frac{ρ^{(l - 1 - h) / 2}}{\sqrt{n}} E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) .

(23)

Let $θ = \sum_{h = 1}^{γ} ρ^{h ∕ 2}$ . We have

E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) - E ({‖ β^{l} - {\tilde{β}}^{l + 1} ‖}^{2}) \leq E (2 ‖ β^{l - 1} - {\tilde{β}}^{l} ‖ ((2 + 2 \frac{M}{σ}) ‖ β^{l - 1} - β^{l} ‖ + 2 \frac{M}{σ} \sum_{c = l - γ - 1}^{l - 2} ‖ β^{c} - β^{c + 1} ‖)) (Eqs. (18), (20)) = (4 + 4 \frac{M}{σ}) E (‖ β^{l - 1} - {\tilde{β}}^{l} ‖ ‖ β^{l - 1} - β^{l} ‖) + 4 \frac{M}{σ} \sum_{c = l - γ - 1}^{l - 2} E (‖ β^{l - 1} - {\tilde{β}}^{l} ‖ ‖ β^{c} - β^{c + 1} ‖) \leq \frac{4 σ + 4 M}{σ \sqrt{n}} E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) + \frac{4 M}{σ \sqrt{n}} E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) \sum_{c = l - γ - 1}^{l - 2} ρ^{(l - 1 - c) / 2} (Eq. (23)) \leq \frac{4 σ + 4 M}{σ \sqrt{n}} E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) + \frac{4 M}{σ \sqrt{n}} θ E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) \leq (\frac{4}{\sqrt{n}} + \frac{4 M + 4 M θ}{σ \sqrt{n}}) E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2})

which implies that

E ({‖ β^{l - 1} - {\tilde{β}}^{l} ‖}^{2}) \leq \frac{1}{1 - \frac{4}{\sqrt{n}} - \frac{4 M + 4 M θ}{σ \sqrt{n}}} E ({‖ β^{l} - {\tilde{β}}^{l + 1} ‖}^{2}) \leq ρ E ({‖ β^{l} - {\tilde{β}}^{l + 1} ‖}^{2})

by Proposition 4. □

Lemma 6 implies that the asynchronous updates will not pull the solution away from the optimal solution too much even if the directions of the updates are wrong.

Definition 3 (Global Error Bound). For a convex function $f : R^{n} \to R$ , the optimization problem: min_β f(β) admits a global error bound if there is a constant κ such that

‖ β - P_{S} (β) ‖ \leq κ ‖ T (β) - β ‖,

(24)

where P_S(⋅) is the Euclidean projection to the set of optimal solutions, and $T : R^{n} \to R^{n}$ is the operator defined as

T_{i} (β) = \arg \min_{u} f (β + (u - β_{i}) e_{i}) \forall i \in [n] .

The optimization problem admits a relaxed condition called global error bound from the beginning if (24) holds for any β satisfying f(β) ≤ F for some constant F.
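To make Definition 3 concrete, the following sketch evaluates the operator T and the ratio ‖β − P_S(β)‖ / ‖T(β) − β‖ for a small strongly convex quadratic f(β) = ½ βᵀAβ − bᵀβ. The quadratic, the matrix A and all function names are illustrative assumptions and do not appear in the paper; for this f the optimal set is the single point solving Aβ = b, so the projection P_S(β) is that point.

```python
def coordinate_minimizer(A, b, beta):
    """Return T(beta) for f(beta) = 0.5*beta^T A beta - b^T beta.

    T_i(beta) is the value of coordinate i that minimizes f when all
    other coordinates of beta are held fixed.
    """
    n = len(beta)
    T = []
    for i in range(n):
        off_diag = sum(A[i][j] * beta[j] for j in range(n) if j != i)
        T.append((b[i] - off_diag) / A[i][i])
    return T


def norm(v):
    return sum(x * x for x in v) ** 0.5


def error_bound_ratio(A, b, beta, beta_star):
    """Ratio ||beta - beta_star|| / ||T(beta) - beta|| (beta must not be optimal)."""
    T = coordinate_minimizer(A, b, beta)
    step = norm([t - x for t, x in zip(T, beta)])
    dist = norm([x - s for x, s in zip(beta, beta_star)])
    return dist / step


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 2.0]]      # symmetric positive definite (assumed example)
    b = [1.0, 1.0]
    beta_star = [1.0 / 3.0, 1.0 / 3.0]  # unique solution of A beta = b
    for beta in ([0.0, 0.0], [1.0, -2.0], [5.0, 0.3]):
        print(beta, error_bound_ratio(A, b, beta, beta_star))
```

For this particular A, T(β) − β = −D⁻¹A(β − β*) with D the diagonal of A, and the smallest eigenvalue of D⁻¹A is 0.5, so the ratio never exceeds κ = 2. The bound therefore holds globally here, which is the behaviour the definition asks for.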

Assumption 3. The local subproblem formulation (4) admits the global error bound from the beginning for $F = Q (δ_{[k]}^{(j)}; v^{(j)}, α_{[k]}^{(j)})$ and any update j.

The global error bound in the local subproblem helps prove that our subproblem solver achieves significant improvement after each update. It has been shown that when the loss functions are hinge loss or squared hinge loss, the global problem formulation (2) does indeed satisfy the global error bound condition [7]. The local subproblem (4) then still satisfies the global error bound within the subset α_[k].

Assumption 4 (Bounded M, L_max).

2 L_{\max} (1 + \frac{e^{2} γ^{2} M^{2}}{σ^{2} n_{k}}) (\frac{e^{2} γ^{2} M^{2}}{σ^{2} n_{k}}) \leq 1

Lemma 7 (Convergence for Subproblem). When Assumptions 2–4 hold, the solutions computed in two successive updates by the local subproblem solver have a linear convergence rate in expectation, i.e.,

E [D_{k} (δ_{[k]}^{*}) - D_{k} (δ_{[k]}^{(j)})] \leq Θ [D_{k} (δ_{[k]}^{*}) - D_{k} (δ_{[k]}^{(j - 1)})]

where $δ_{[k]}^{(j)} = δ_{[k]}^{(j)} + \sum_{h = 1}^{H} \sum_{r = 1}^{R} β_{[k]}^{h, r}$ is the δ_[k] after the jth update,

η = 1 - \frac{κ R_{\min}}{2 n L_{\max}} (1 - \frac{2 L_{\max}}{R_{\min}} (1 + \frac{e^{2} γ^{2} M^{2}}{σ^{2} \tilde{n}}) (\frac{e^{2} γ^{2} M^{2}}{σ^{2} \tilde{n}})),

$\tilde{n} = \max_{k} n_{k}$ is the size of the largest data part, and

Θ = η^{R H} .

(25)
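As a small numerical illustration of Eq. (25), the sketch below evaluates Θ = η^{RH} and the number of local updates needed to push Θ below a target. The value η = 0.99 and the function names are assumptions chosen only for illustration; the paper does not report a numerical η.

```python
import math


def local_contraction(eta, R, H):
    """Theta = eta**(R*H) for a per-update contraction factor eta in [0, 1)."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    return eta ** (R * H)


def updates_for_target(eta, target):
    """Smallest integer m = R*H such that eta**m <= target, with 0 < eta, target < 1."""
    if not (0.0 < eta < 1.0 and 0.0 < target < 1.0):
        raise ValueError("eta and target must lie in (0, 1)")
    return math.ceil(math.log(target) / math.log(eta))


if __name__ == "__main__":
    print(local_contraction(0.99, R=4, H=25))   # Theta after 100 local updates
    print(updates_for_target(0.99, 0.5))        # updates needed for Theta <= 0.5
```

Because Θ depends on R and H only through the product RH, doubling the number of cores at a fixed H has the same effect on this bound as doubling H at a fixed number of cores, provided η itself does not change.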

Proof. We again omit the subscript _[k] from the notation in the proof. We can bound the expected distance $E ({‖ {\bar{β}}^{l + 1} - {\tilde{β}}^{l + 1} ‖}^{2})$ by the following derivation.

E ({‖ {\bar{β}}^{l + 1} - {\tilde{β}}^{l + 1} ‖}^{2}) = E (\sum_{t = 1}^{n} {(T_{t} ({\bar{w}}^{l}, β_{t}^{l}) - T_{t} ({\hat{w}}^{l}, β_{t}^{l}))}^{2}) \leq E (\sum_{t = 1}^{n} {(\frac{λ n {({\bar{w}}^{l} - {\hat{w}}^{l})}^{⊺} x_{t}}{σ {‖ x_{t} ‖}^{2}})}^{2}) (Proposition 3) = \frac{λ^{2} n^{2}}{σ^{2}} E ({‖ \bar{X} ({\bar{w}}^{l} - {\hat{w}}^{l}) ‖}^{2}) \leq \frac{M^{2}}{λ^{2} n^{2}} \frac{λ^{2} n^{2}}{σ^{2}} E ({(\sum_{t = l - γ}^{l - 1} ‖ β^{t} - β^{t + 1} ‖)}^{2}) (Proposition 2) \leq \frac{M^{2}}{σ^{2}} E (γ \sum_{t = l - γ}^{l - 1} {‖ β^{t} - β^{t + 1} ‖}^{2}) (Cauchy–Schwarz inequality) \leq \frac{γ M^{2}}{σ^{2}} E (\sum_{t = 1}^{γ} ρ^{t} {‖ β^{l} - β^{l + 1} ‖}^{2}) (Lemma 6) \leq \frac{γ M^{2}}{σ^{2} n} (\sum_{t = 1}^{γ} ρ^{t}) E ({‖ β^{l} - {\tilde{β}}^{l + 1} ‖}^{2}) (Proposition 1) \leq \frac{γ^{2} M^{2}}{σ^{2} n} ρ^{γ} E ({‖ β^{l} - {\tilde{β}}^{l + 1} ‖}^{2}) \leq \frac{γ^{2} M^{2} e^{2}}{σ^{2} n} E ({‖ β^{l} - {\tilde{β}}^{l + 1} ‖}^{2}) . (Proposition 4)

(26)

Moreover,

E ({‖ {\bar{β}}^{l + 1} - β^{l} ‖}^{2}) = E ({‖ {\bar{β}}^{l + 1} - {\tilde{β}}^{l + 1} + {\tilde{β}}^{l + 1} - β^{l} ‖}^{2}) \leq E (2 ({‖ {\bar{β}}^{l + 1} - {\tilde{β}}^{l + 1} ‖}^{2} + {‖ {\tilde{β}}^{l + 1} - β^{l} ‖}^{2})) (Cauchy–Schwarz) \leq 2 (1 + \frac{γ^{2} M^{2} e^{2}}{σ^{2} n}) E ({‖ {\tilde{β}}^{l + 1} - β^{l} ‖}^{2})

(27)

The increase of the local objective function value is then bounded by

E (Q^{σ} (β^{l + 1})) - E (Q^{σ} (β^{l})) = E (- (Q^{σ} (β^{l}) - Q^{σ} ({\bar{β}}^{l + 1}))) - E ((Q^{σ} ({\bar{β}}^{l + 1}) - Q^{σ} (β^{l + 1}))) \geq E (\frac{σ {‖ x_{i (l)} ‖}^{2}}{2} {‖ β^{l} - {\bar{β}}^{l + 1} ‖}^{2}) - E (\frac{L_{\max}}{2} {‖ β^{l + 1} - {\bar{β}}^{l + 1} ‖}^{2}) (Proposition 5) \geq \frac{R_{\min}}{2 n} E ({‖ β^{l} - {\bar{β}}^{l + 1} ‖}^{2}) - \frac{L_{\max}}{2 n} E ({‖ {\tilde{β}}^{l + 1} - {\bar{β}}^{l + 1} ‖}^{2}) \geq \frac{R_{\min}}{2 n} E ({‖ β^{l} - {\bar{β}}^{l + 1} ‖}^{2}) - \frac{L_{\max}}{2 n} \frac{γ^{2} M^{2} e^{2}}{σ^{2} n} E ({‖ {\tilde{β}}^{l + 1} - {\bar{β}}^{l} ‖}^{2}) (Eq. 26) \geq \frac{R_{\min}}{2 n} E ({‖ β^{l} - {\bar{β}}^{l + 1} ‖}^{2}) - \frac{2 L_{\max}}{2 n} \frac{γ^{2} M^{2} e^{2}}{σ^{2} n} (1 + \frac{γ^{2} M^{2} e^{2}}{σ^{2} n}) E ({‖ {\bar{β}}^{l + 1} - β^{l} ‖}^{2}) (Eq. (27)) \geq \frac{R_{\min}}{2 n} (1 - \frac{2 L_{\max}}{2 n} (1 + \frac{γ^{2} M^{2} e^{2}}{σ^{2} n}) (\frac{γ^{2} M^{2} e^{2}}{σ^{2} n})) \times E ({‖ {\bar{β}}^{l + 1} - β^{l} ‖}^{2}) \geq \frac{κ R_{\min}}{2 n} (1 - \frac{2 L_{\max}}{2 n} (1 + \frac{γ^{2} M^{2} e^{2}}{σ^{2} n}) (\frac{γ^{2} M^{2} e^{2}}{σ^{2} n})) \times E ({‖ P s (β^{l}) - β^{l} ‖}^{2}) \geq \frac{κ R_{\min}}{2 n L_{\max}} (1 - \frac{2 L_{\max}}{2 n} (1 + \frac{γ^{2} M^{2} e^{2}}{σ^{2} n}) (\frac{γ^{2} M^{2} e^{2}}{σ^{2} n})) \times E (Q^{σ *} - Q^{σ} (β^{l}))

Therefore,

Q^{σ *} - E (Q^{σ} (β^{l + 1})) = Q^{σ *} - E (Q^{σ} (β^{l})) - (E (Q^{σ} (β^{l + 1}) - E (Q^{σ} (β^{l})))) \leq η (Q^{σ *} - E (Q^{σ} (β^{l})))

Let us assume that $β_{[k]}^{*}$ is the optimal solution of the subproblem (4) denoted as:

β_{[k]}^{*} = \arg \max_{β_{[k]} \in R^{n_{k}}} Q^{σ} (β_{[k]}; \bar{w}) .

(28)

According to the derivation above, the local atomic solver has a linear convergence rate in expectation, that is,

Q^{σ} (β_{[k]}^{*}; \bar{w}) - E (Q^{σ} (β_{[k]}^{j + 1}; \bar{w})) \leq η (E (Q^{σ} (β_{[k]}^{*}; \bar{w}) - Q^{σ} (β_{[k]}^{j}; \bar{w})))

Recall that Θ = η^{RH}. Applying the inequality above repeatedly over the RH local updates gives

Q^{σ} (β_{[k]}^{*}; \bar{w}) - E (Q^{σ} (β_{[k]}^{R H}; \bar{w})) \leq η (Q^{σ} (β_{[k]}^{*}; \bar{w}) - E (Q^{σ} (β_{[k]}^{R H - 1}; \bar{w}))) \leq η^{2} (E (Q^{σ} (β_{[k]}^{*}; \bar{w}) - Q^{σ} (β_{[k]}^{R H - 2}; \bar{w}))) \leq \dots \leq Θ (Q^{σ} (β_{[k]}^{*}; \bar{w}) - E (Q^{σ} (β_{[k]}^{0}; \bar{w}))) .

Notice that $β_{[k]}^{0}$ is the starting point of the local atomic solver and $β_{[k]}^{R H}$ is the final value of β_[k] returned by the local atomic solver. So the following equations hold for the global problem:

β_{[k]}^{0} = α_{[k]}, \quad β_{[k]}^{R H} - β_{[k]}^{0} = δ_{[k]}, \quad β_{[k]}^{*} - β_{[k]}^{0} = δ_{[k]}^{*} .

Therefore, we have:

E [D_{k} (δ_{[k]}^{*}; v, α_{[k]}) - D_{k} (δ_{[k]}; v, α_{[k]})] \leq Θ [D_{k} (δ_{[k]}^{*}; v, α_{[k]}) - D_{k} (0; v, α_{[k]})]

with Θ = η^RH. □

4.2. Convergence of global solution

Although we have shown that the local subproblem solver outputs a Θ-approximate solution, we cannot directly apply the results of [11] to the global solution because our algorithm uses updates from only a subset of S ≤ K workers, unlike the synchronous all-reduce of the updates from all workers used in [11]. We need to handle this asynchronous nature of the global updates, just as we handled asynchronous updates for the local subproblem.

Let us consider the global updates in the order the master computed them (at global time t in Fig. 3). To prove convergence, it is customary to show that the global objective progresses sufficiently in each round, i.e., there is sufficient change from D(α^(t)) to D(α^(t+1)), where α^(t) denotes the value of the dual variable α distributed as $α_{[k]}^{(t)}$ across all the workers k at the time the master computed the tth global update v^(t). For simplicity we assume that each worker updates $α_{[k]}^{(t)}$ in step 11 of Algorithm 1 as soon as it receives the global update from the master. Thus, α^(t+1) = α^(t) + νδ^(t) where $δ^{(t)} = \sum_{k} δ_{[k]}^{(t)}$ and $δ_{[k]}^{(t)}$ denotes the increment to $α_{[k]}^{(t)}$ computed by worker k if $k \in P_{S}^{(t)}$ , 0 otherwise.

If $k \in P_{S}^{(t)}$ then the update $δ_{[k]}^{(t)}$ has already been included in v^(t). However, if $k \notin P_{S}^{(t)}$ then it may not be included. Let ξ be such that for all l ≤ ξ and for all k, $δ_{[k]}^{(l)}$ has been included in v^(t). By the design of our algorithm, t − Γ < ξ ≤ t. Let ${\hat{α}}^{(t)}$ be defined as follows: ${\hat{α}}_{[k]}^{(t)} = α_{[k]}^{(t)}$ , $\forall k \in P_{S}^{(t)}$ and $= α_{[k]}^{(ξ - 1)}$ for the latest (ξ − 1) for which the update is already included in global v, $\forall k \notin P_{S}^{(t)}$ . Let w^(t), ${\hat{w}}^{(t)}$ be w(α^(t)) and $w ({\hat{α}}^{(t)})$ respectively. Note that $w^{(t)} = {\hat{w}}^{(t)} + \frac{ν}{λ n} \sum_{l = ξ}^{t} X δ^{(l)}$ . For a vector expression (χ), let (χ)_i represent the ith element of the vector resulting from the expression (χ).

Lemma 8. For any dual α^(t), $δ^{(t)} \in R^{n}$ , primal ${\hat{w}}^{(t)} = w ({\hat{α}}^{(t)})$ and real values ν and σ satisfying (5), it holds that

D (α^{(t + 1)}) = D (α^{(t)} + ν \sum_{k \in P_{S}^{(t)}} δ_{[k]}^{(t)}) \geq (1 - ν) D (\hat{α}) + ν \sum_{k \in P_{S}^{(t)}} D_{k} (δ_{[k]}^{(t)}; α_{[k]}, \hat{w}) - \frac{λ}{2} (\frac{2 ν}{λ n} \sum_{k \notin P_{S}^{(t)}} w {(\hat{α})}^{⊺} X (\sum_{l = ξ}^{t} δ_{[k]}^{(l)}) + {(\frac{ν}{λ n})}^{2} {‖ \sum_{k \notin P_{S}^{(t)}} \sum_{c = ξ}^{t} X δ_{[k]}^{(c)} ‖}^{2}) - \frac{2 ν^{2}}{λ n^{2}} {(\sum_{k \notin P_{S}^{(t)}} \sum_{l = ξ}^{t} δ_{[k]}^{(l)})}^{⊺} X^{⊺} X \sum_{k \in P_{S}^{(t)}} δ_{[k]}^{(t)} - \frac{1}{n} \sum_{k \notin P_{S}^{(t)}} (\sum_{i \in I_{k}} ϕ_{i}^{*} ({(- \hat{α} - ν \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}_{i})) .

(29)

Proof. Assume that $I = ⋃_{k \in P_{S}} I_{k}$ . Then, we have

D (α^{(t)} + ν \sum_{k \in P_{S}^{(t)}} δ_{[k]}^{(t)}) = - \frac{1}{n} \sum_{i = 1}^{n} ϕ_{i}^{*} (- {\hat{α}}_{i} - ν {(\sum_{k \notin P_{S}^{(t)}} \sum_{c = ξ}^{t} δ_{[k]}^{(c)} + \sum_{k \in P_{S}^{(t)}} δ_{[k]}^{(t)})}_{i}) - \frac{λ}{2} {‖ \frac{1}{λ n} X (\hat{α} + ν \sum_{k \notin P_{S}^{(t)}} \sum_{c = ξ}^{t} δ_{[k]}^{(c)} + ν \sum_{k \in P_{S}^{(t)}} δ_{[k]}^{(t)}) ‖}^{2}
= - \frac{1}{n} \sum_{k \in P_{S}^{(t)}} (\sum_{i \in I_{k}} ϕ_{i}^{*} (- (1 - ν) {\hat{α}}_{i} - ν {(\hat{α} + δ_{[k]}^{(t)})}_{i})) - \frac{1}{n} \sum_{k \notin P_{S}^{(t)}} (\sum_{i \in I_{k}} ϕ_{i}^{*} ({(- \hat{α} - ν \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}_{i})) - \frac{λ}{2} ({‖ w (\hat{α}) ‖}^{2} + \frac{2 ν}{λ n} \sum_{k \in P_{S}^{(t)}} w {(\hat{α})}^{⊺} X δ_{[k]}^{(t)} + {(\frac{ν}{λ n})}^{2} {‖ \sum_{k \in P_{S}^{(t)}} X δ_{[k]}^{(t)} ‖}^{2}) - \frac{λ}{2} (\frac{2 ν}{λ n} \sum_{k \notin P_{S}^{(t)}} w {(\hat{α})}^{⊺} X (\sum_{l = ξ}^{t} δ_{[k]}^{(l)}) + {(\frac{ν}{λ n})}^{2} {‖ \sum_{k \notin P_{S}^{(t)}} \sum_{c = ξ}^{t} X δ_{[k]}^{(c)} ‖}^{2}) - \frac{2 ν^{2}}{λ n^{2}} {(\sum_{k \notin P_{S}^{(t)}} \sum_{l = ξ}^{t} δ_{[k]}^{(l)})}^{⊺} X^{⊺} X \sum_{k \in P_{S}^{(t)}} δ_{[k]}^{(t)}
\geq - \frac{1}{n} \sum_{k \in P_{S}^{(t)}} (\sum_{i \in I_{k}} ((1 - ν) ϕ_{i}^{*} (- {\hat{α}}_{i}) + ν ϕ_{i}^{*} (- {(\hat{α} + δ_{[k]}^{(t)})}_{i}))) - \frac{λ}{2} ({‖ \hat{w} ‖}^{2} + \frac{2 ν}{λ n} \sum_{k \in P_{S}^{(t)}} {\hat{w}}^{⊺} X δ_{[k]}^{(t)} + {(\frac{ν}{λ n})}^{2} {‖ \sum_{k \in P_{S}^{(t)}} X δ_{[k]}^{(t)} ‖}^{2}) - \frac{λ}{2} (\frac{2 ν}{λ n} \sum_{k \notin P_{S}^{(t)}} w {(\hat{α})}^{⊺} X (\sum_{l = ξ}^{t} δ_{[k]}^{(l)}) + {(\frac{ν}{λ n})}^{2} {‖ \sum_{k \notin P_{S}^{(t)}} \sum_{c = ξ}^{t} X δ_{[k]}^{(c)} ‖}^{2}) - \frac{2 ν^{2}}{λ n^{2}} {(\sum_{k \notin P_{S}^{(t)}} \sum_{l = ξ}^{t} δ_{[k]}^{(l)})}^{⊺} X^{⊺} X \sum_{k \in P_{S}^{(t)}} δ_{[k]}^{(t)} - \frac{1}{n} \sum_{k \notin P_{S}^{(t)}} (\sum_{i \in I_{k}} ϕ_{i}^{*} ({(- \hat{α} - ν \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}_{i})) (mean value theorem, concave)
= \underbrace{- \frac{1}{n} \sum_{k = 1}^{K} (\sum_{i \in I_{k}} (1 - ν) ϕ_{i}^{*} (- {\hat{α}}_{i})) - (1 - ν) \frac{λ}{2} {‖ w (\hat{α}) ‖}^{2}}_{(1 - ν) D (\hat{α})} + \frac{1}{n} \sum_{k \notin P_{S}^{(t)}} (\sum_{i \in I_{k}} (1 - ν) ϕ_{i}^{*} (- {\hat{α}}_{i})) + ν \sum_{k \in P_{S}^{(t)}} (- \frac{1}{n} \sum_{i \in I_{k}} ϕ_{i}^{*} (- {(\hat{α} + δ_{[k]}^{(t)})}_{i}) - \frac{1}{S} \frac{λ}{2} {‖ w (\hat{α}) ‖}^{2} - \frac{1}{n} w {(\hat{α})}^{⊺} X δ_{[k]}^{(t)} - \frac{λ}{2} σ {‖ \frac{1}{λ n} X δ_{[k]}^{(t)} ‖}^{2}) - \frac{λ}{2} (\frac{2 ν}{λ n} \sum_{k \notin P_{S}^{(t)}} w {(\hat{α})}^{⊺} X (\sum_{l = ξ}^{t} δ_{[k]}^{(l)}) + {(\frac{ν}{λ n})}^{2} {‖ \sum_{k \notin P_{S}^{(t)}} \sum_{c = ξ}^{t} X δ_{[k]}^{(c)} ‖}^{2}) - \frac{2 ν^{2}}{λ n^{2}} {(\sum_{k \notin P_{S}^{(t)}} \sum_{l = ξ}^{t} δ_{[k]}^{(l)})}^{⊺} X^{⊺} X \sum_{k \in P_{S}^{(t)}} δ_{[k]}^{(t)} - \frac{1}{n} \sum_{k \notin P_{S}^{(t)}} (\sum_{i \in I_{k}} ϕ_{i}^{*} ({(- \hat{α} - ν \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}_{i}))
= (1 - ν) D (\hat{α}) + ν \sum_{k \in P_{S}^{(t)}} D_{k} (δ_{[k]}^{(t)}; α_{[k]}, \hat{w}) + \frac{1 - ν}{n} \sum_{k \notin P_{S}^{(t)}} (\sum_{i \in I_{k}} ϕ_{i}^{*} (- {\hat{α}}_{i})) - \frac{λ}{2} (\frac{2 ν}{λ n} \sum_{k \notin P_{S}^{(t)}} w {(\hat{α})}^{⊺} X (\sum_{l = ξ}^{t} δ_{[k]}^{(l)}) + {(\frac{ν}{λ n})}^{2} {‖ \sum_{k \notin P_{S}^{(t)}} \sum_{c = ξ}^{t} X δ_{[k]}^{(c)} ‖}^{2}) - \frac{2 ν^{2}}{λ n^{2}} {(\sum_{k \notin P_{S}^{(t)}} \sum_{l = ξ}^{t} δ_{[k]}^{(l)})}^{⊺} X^{⊺} X \sum_{k \in P_{S}^{(t)}} δ_{[k]}^{(t)} - \frac{1}{n} \sum_{k \notin P_{S}^{(t)}} (\sum_{i \in I_{k}} ϕ_{i}^{*} ({(- \hat{α} - ν \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}_{i}))

Assumption 5 (Bounded Delay of Global Updates, Γ). There exists a $ϱ < e^{\frac{2}{Γ + 1}}$ such that

{‖ δ^{(t - 1)} ‖}^{2} \leq ϱ {‖ δ^{(t)} ‖}^{2} .

(30)
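Two elementary consequences of the condition on ϱ in Assumption 5 are worth writing out, since they explain where the factor (Γ + 1)e² in the later bounds can come from. This is a sketch of the arithmetic only, assuming ϱ > 0; it is not a step quoted from the paper.

```latex
\varrho < e^{\frac{2}{\Gamma + 1}}
\;\Longrightarrow\;
\varrho^{\Gamma + 1} < e^{2},
\qquad
\sum_{c = 0}^{\Gamma} \varrho^{c}
\;\le\; (\Gamma + 1) \max\{ 1, \varrho^{\Gamma} \}
\;<\; (\Gamma + 1)\, e^{2}.
% The last step uses \varrho^{\Gamma} < e^{2\Gamma/(\Gamma+1)} < e^{2} when \varrho \ge 1,
% and \varrho^{c} \le 1 < e^{2} when \varrho < 1.
```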

Lemma 9 (Global Convergence at Each Iteration). If $ϕ_{i}^{*}$ are all (1/μ)-strongly convex and Assumptions 2-5 are satisfied then for any s ∈ [0, 1], any round t of Algorithm 2 satisfies

E [D (α^{(t + 1)}) - D (α^{(t)})] \geq Ψ (1 - Θ) (s G (\hat{α}) - \frac{σ}{2 λ} {(\frac{s}{n})}^{2} \hat{R})

(31)

where

Ψ ≔ ν (1 - \frac{(Γ + 1) e^{2} M L_{\max}}{λ n} + \frac{S Γ M L_{\max}}{K λ n}) \leq 1, and

(32)

\hat{R} ≔ - \frac{λ μ n (1 - s)}{σ s} {‖ \hat{u} - \hat{α} ‖}^{2} + \sum_{k = 1}^{K} {‖ X {(\hat{u} - \hat{α})}_{[k]} ‖}^{2},

(33)

for $\hat{u} \in R^{n}$ with $- {\hat{u}}_{i} \in \partial ϕ_{i} (w {(\hat{α})}^{⊺} x_{i})$ .

Proof. For ease of notation, we write α instead of α^t, w instead of w(α^t), $\hat{w}$ instead of $w (\hat{α})$ and δ instead of δ^t.

Now, the expected change of the dual objective is

E [D (α^{t}) - D (α^{(t + 1)})] = E [D (α^{t}) - D (\hat{α}) + D (\hat{α}) - D (α^{(t + 1)})] = E [D (α^{t}) - D (\hat{α})] + E [D (\hat{α}) - D (α^{(t + 1)})]

Thus, it is the sum of two parts, which we estimate as follows:

E [D (α^{t}) - D (\hat{α})] = E [- \frac{1}{n} \sum_{i \notin I} ϕ_{i}^{*} (- {\hat{α}}_{i} - ν \sum_{c = ξ}^{t} δ^{(c)}) - \frac{1}{n} \sum_{k \in P_{S}} (\sum_{i \in I_{k}} ϕ_{i}^{*} (- {\hat{α}}_{i})) - \frac{λ}{2} {‖ \frac{1}{λ n} \sum_{k \notin P_{s}} X ({\hat{α}}_{[k]} + ν \sum_{c = ξ}^{t} δ_{[k]}^{(c)}) ‖}^{2} + \frac{1}{n} \sum_{k = 1}^{n} ϕ^{*} (- {\hat{α}}_{[k]}) + \frac{λ}{2} {‖ \hat{w} ‖}^{2}] = E [- \frac{1}{n} \sum_{i \notin I} ϕ_{i}^{*} (- {\hat{α}}_{i} - ν \sum_{c = ξ}^{t} δ^{(c)}) + \frac{1}{n} \sum_{k \notin P_{s}} (\sum_{i \in I_{k}} ϕ_{i}^{*} (- {\hat{α}}_{i})] - \frac{λ}{2} E [\frac{2 ν}{λ n} \sum_{k \notin P_{S}} w ({\hat{α}}^{⊺} X (\sum_{c = ξ}^{t} δ_{[k]}^{(i)}) + {(\frac{ν}{λ n})}^{2} {‖ \sum_{k \notin P_{S}} \sum_{c = ξ}^{t} X δ_{[k]}^{(c)} ‖}^{2}] E [D (\hat{α}) - D (α^{(t + 1)})] = E [D (\hat{α}) - D (α + ν \sum_{k \in P_{S}} δ_{[k]}^{(t)})] \leq E [D (α) - (1 - ν) D (\hat{α}) - ν \sum_{k \in P_{S}} D_{k} (δ_{[k]}^{(t)}; α_{[k]}, \hat{w}) + \frac{ν}{n} \sum_{k \notin P_{S}} (\sum_{i \in I_{k}} ϕ_{i}^{*} (- {\hat{α}}_{i}))] - E [- \frac{1}{n} \sum_{k \notin P_{S}} (\sum_{i \in I_{k}} ϕ_{i}^{*} {((- \hat{α} - ν \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}_{i})) + \frac{1}{n} \sum_{k \notin P_{S}} (\sum_{i \in I_{k}} ϕ_{i}^{*} (- {\hat{α}}_{i}))] + \frac{λ}{2} E [\frac{2 ν}{λ n} \sum_{k \notin P_{S}} w {(\hat{α})}^{⊺} X (\sum_{c = ξ}^{t} δ_{[k]}^{(c)}) + {(\frac{ν}{λ n})}^{2} {‖ \sum_{k \notin P_{S}} \sum_{c = ξ}^{t} X δ_{[k]}^{(c)} ‖}^{2}] + E [\frac{2 ν^{2}}{λ n^{2}} {(\sum_{k \notin P_{S}} \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}^{⊺} X^{⊺} X \sum_{k \in P_{S}} δ_{[k]}^{(t)}] (Lemma 8)

Therefore,

(sum of previous two inequalities) = ν E [D (α^{t}) - D (α^{(t + 1)})] = E [D (α^{t}) - D (α^{t})] + E [D (α) - D (α^{(t + 1)})] \leq E [D (α) - (1 - ν) D (α) - ν \sum_{k \in P_{s}} D_{k} (δ_{[k]}^{(t)}; α_{[k]}, w) + \frac{ν}{n} \sum_{k \notin P_{s}} (\sum_{i \in I_{k}} ϕ_{i}^{*} (- α_{i}))] + E [\frac{2 ν^{2}}{λ n^{2}} {(\sum_{k \notin P_{s}} \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}^{T} X^{T} X \sum_{k \in P_{s}} δ_{[k]}^{(t)}] (s u m o f p r e v i o u s t w o i n e q u a l i t y) = ν E [D (α) - \sum_{k = 1}^{K} D_{k} (δ_{[k]}^{*}; α_{[k]}, w) + \sum_{k = 1}^{K} D_{k} (δ_{[k]}^{*}; α_{[k]}, w) - \sum_{k = 1}^{K} D_{k} (0; α_{[k]}, w)] + E [\frac{2 ν^{2}}{λ n^{2}} {(\sum_{k \notin P_{s}} \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}^{T} X^{T} X \sum_{k \in P_{s}} δ_{[k]}^{(t)}] \leq ν (Θ (\sum_{k = 1}^{K} D_{k} (δ_{[k]}^{*}; α_{[k]}, w) - \underset{D (\hat{α})}{\underset{︸}{\sum_{k = 1}^{K} D_{k} (0; α_{[k]}, w)}}) + D (α) - \sum_{k = 1}^{K} D_{k} (δ_{[k]}^{*}; α_{[k]}, w)) + E [\frac{ν^{2}}{λ n^{2}} {(\sum_{c = ξ}^{t} δ_{[k]}^{(c)})}^{T} X^{T} X \sum_{k \in P_{s}} δ_{[k]}^{(t)}] = ν (1 - Θ) D (α) - \sum_{k = 1}^{K} D_{k} (δ_{[k]}^{*}; α_{[k]}, w) + E [\frac{2 ν^{2}}{λ n^{2}} {(\sum_{k \notin P_{s}} \sum_{c = ξ}^{t} δ_{[k]}^{(c)})}^{T} X^{T} X \sum_{k \in P_{s}} δ_{[k]}^{(t)}] \leq ν (1 - Θ) D (α) - \sum_{k \in P_{s}} D_{k} (δ_{[k]}^{*}; α_{[k]}, w) + \frac{ν}{λ n} \underset{A}{\underset{︸}{(E [{‖ \sum_{k \notin P_{s}} X (\sum_{c = ξ}^{t} δ_{[k]}^{(c)}) ‖}^{2}] + E [{‖ \sum_{k \in P_{s}} X δ_{[k]}^{(t)} ‖}^{2}])}} (a^{2} + b^{2} \geq 2 a b)

Before bounding the term A, we need the following proposition:

Proposition 10.

E [{‖ X δ_{[k]}^{(t)} ‖}^{2}] = \frac{1}{S} E [\sum_{k \in P_{s}} {‖ X δ_{[k]}^{(t)} ‖}^{2}] = \frac{1}{K} E [\sum_{k = 1}^{K} {‖ X δ_{[k]}^{(t)} ‖}^{2}] \forall k \in {1, \dots, K} .

Now, let us bound the term A. We have

A = E [{‖ \sum_{k \notin P_{s}} X (\sum_{c = ξ}^{t} δ_{[k]}^{(c)}) ‖}^{2}] + E [{‖ \sum_{k \in P_{s}} X δ_{[k]}^{(t)} ‖}^{2}] \leq E [(Γ + 1) {\sum_{c = t - Γ}^{t} ‖ \sum_{k \notin P_{s}} X δ_{[k]}^{c} ‖}^{2}] + E [S \sum_{k \in P_{s}} {‖ X δ_{[k]}^{(t)} ‖}^{2}] (Cauchy Schwarz Inequality) \leq E [\frac{(K - S) (Γ + 1) M}{K} \sum_{k = 1}^{K} \sum_{c = t - Γ}^{t - 1} {‖ δ_{[k]}^{c} ‖}^{2}] + E [\frac{S M}{K} \sum_{k = 1}^{K} {‖ δ_{[k]}^{(t)} ‖}^{2}] (Propositions 2 & 10) \leq E [\frac{(K - S) (Γ + 1) M}{K} (\sum_{c = t - Γ}^{t} ϱ^{c}) \sum_{k = 1}^{K} {‖ δ_{[k]}^{(t)} ‖}^{2}] + E [\frac{S M}{K} \sum_{k = 1}^{K} {‖ δ_{[k]}^{(t)} ‖}^{2}] (By (30)) \leq \frac{(K - S) (Γ + 1) M L_{\max}}{K} \sum_{c = t - Γ}^{t} ϱ^{c} (D (\hat{α}) - \sum_{k = 1}^{K} D_{k} (δ_{[k]}^{(t)}; α_{[k]}, \hat{w})) + \frac{S M L_{\max}}{K} \sum_{k = 1}^{K} (D (\hat{α}) - D_{k} (δ_{[k]}^{(t)}; α_{[k]}^{(t)}, \hat{w})) (Proposition 5)

where $D (\hat{α}) = D_{k} (0; α_{[k]}, \hat{w})$ . Thus, Eq. (9) can be rewritten as,

E [D_{k} (δ_{[k]}^{*}; α_{[k]}, \hat{w}) - D_{k} (δ_{[k]}^{(t)}; α_{[k]}, \hat{w})] \leq Θ (D_{k} (δ_{[k]}^{*}; α_{[k]}, \hat{w}) - D (\hat{α})) + D (\hat{α}) - D (\hat{α})
D (\hat{α}) - D_{k} (δ_{[k]}^{(t)}; α_{[k]}, \hat{w}) \leq (1 - Θ) D (\hat{α}) - (1 - Θ) D_{k} (δ_{[k]}^{*}; α_{[k]}, \hat{w})
D (\hat{α}) - D_{k} (δ_{[k]}^{(t)}; α_{[k]}, \hat{w}) \leq - (1 - Θ) (D_{k} (δ_{[k]}^{*}; α_{[k]}, \hat{w}) - D (\hat{α}))

(34)

Then, A can be bounded as,

A \leq - \frac{(K - S) (Γ + 1) M L_{\max}}{K} ϱ^{Γ + 1} (1 - Θ) \sum_{k = 1}^{K} (D (\hat{α}) - D_{k} (δ_{[k]}^{*}; α_{[k]}, \hat{w})) - \frac{S M L_{\max}}{K} (1 - Θ) \sum_{k = 1}^{K} (D (\hat{α}) - D_{k} (δ_{[k]}^{*}; α_{[k]}, \hat{w})) (By (34)) \leq - \frac{(K - S) (Γ + 1) e^{2} M L_{\max}}{2} (1 - Θ) \sum_{k = 1}^{K} (D (\hat{α}) - D_{k} (δ_{[k]}^{*}; α_{[k]}, \hat{w})) - \frac{S M L_{\max}}{K} (1 - Θ) \sum_{k = 1}^{K} (D (\hat{α}) - D_{k} (δ_{[k]}^{*}; α_{[k]}, \hat{w})) (Assumption 5)

By substituting A, we have

E [D (α^{t}) - D (α^{(t + 1)})] \leq ν (1 - \frac{(Γ + 1) e^{2} M L_{\max}}{λ n} + \frac{S Γ M L_{\max}}{K λ n}) (1 - Θ) \sum_{k = 1}^{K} (D (\hat{α}) - D_{k} (δ_{[k]}^{*}))
E [D (α^{t}) - D (α^{(t + 1)})] \leq Ψ (1 - Θ) \sum_{k = 1}^{K} (D (\hat{α}) - D_{k} (δ_{[k]}^{*}))

Using the Eq. C in the proof of Lemma 5 in [11], we can show that

E [D (α^{t}) - D (α^{(t + 1)})] \leq Ψ (1 - Θ) (- s G (\hat{α}) - \frac{1}{2 μ} (1 - s) s \frac{1}{n} {‖ \hat{u} - \hat{α} ‖}^{2} + \frac{σ}{2 λ} {(\frac{s}{n})}^{2} \sum_{k = 1}^{K} {‖ X {(\hat{u} - \hat{α})}_{[k]} ‖}^{2}) = Ψ (1 - Θ) (- s G (\hat{α}) + \frac{σ}{2 λ} {(\frac{s}{n})}^{2} \hat{R}) □

Remark. When S (the minimal number of workers whose updates are required before a global update is communicated) is fixed in (32), Ψ approaches 0 as Γ (the maximal delay allowed for the slowest worker) becomes larger, since (Γ + 1)e² > Γ. In other words, the improvement from D(α^(t)) to D(α^(t+1)) at iteration t becomes smaller when a larger delay is allowed among the workers.

When Γ is fixed in (32), Ψ will be larger when S is set larger. The improvement from D(α^(t)) to D(α^(t+1)) at iteration t will be more significant when more updates from different workers are taken into account. However, when Γ = 0, Ψ will be independent from S because all updates have to be gathered by the master.
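The dependence of Ψ on Γ and S described in the remark can be checked numerically from Eq. (32). The sketch below is only an illustration: the function name and the parameter values (ν = 1, K = 16, M = L_max = 1, λ = 10⁻⁴, n = 10⁶) are assumptions picked so that Ψ stays positive, not values reported in the paper.

```python
import math


def psi(nu, delay, S, K, M, L_max, lam, n):
    """Evaluate Psi from Eq. (32); `delay` plays the role of Gamma."""
    c = M * L_max / (lam * n)
    return nu * (1.0 - (delay + 1) * math.e ** 2 * c + S * delay * c / K)


if __name__ == "__main__":
    common = dict(nu=1.0, K=16, M=1.0, L_max=1.0, lam=1e-4, n=10**6)
    for delay in (0, 2, 4, 8):
        row = [round(psi(delay=delay, S=S, **common), 4) for S in (4, 8, 16)]
        print("Gamma =", delay, "Psi for S in (4, 8, 16):", row)
```

With these assumed values each printed row is non-decreasing in S, the rows shrink as Γ grows, and the Γ = 0 row is constant, which matches the three statements of the remark.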

Using the main results in [11], combining Lemma 7 with Lemma 9 yields the following two convergence results, one for smooth loss functions and the other for Lipschitz continuous loss functions. The theorems use the quantities σ_max = max_k σ_k, σ_sum = ∑_k σ_kn_k where ∀k, $σ_{k} = \max_{α_{[k]} \in R^{n}} {∥ X α_{[k]} ∥}^{2} ∕ {∥ α_{[k]} ∥}^{2}$ .

Theorem 11 (Global convergence for (1/μ)-Smooth Functions). If the loss functions ϕ_i are all (1/μ)-smooth, then in T₁ iterations Algorithm 2 finds a solution with objective at most ε_D from the optimal, i.e., $E [D (α^{*}) - D (α^{(T_{1})})] \leq ∊_{D}$ whenever $T_{1} \geq C_{1} \log \frac{1}{∊_{D}}$ where $C_{1} = \frac{1}{Ψ (1 - Θ)} (1 + \frac{σ_{\max} σ}{ν λ n})$ and Θ is given by (25). Furthermore, in T₂ iterations, it finds a solution with duality gap at most ϵ_gap, i.e., $E [P (w (α^{(T_{2})})) - D (α^{(T_{2})})] \leq ∊_{gap}$ whenever $T_{2} \geq C_{1} \log \frac{C_{1}}{∊_{D}}$ .

Theorem 12 (Global convergence for L-Lipschitz functions). If the loss functions ϕ_i are all L-Lipschitz, then in T₁ iterations Algorithm 2 finds a solution with duality gap at most ϵ_gap, i.e., $E [P (w (\bar{α})) - D (\bar{α})] \leq ∊_{gap}$ for the average iterate $\bar{α} = \frac{1}{T_{1} - T_{0}} \sum_{t = T_{0} + 1}^{T_{1} - 1} α^{(t)}$ whenever $T_{1} \geq T_{0} + \max \{ ⌈ \frac{1}{Ψ (1 - Θ)} ⌉, \frac{4 L^{2} σ_{sum} σ}{λ n^{2} ∊_{gap} Ψ (1 - Θ)} \}$ , and $T_{0} \geq \max \{ 0, ⌈ \frac{1}{Ψ (1 - Θ)} \log \frac{2 λ n^{2} (D (α^{*}) - D (α^{(0)}))}{4 L^{2} σ_{sum} σ} ⌉ \} + \max \{ 0, \frac{2}{Ψ (1 - Θ)} (\frac{8 L^{2} σ_{sum} σ}{λ n^{2} ∊_{gap}} - 1) \}$ and Θ is given by (25).

Theorem 12 establishes the convergence for L-Lipschitz continuous loss functions, and Theorem 11 proves a linear convergence rate for smooth convex loss functions.

5. Experimental Results

We implemented our algorithm in C++ using Open MPI and OpenMP. All our experiments were conducted on the Biowulf cluster at the National Institutes of Health, USA, using up to K = 16 nodes, where each node has 58 GB of main memory and R = 16 cores, each with a 2.6 GHz Xeon E5-2650v2 processor and 20 MB of secondary cache. Although the cores in the cluster are hyperthreading-enabled, we did not use hyperthreading in our experiments. Each node in the cluster runs exactly one MPI task corresponding to a worker, which in turn runs one OpenMP thread on each core available within the node. The main thread in a worker handles the inter-node communication, and the root MPI task works as the master. For MPI node scheduling we used the Slurm task scheduler with the settings --ntasks-per-node=1 and --threads-per-core=1, whereas for OpenMP thread scheduling we used a simple CPU affinity scheduler that always assigned thread i to physical core i, where i ∈ {1, …, 16}.

5.1. Datasets

We evaluated our algorithm against three other algorithms on four binary classification datasets, rcv1, webspam, kddb and splicesite, from the LIBSVM [3] website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), as shown in Table 1. For datasets rcv1 and kddb we used separate training and test data files downloaded directly from the website. As there was no separate test data file for webspam, we divided the data file into two parts and used the first part, containing 80% of the datapoints, as the training set and the remainder as the test set. For splicesite we used the test file on the website as our training set (to better test the scalability of our algorithm, as the test file was bigger). We chose the datasets so that we had representatives of several scenarios: the sample-heavy rcv1 where n ≫ d, the feature-heavy webspam where n ≪ d, the sample- and feature-heavy kddb where n and d are both large (n ≈ d), and the big dataset splicesite, which was denser in both dimensions.

Table 1.

Datasets.

Dataset details	rcv1	webspam	kddb	splicesite
Training set	rcv1_train.binary	webspam_*_trigram	kddb	splice_site.t
File size	1.2 GB	20 GB	5.1 GB	280 GB
Training size n	677,399	280,000	19,264,097	4,627,840
Number of features d	47,236	16,609,143	29,890,095	11,725,480
Non-zero entries nnz	49,556,258	1,045,051,224	566,345,888	15,383,587,858

Test set	rcv1_test.binary	webspam_*_trigram	kddb.t	-
Test size	677,399	69,632	748,401	-


5.2. Comparison of algorithms

We experimented with the following four algorithms:

Baseline: a sequential implementation of stochastic dual coordinate ascent (DCA) [6] which runs on a single core of a single node
CoCoA+: an MPI-based distributed implementation of stochastic DCA [11] which runs on multiple nodes, with each node using a single core
PassCoDe: an OpenMP-based parallel implementation of stochastic DCA [7] which runs on a single node using multiple cores
Hybrid-DCA: an OpenMP+MPI implementation of our hybrid parallel distributed approach for stochastic DCA which runs on multiple nodes and each node uses multiple cores.
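All four algorithms share the same basic coordinate update. As a rough illustration of what the sequential Baseline does for the hinge loss, the sketch below performs one pass of stochastic dual coordinate ascent using the standard closed-form hinge-loss update from the SDCA literature. It is not the authors' C++ code; the function name, the list-of-rows data layout (rows of X are samples) and the toy data are assumptions.

```python
import random


def sdca_hinge_epoch(X, y, alpha, w, lam, rng):
    """One pass of n random dual coordinate updates for the hinge loss.

    Maintains w = (1 / (lam * n)) * sum_i alpha[i] * X[i] and keeps
    alpha[i] * y[i] inside [0, 1]. alpha and w are modified in place.
    """
    n = len(X)
    for _ in range(n):
        i = rng.randrange(n)
        xi, yi = X[i], y[i]
        sq_norm = sum(v * v for v in xi)
        if sq_norm == 0.0:
            continue  # an all-zero sample cannot change w
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        # Closed-form maximizer of the dual in coordinate i, clipped to the box [0, 1].
        new_ay = min(1.0, max(0.0, alpha[i] * yi + (1.0 - margin) * lam * n / sq_norm))
        delta = yi * new_ay - alpha[i]
        alpha[i] += delta
        for j in range(len(w)):
            w[j] += delta * xi[j] / (lam * n)


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # toy data (assumed)
    y = [1, 1, -1, -1]
    alpha, w = [0.0] * 4, [0.0, 0.0]
    rng = random.Random(0)
    for _ in range(10):
        sdca_hinge_epoch(X, y, alpha, w, lam=0.1, rng=rng)
    print("alpha =", alpha, "w =", w)
```

In the parallel and distributed variants, the difference lies in who executes these updates and when the shared w is refreshed, not in the update formula itself.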

5.3. Parameter settings

We evaluated all four algorithms for the hinge loss, though other loss functions could be tested too, with the regularization parameter λ. In our experiments with three values λ ∈ {10⁻³, 10⁻⁴, 10⁻⁵}, we observed similar patterns of results, so we report the results for λ = 10⁻⁴ only. All three parallel/distributed algorithms, namely PassCoDe, CoCoA+ and Hybrid-DCA, have a global parameter G denoting the number of basic updates to the dual variable α that are made in each global round. The parameter G acts as a tradeoff between the progress on the dual objective and the time taken in each global round. In the original implementation in [7], PassCoDe sets G equal to the number of datapoints n in the dataset, whereas the CoCoA+ implementation in [11] uses a G smaller than n.

We experimented with different values of G and found that G = 40000, 30000, 2000000 for the datasets rcv1, webspam, kddb, respectively, gave the best results in our empirical study. In our implementation, for PassCoDe on t cores each thread made about H = G/t local updates, for CoCoA+ on p nodes each MPI task made H = G/p local updates, and for Hybrid-DCA on p nodes each with t cores, each thread within an MPI task made H = G/(p × t) local updates, so that all three algorithms made a total of G updates in a global round. Though the sequential algorithm Baseline did not have any local iteration parameter, for better comparison we computed performance metrics such as time taken after every G updates and treated such G updates as a global round. We set the aggregation parameter ν = 1 and the scaling parameter σ = K for both CoCoA+ and Hybrid-DCA, as recommended in [11].
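The bookkeeping for H can be written down in a few lines. The helper below is a hypothetical illustration (its name and the use of integer division are assumptions; the text only says "about" H updates), evaluated with the G reported for rcv1 and 16 worker cores in total.

```python
def local_updates_per_thread(G, nodes, threads_per_node):
    """H = G / (p * t): local updates per thread so that one global round makes about G updates."""
    if nodes < 1 or threads_per_node < 1:
        raise ValueError("nodes and threads_per_node must be positive")
    return G // (nodes * threads_per_node)


if __name__ == "__main__":
    G = 40000  # value reported for rcv1
    print("PassCoDe   (1 node, 16 cores):", local_updates_per_thread(G, 1, 16))
    print("CoCoA+     (16 nodes, 1 core):", local_updates_per_thread(G, 16, 1))
    print("Hybrid-DCA (4 nodes, 4 cores):", local_updates_per_thread(G, 4, 4))
```

With 16 worker cores in every configuration, each thread performs the same number of local updates per round, so differences in running time come from communication and synchronization rather than from the amount of work.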

Our Hybrid-DCA algorithm has additional parameters, the bounded barrier S and bounded delay Γ so that updates from only S out of p workers are incorporated in each global round with a maximum delay of Γ rounds for any update from the workers. In our implementation of Hybrid-DCA, we treated the two cases slightly differently: synchronous Hybrid-DCA with S = p where the updates from the workers were merged using the collective operation MPI_Allreduce, and asynchronous Hybrid-DCA with S < p where the updates from a subset of S out of p workers were merged and distributed using basic MPI “send and receive” commands. For all our experiments, we ran algorithms 10 times for each setting and reported the average of measured values.
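The interplay of the bounded barrier S and the bounded delay Γ can be mimicked with a toy scheduler. The simulation below is a simplified model written for illustration, not the paper's Algorithm 2 or its MPI implementation: arrival order is replaced by a random shuffle, and the rule "wait for any worker whose last merge is Γ rounds old" is an assumption about how the delay bound could be enforced.

```python
import random


def simulate_rounds(p, S, delay, rounds, rng):
    """Return, for each global round, the sorted list of workers whose updates are merged.

    Each round merges at least S workers, and no worker goes more than
    `delay` rounds between two consecutive merges.
    """
    last_merged = [0] * p  # round in which each worker was last merged
    history = []
    for t in range(1, rounds + 1):
        # Workers that have reached the delay bound must be waited for.
        forced = [k for k in range(p) if t - last_merged[k] >= delay]
        others = [k for k in range(p) if k not in forced]
        rng.shuffle(others)  # stand-in for the order in which updates arrive
        merged = forced + others[:max(0, S - len(forced))]
        for k in merged:
            last_merged[k] = t
        history.append(sorted(merged))
    return history


if __name__ == "__main__":
    for merged in simulate_rounds(p=4, S=3, delay=4, rounds=8, rng=random.Random(1)):
        print(merged)
```

In this model, setting the delay to 1 forces every worker into every round regardless of S, which corresponds to the fully synchronous case.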

5.4. Optimization performance

Fig. 5 shows the progress of the duality gap achieved by the four algorithms on the three relatively smaller datasets rcv1, webspam and kddb. We chose the number of nodes (p ≤ K) and the number of cores per node (t ≤ R) such that the total number of worker cores (p × t) was the same (16) for all algorithms except Baseline. For Hybrid-DCA we set the bounded barrier S = p and the delay Γ = 1 so that updates from all workers were incorporated in each global round. However, for p = t = 4 we also experimented with S = 3 and Γ = 4 to compare how Hybrid-DCA performed when updates from only 3 out of 4 workers were incorporated in each round with a maximum delay of 4 rounds for any update from the workers. The duality gap was measured as P(w) − D(α) where the primal estimate w was computed as (w = v) w = w(α) at the end of each global round. However, when S < K it was not possible for the master in Hybrid-DCA to gather the parts of P(w) from all workers at the end of each global round. To work around this, we let the master temporarily store w on disk after each round; at the end of all stipulated rounds, the workers computed their respective parts of P(w) from the stored w and the master computed the duality gap using a series of synchronous all-reduce computations over all the workers.
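For concreteness, the duality gap for the hinge loss can be computed as in the following single-machine sketch. It assumes the standard SDCA formulation with labels y_i ∈ {−1, +1} and dual variables α_i ∈ [0, 1], namely P(w) = (1/n) Σ max(0, 1 − y_i x_iᵀw) + (λ/2)‖w‖², D(α) = (1/n) Σ α_i − (λ/2)‖w(α)‖² and w(α) = (1/(λn)) Σ α_i y_i x_i. The scaling used in the paper's equation (1) may differ, and the function name `duality_gap` is hypothetical.

```python
def duality_gap(X, y, alpha, lam):
    """Return P(w(alpha)) - D(alpha) for an L2-regularized hinge-loss SVM.

    X is a list of dense feature vectors, y a list of +1/-1 labels,
    alpha a list of dual variables in [0, 1] and lam the regularization
    parameter (assumed formulation, see the lead-in).
    """
    n, d = len(X), len(X[0])
    # Primal estimate from the dual: w = (1 / (lam * n)) * sum_i alpha_i y_i x_i
    w = [0.0] * d
    for xi, yi, ai in zip(X, y, alpha):
        for j in range(d):
            w[j] += ai * yi * xi[j] / (lam * n)
    sq_norm = sum(wj * wj for wj in w)
    # In the distributed setting each worker would sum the hinge terms of its
    # own datapoints and the partial sums would be combined by an all-reduce.
    hinge = sum(
        max(0.0, 1.0 - yi * sum(wj * xj for wj, xj in zip(w, xi)))
        for xi, yi in zip(X, y)
    )
    primal = hinge / n + 0.5 * lam * sq_norm
    dual = sum(alpha) / n - 0.5 * lam * sq_norm
    return primal - dual


X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
print(duality_gap(X, y, [0.0, 0.0], lam=0.1))  # gap at the all-zero starting point
print(duality_gap(X, y, [1.0, 1.0], lam=0.1))
```

By weak duality the returned value is never negative, so it can be used directly as a stopping criterion.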

The bottom row of Fig. 5 shows the progress of the duality gap over time, while the top row shows the progress after each global round. In terms of progress across global rounds, the algorithms performed somewhat equally, except for S < p, where Hybrid-DCA made slightly slower progress on the duality gap because the updates from one of the workers were missing in each round. In terms of time, there was no clear winner among the three parallel/distributed algorithms. CoCoA+ showed an advantage over PassCoDe on datasets with a smaller number of datapoints n, such as rcv1 and webspam, because the costly inter-node communication complexity was O(n) per round. For kddb, where the dataset had more non-zero data elements, PassCoDe performed better. For all three datasets the performance of Hybrid-DCA came in between CoCoA+ and PassCoDe by balancing inter-core and inter-node communications. Since the asynchronous Hybrid-DCA with S < p missed updates from some of the workers, it took longer, as expected, than synchronous Hybrid-DCA with S = p.

5.5. Test accuracy

To compare the quality of the solution obtained by each algorithm, we used test datasets from the LIBSVM binary repository, the details of which are given in Table 1 and in Section 5.1. Note that the webspam dataset in LIBSVM did not have a separate test dataset. We divided it into two parts and took the first 280,000 datapoints for training and the remaining 69,632 datapoints for testing. For each of the empirical datasets, we computed the accuracy, i.e., the fraction of test datapoints that were classified correctly by each of the algorithms, and plotted the results in Fig. 6. In terms of the progress on accuracy across rounds, all the algorithms performed somewhat equally except for Hybrid-DCA with S < p, which missed updates from one of the workers. Though CoCoA+ apparently worked better on kddb, the margin was about 0.0005. In the long run, all algorithms reached the same accuracy level. The progress on accuracy across time can be explained by the progress on duality gap across time, as discussed in Section 5.4.

Fig. 6. — Performance of different solvers on three datasets, rcv1 (left column), webspam (middle column), and kddb (right column), in terms of the progress of the test accuracy across the number of rounds (top row) and across the wall time taken (bottom row).

5.6. Speedup

Speedup evaluates the improvement in performance of an algorithm as a function of the number of cores used. However, the exact definition of speedup varies in the literature. For example, PassCoDe computes speedup as the improvement in runtime to execute a fixed number of rounds, whereas CoCoA+ uses a more refined notion of speedup for an optimization problem like RRM defined in equation (1) and computes speedup as the improvement in runtime to achieve a fixed level of duality gap. In our experiments, we evaluated the algorithms using both notions of speedup. Let TR_A(p, t, r) and TD_A(p, t, ϵ) denote the time taken by an algorithm A using p nodes each with t cores to complete r rounds and to achieve duality gap ϵ, respectively. We formally define the two notions of speedup as follows²:

\mathrm{Speedup}_{A}(p, t, r) = \frac{TR_{\mathrm{Baseline}}(1, 1, r)}{TR_{A}(p, t, r)}, \quad \text{and} \quad \mathrm{Proficiency}_{A}(p, t, ϵ) = \frac{TD_{\mathrm{Baseline}}(1, 1, ϵ)}{TD_{A}(p, t, ϵ)}

We ran sufficient rounds (≥ 100) of each of the four algorithms and computed Speedup_A(p, t, 100) and Proficiency_A(p, t, 10⁻³) for all algorithms A except Baseline, as shown in the top two rows of Fig. 7. PassCoDe can be run only on a single node, so we varied only the number of cores. Because CoCoA+ could use only 1 core per node, we ran CoCoA+ and Hybrid-DCA with t = 1 core on p ∈ {1, 2, 4, 8, 12, 16, 20, 24, 28, 31} nodes and plotted the results separately. We also ran synchronous Hybrid-DCA on p ∈ {2, 4, 8} nodes each with t ∈ {2, 4, 6, 8, 10, 12, 13, 14, 15, 16} cores, S = p and Γ = 1, and plotted the results separately for each fixed p and varying t. For p = 8 nodes and the same set of possible numbers of cores, we additionally plotted the results for asynchronous Hybrid-DCA with S = 6, Γ = 4.
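The two measures can be computed from per-round logs as sketched below, taking each ratio as the Baseline time divided by the algorithm's time so that values above 1 mean the algorithm is faster. The trace format (a list of `(elapsed_seconds, duality_gap)` pairs, one per global round) and all function names are assumptions for illustration, and the numbers in the example are synthetic, not measurements from the paper.

```python
def time_for_rounds(trace, r):
    """Wall time at the end of round r (1-based) of a per-round trace."""
    return trace[r - 1][0]


def time_to_gap(trace, eps):
    """Wall time of the first round whose duality gap is at most eps, else None."""
    for elapsed, gap in trace:
        if gap <= eps:
            return elapsed
    return None


def speedup(baseline_trace, algo_trace, r):
    """Ratio of Baseline time to algorithm time for completing r rounds."""
    return time_for_rounds(baseline_trace, r) / time_for_rounds(algo_trace, r)


def proficiency(baseline_trace, algo_trace, eps):
    """Ratio of Baseline time to algorithm time for reaching duality gap eps."""
    tb = time_to_gap(baseline_trace, eps)
    ta = time_to_gap(algo_trace, eps)
    if tb is None or ta is None:
        return None  # one of the runs never reached the target gap
    return tb / ta


# Synthetic traces: the parallel run is faster per round but needs one extra
# round to reach the same gap.
baseline = [(10.0, 1e-1), (20.0, 1e-2), (30.0, 1e-3)]
parallel = [(2.0, 2e-1), (4.0, 2e-2), (6.0, 2e-3), (8.0, 1e-3)]
print(speedup(baseline, parallel, r=3))
print(proficiency(baseline, parallel, eps=1e-3))
```

In this synthetic example the round-based ratio exceeds the gap-based ratio, because merged updates make less progress per round than the sequential updates do.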

Fig. 7. — Speedup of different parallel or distributed solvers with respect to the sequential implementation *Baseline*.

Our first observation on the results of the speedup experiments was that, though proficiency and speedup were computed differently, they followed a similar trend. In fact, speedup was slightly higher than proficiency, as the merging of parallel/distributed updates in every global round reduced the quality of the merged update in comparison with the pure updates of Baseline. We also observed that the speedup and proficiency of the algorithms followed a trend similar to the performance seen in Section 5.4, i.e., CoCoA+ performed better on datasets that were either sample-heavy or feature-heavy but not both, PassCoDe performed better on the dataset that was both sample-heavy and feature-heavy, and Hybrid-DCA performed in between. Furthermore, asynchronous Hybrid-DCA ran slower than synchronous Hybrid-DCA.

While investigating why Hybrid-DCA was slower than we expected, we observed that the overhead of OpenMP in maintaining the parallel threads was significant on the datasets rcv1 and webspam, as is evident from the performance of CoCoA.t1 and Hybrid-DCA.t1, whose only difference was the overhead of maintaining a single OpenMP thread. However, this overhead was relatively small in comparison with the overall computation on the sample- and feature-heavy kddb dataset. We also noticed that CoCoA+ and PassCoDe did not scale well beyond 20 nodes and 14 cores, respectively.

Further investigation revealed two drawbacks of our implementation of MPI-based inter-node communication. Firstly, MPI was inherently single-threaded. Even for t > 1, for the whole duration when the workers in Hybrid-DCA sent their local updates to the master and received back the merged global update from the master, only the main thread in the workers was active and all other threads were idle. Though there has been academic research such as [21] on making MPI communications multi-threaded by taking advantage of OpenMP threads available within the same task, this research has not been incorporated in the standard MPI implementations. This inherent drawback prevented Hybrid-DCA from taking full advantage of all the cores available. The second drawback was the way we implemented asynchronous inter-node data transfers in Hybrid-DCA for S < p. In the absence of a collective operation for receiving updates from only a subset of all workers in the standard MPI implementation, we implemented such a collective operation using primitive MPI “send and receive” operations, which lacked the performance of the optimized MPI collectives, such as MPI_Allreduce, that CoCoA+ utilized.

To mask the effect of the two drawbacks in MPI, we drew the plots for speedup and proficiency, as shown in the bottom two rows of Fig. 7, after ignoring the time involved in MPI communications. Though CoCoA+ also received the advantage of ignoring MPI time and showed better performance for the datasets rcv1 and webspam, it was clearly outperformed by Hybrid-DCA on the sample-heavy kddb dataset. We also observed better performance of asynchronous Hybrid-DCA than synchronous Hybrid-DCA. In summary, as expected theoretically, we see almost the same speedup for a fixed total number of p × t cores irrespective of the individual values of p and t. The drop in performance of Hybrid-DCA on t > 14 cores could be due to 1) the increase in the time taken by an atomic memory update as the number of threads increases, and 2) the delay bound of local updates possibly violating the assumption in (16) for a higher number of threads. Both aspects of atomic memory writes for a higher number of threads have been investigated and worked around in a follow-up paper [26] to the original PassCoDe paper [7]; however, the incorporation of a similar fix in our implementation is out of scope for this paper.

We ignored MPI time in the computation of speedup and proficiency only for the experiments described in this section; for the experiments elsewhere, we include MPI time when measuring the wall time.

5.7. Effects of the parameter S

Fig. 8 shows the results of varying S ∈ {2, 3, 4, 6, 8} with fixed Γ = 10 on p = 8 nodes each with t = 8 cores. When S < p/2, only a minority of the workers contributed in a round and the duality gap did not progress smoothly below a certain level. On the other hand, when at least half of the workers contributed in each round, it was possible to achieve the same duality gap obtained using all the workers. However, the reduction in time per round was eventually offset by the larger number of rounds required to achieve the same duality gap. Nevertheless, as Section 5.9 shows, the approach is useful for HPC platforms with heterogeneous nodes, unlike ours, where waiting for updates from all workers incurs a larger penalty per round, or for cases where the goal is to run for a specified number of rounds and quickly achieve a reasonably good duality gap.

5.8. Effects of the parameter Γ

Fig. 9 shows the results of varying Γ ∈ {1, 2, 3, 4, 10} with fixed S = 6 on p = 8 nodes each with t = 8 cores. We did not see much effect of Γ, as the HPC platform used for our experiments had homogeneous nodes. Our experiments showed that even with Γ = 10, the value at any worker was stale for at most 4 rounds. We expect to see a larger variance in staleness in the case of heterogeneous nodes.

5.9. Effects of workload and processing power

When the HPC system had homogeneous nodes, our experimental results showed that the performance of asynchronous Hybrid-DCA did not change much when varying S and Γ. To show the usefulness of asynchronous Hybrid-DCA on heterogeneous systems, we introduced imbalance in the setup for an experiment on p = 4 nodes each with t = 4 cores in two different ways: 1) by varying the workload and 2) independently, by varying the processing power. To vary the workload, instead of distributing the datapoints equally over all 4 nodes, we loaded one of the nodes heavily by distributing the datapoints in the ratio 1:1:1:10. Similarly, to see the effect of heterogeneous processing speed, we introduced 10 seconds of delay in one of the nodes.
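The imbalanced workload can be reproduced with a simple partitioning helper. The sketch below assumes that each node receives a contiguous range of datapoint indices; the text does not say how the datapoints were assigned, and the name `split_by_ratio` is hypothetical.

```python
def split_by_ratio(n, ratios):
    """Split indices 0..n-1 into contiguous half-open ranges sized by `ratios`.

    Returns a list of (start, end) pairs, one per node.
    """
    total = sum(ratios)
    sizes = [n * r // total for r in ratios]
    # Hand the remainder left by integer division to the last part.
    sizes[-1] += n - sum(sizes)
    bounds, start = [], 0
    for size in sizes:
        bounds.append((start, start + size))
        start += size
    return bounds


# Four nodes, one of them loaded ten times as heavily as the others.
for node, (lo, hi) in enumerate(split_by_ratio(13000, [1, 1, 1, 10])):
    print(node, lo, hi, hi - lo)
```

With 13,000 datapoints the first three nodes each hold 1,000 of them and the fourth holds 10,000, so a synchronous round cannot finish before the heavily loaded node does.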

We compared the performance of synchronous Hybrid-DCA (S = 4) and asynchronous Hybrid-DCA (S = 3, Γ = 10) in the two imbalanced scenarios as well as in the usual balanced-load, homogeneous-speed scenario by plotting the progress on duality gap across rounds and wall time, as shown in Fig. 10. For both the imbalanced-load and heterogeneous-speed scenarios, the performance of synchronous Hybrid-DCA degraded significantly, both across rounds and across time. However, asynchronous Hybrid-DCA was able to mitigate the effects of imbalance completely in terms of performance across rounds in both scenarios, and across wall time in the heterogeneous-speed scenario. For imbalanced load, asynchronous Hybrid-DCA was not able to improve performance across wall time, as the time taken by the heavily loaded node dominated the overall time.

Fig. 10. — Effect of varying workload and processing power on p = 4 worker nodes.

5.10. Performance on a big dataset

We tested our hybrid algorithm on the big dataset splicesite, of size about 280 GB, and compared it with the previous best algorithm, CoCoA+. Because of its enormous size, the dataset could not be accommodated on a single node and hence PassCoDe could not be run on it. In this experiment, we set the global parameter G = 1,000,000. The results are shown in Fig. 11, where the progress of the duality gap across the rounds of communication is shown on the left and across the wall time on the right. To achieve a duality gap of 10⁻⁵ on 16 nodes, CoCoA+ took about 165 seconds. On the other hand, Hybrid-DCA on 16 nodes each using 12 cores took about 29 seconds to achieve the same duality gap, an approximately 6-fold improvement that provides evidence of the scalability of our algorithm. One could also argue that CoCoA+ can be run on all these 16 × 12 = 192 cores, treating each core as a distributed node. However, when we experimented with this mode of CoCoA+, we found that CoCoA+ did not reach the duality gap of 10⁻⁵ in the stipulated 100 rounds. We also ran CoCoA+ on 16 × 8 = 128 cores, treating each core as a node, and found that although it performed better than on 16 × 1 cores in the initial few rounds, it performed worse in the later rounds. Moreover, it was outperformed by Hybrid-DCA in terms of both the number of rounds and the time taken, even on 16 nodes each using 8 cores.

Fig. 11. — Performance of *Hybrid-DCA* on big dataset splicesite.

6. Conclusions

In this paper, we have presented a hybrid parallel and distributed asynchronous stochastic dual coordinate ascent algorithm utilizing modern HPC platforms with many nodes of multi-core shared-memory systems. We analyzed the convergence properties of this novel algorithm, which uses asynchronous updates at two cascading levels: inter-core and inter-node. Experimental results show that our algorithm is faster than the state-of-the-art distributed algorithms and scales better than the state-of-the-art parallel algorithms. The effectiveness of our approach in practical implementations can be increased further by a combination of (1) reducing the overhead of OpenMP threads, (2) incorporating a multi-threaded implementation [21] of MPI operations, and (3) fixing the issue of larger delays in atomic memory writes for a larger number of threads, as described in [26].

Acknowledgment

This work was partially supported by a National Science Foundation grant CCF-1514357, and a grant from National Institutes of Health 5K02DA043063-04 to J. Bi and two National Science Foundation grants NSF-1743418 and NSF-1843025 to S. Rajasekaran. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov).

Biographies

Soumitra Pal received a B.E. degree in Computer Science and Technology from Bengal Engineering and Science University, Howrah, India in 2000 and M.Tech. and Ph.D. degrees in Computer Science and Engineering from IIT Bombay in 2007 and 2013, respectively. He was a postdoctoral research fellow at the University of Connecticut, USA during 2014–2016. He also worked with Texas Instruments, India as a Software Design Engineer during 2000–2005. His research interests are in algorithm design, optimization, bioinformatics and parallel and distributed systems.

Tingyang Xu is a Ph.D. student in the Department of Computer Science and Engineering at the University of Connecticut. Since 2011, he has been a member of the Laboratory of Machine Learning and Health Informatics led by Professor Jinbo Bi. His main research interests are machine learning for longitudinal and time series data and high performance computing for optimization problems. He is also a student member of the Institute of Electrical and Electronics Engineers (IEEE).

Tianbao Yang is an Assistant Professor in the Computer Science Department at the University of Iowa. He received the Ph.D. degree in Computer Science from Michigan State University in 2012. He worked as a researcher at GE Global Research from 2012 to 2013 and at NEC Laboratories America, Inc. from 2013 to 2014. He has broad interests in machine learning and has focused on several research topics, including social network analysis and large scale optimization in machine learning. He won the Mark Fulk Best Student Paper Award at the 25th Conference on Learning Theory (COLT) in 2012. He has also served as a program committee member for several conferences, including AAAI’15, AAAI’12, CIKM’12, ’13, IJCAI’13, ACML’12.

Sanguthevar Rajasekaran received his M.E. degree in Automation from the Indian Institute of Science (Bangalore) in 1983, and his Ph.D. degree in Computer Science from Harvard University in 1988. Currently he is the UTC Chair Professor of Computer Science and Engineering at the University of Connecticut and the Director of the Booth Engineering Center for Advanced Technologies (BECAT). Before joining UConn, he served as a faculty member in the CISE Department of the University of Florida and in the CIS Department of the University of Pennsylvania. During 2000–2002 he was the Chief Scientist for Arcot Systems. His research interests include Parallel Algorithms, Bioinformatics, Data Mining, Randomized Computing, Computer Simulations and Combinatorial Optimization. He has published over 150 articles in journals and conferences. He has co-authored two texts on algorithms and co-edited four books on algorithms and related topics. He is an IEEE Fellow and an elected member of the Connecticut Academy of Science and Engineering.

Jinbo Bi received a Ph.D. degree in mathematics from Rensselaer Polytechnic Institute, USA, and a master's degree in Electrical Engineering and Automatic Control from Beijing Institute of Technology, China. She is an associate professor of Computer Science and Engineering at the University of Connecticut. Prior to her current appointment, she worked with Siemens Medical Solutions on computer aided diagnosis research and with Partners Healthcare on clinical decision support systems. Her research interests include machine learning, data mining, bioinformatics and biomedical informatics.

Footnotes

² We use the term ‘proficiency’ to differentiate it from the related term ‘efficiency’, which is widely used in the literature.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Agarwal Alekh, Chapelle Olivier, Dudik Miroslav, Langford John. A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15(1):1111–1133, 2014.
[2] Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, 2011.
[3] Chang Chih-Chung, Lin Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[4] Fan Rong-En, Chang Kai-Wei, Hsieh Cho-Jui, Wang Xiang-Rui, Lin Chih-Jen. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[5] Heinze Christina, McWilliams Brian, Meinshausen Nicolai. DUAL-LOCO: Distributing statistical estimation using random projections. In: Gretton Arthur, Robert Christian C. (Eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 51, PMLR, Cadiz, Spain, 2016, pp. 875–883.
[6] Hsieh Cho-Jui, Chang Kai-Wei, Lin Chih-Jen, Keerthi S. Sathiya, Sundararajan Sellamanickam. A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning (ICML), 2008, pp. 408–415.
[7] Hsieh Cho-Jui, Yu Hsiang-Fu, Dhillon Inderjit S. PASSCoDe: Parallel asynchronous stochastic dual co-ordinate descent. In: Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
[8] Jaggi Martin, Smith Virginia, Takáč Martin, Terhorst Jonathan, Krishnan Sanjay, Hofmann Thomas, Jordan Michael I. Communication-efficient distributed dual coordinate ascent. In: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3068–3076.
[9] Lee Jason D., Lin Qihang, Ma Tengyu, Yang Tianbao. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. J. Mach. Learn. Res., 18(122):1–43, 2017. URL http://jmlr.org/papers/v18/16-640.html.
[10] Liu Ji, Wright Stephen J., Ré Christopher, Bittorf Victor, Sridhar Srikrishna. An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res., 16:285–322, 2015.
[11] Ma Chenxin, Konečný Jakub, Jaggi Martin, Smith Virginia, Jordan Michael I., Richtárik Peter, Takáč Martin. Distributed optimization with arbitrary local solvers. Optim. Methods Softw., 32(4):813–848, 2017. doi: 10.1080/10556788.2016.1278445.
[12] Mcdonald Ryan, Mohri Mehryar, Silberman Nathan, Walker Dan, Mann Gideon S. Efficient large-scale distributed training of conditional maximum entropy models. In: Advances in Neural Information Processing Systems, 2009, pp. 1231–1239.
[13] McWilliams Brian, Heinze Christina, Meinshausen Nicolai, Krummenacher Gabriel, Vanchinathan Hastagiri P. LOCO: Distributing ridge regression with random projections. arXiv preprint arXiv:1406.3469, 2014.
[14] Moulines Eric, Bach Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, 2011, pp. 451–459.
[15] Needell Deanna, Ward Rachel, Srebro Nati. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In: Advances in Neural Information Processing Systems, 2014, pp. 1017–1025.
[16] Peng Zhimin, Xu Yangyang, Yan Ming, Yin Wotao. ARock: An algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput., 38(5):A2851–A2879, 2016. doi: 10.1137/15M1024950.
[17] Richtárik Peter, Takáč Martin. Distributed coordinate descent method for learning with big data. J. Mach. Learn. Res., 17(1):2657–2681, 2016. URL http://dl.acm.org/citation.cfm?id=2946645.3007028.
[18] Shalev-Shwartz Shai, Zhang Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res., 14(1):567–599, 2013.
[19] Shalev-Shwartz Shai, Zhang Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program., 155(1):105–145, 2016.
[20] Shamir Ohad, Srebro Nathan, Zhang Tong. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. In: Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014, pp. 1000–1008.
[21] Si Min, Peña Antonio J., Balaji Pavan, Takagi Masamichi, Ishikawa Yutaka. MT-MPI: Multithreaded MPI for many-core environments. In: Proceedings of the 28th ACM International Conference on Supercomputing, ACM, 2014, pp. 125–134.
[22] Takáč Martin, Bijral Avleen, Richtárik Peter, Srebro Nati. Mini-Batch primal and dual methods for SVMs. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1022–1030.
[23] Yang Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. In: Advances in Neural Information Processing Systems, 2013, pp. 629–637.
[24] Yu Hsiang-Fu, Huang Fang-Lan, Lin Chih-Jen. Dual coordinate descent methods for logistic regression and maximum entropy models. Mach. Learn., 85(1):41–75, 2011.
[25] Zhang Tong. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, p. 116.
[26] Zhang Huan, Hsieh Cho-Jui. Fixing the convergence problems in parallel asynchronous dual coordinate descent. In: Sixteenth IEEE International Conference on Data Mining (ICDM), IEEE, 2016, pp. 619–628.
[27] Zhang Ruiliang, Kwok James. Asynchronous distributed ADMM for consensus optimization. In: Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, pp. 1701–1709.
[28] Zhang Yuchen, Xiao Lin. Communication-efficient distributed optimization of self-concordant empirical loss. In: Large-Scale and Distributed Optimization, Springer, 2018, pp. 289–341.

[R1] [1].Agarwal Alekh, Chapelle Olivier, Dudik Miroslav, and Langford John. A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15(1):1111–1133, 2014. [Google Scholar]

[R2] [2].Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn, 3 (1)2011. 1–122. [Google Scholar]

[R3] [3].Chang Chih-Chung, Lin Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2 (3)201127. [Google Scholar]

[R4] [4].Fan Rong-En, Chang Kai-Wei, Hsieh Cho-Jui, Wang Xiang-Rui, Lin Chih-Jen. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9 2008. 1871–1874. [Google Scholar]

[R5] [5].Heinze Christina, McWilliams Brian, Meinshausen Nicolai. DUAL-LOCO: Distributing statistical estimation using random projections. In Gretton Arthur, Robert Christian C., (Eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 51, PMLR, Cadiz, Spain, 2016, pp. 875–883. [Google Scholar]

[R6] [6].Hsieh Cho-Jui, Chang Kai-Wei, Lin Chih-Jen, Keerthi S. Sathiya, Sundararajan Sellamanickam. A dual coordinate descent method for large-scale linear SVM, in: Proceedings of the 25th International Conference on Machine Learning, (ICML), 2008, pp. 408–415. [Google Scholar]

[R7] [7].Hsieh Cho-Jui, Yu Hsiang-Fu, Dhillon Inderjit S. PASSCoDe: Parallel asynchronous stochastic dual co-ordinate descent, in: Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015. [Google Scholar]

[R8] [8].Jaggi Martin, Smith Virginia, Takáč Martin, Terhorst Jonathan, Krishnan Sanjay, Hofmann Thomas, Jordan Michael I. Communication-efficient distributed dual coordinate ascent, in: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3068–3076. [Google Scholar]

[R9] [9].Lee Jason D., Lin Qihang, Ma Tengyu, and Yang Tianbao. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement, J. Mach. Learn. Res, 18 (122) 2017. 1–43, URL http://jmlr.org/papers/v18/16-640.html. [Google Scholar]

[R10] [10].Liu Ji, Wright Stephen J, Ré Christopher, Bittorf Victor, and Sridhar Srikrishna. An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res, 16 2015. 285–322. [Google Scholar]

[R11] [11].Ma Chenxin, Konečný Jakub, Jaggi Martin, Smith Virginia, Jordan Michael I., Richtárik Peter, Takáč Martin. Distributed optimization with arbitrary local solvers, Optim. Methods Softw, 32(4): 813–848, 2017. doi: 10.1080/10556788.2016.1278445. [DOI] [Google Scholar]

[R12] [12].Mcdonald Ryan, Mohri Mehryar, Silberman Nathan, Walker Dan, and Mann Gideon S. Efficient large-scale distributed training of conditional maximum entropy models In Advances in Neural Information Processing Systems, pages 1231–1239, 2009. [Google Scholar]

[R13] [13].McWilliams Brian, Heinze Christina, Meinshausen Nicolai, Krummenacher Gabriel, Vanchinathan Hastagiri P., LOCO: Distributing ridge regression with random projections. 2014, arXiv preprint arXiv:1406.3469. [Google Scholar]

[R14] [14].Moulines Eric and Bach Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning, in: Advances in Neural Information Processing Systems, 2011, pp. 451–459. [Google Scholar]

[R15] [15].Needell Deanna, Ward Rachel, Srebro Nati. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm, in: Advances in Neural Information Processing Systems, 2014, pp. 1017–1025. [Google Scholar]

[R16] [16].Peng Zhimin, Xu Yangyang, Yan Ming, Yin Wotao. ARock: An algorithmic framework for asynchronous parallel coordinate updates, SIAM J. Sci. Comput, 38 (5) 2016. A2851–A2879, 10.1137/15M1024950. [DOI] [Google Scholar]

[R17] [17].Richtárik Peter, Takáč Martin. Distributed coordinate descent method for learning with big data. J. Mach. Learn. Res, 17 (1) 2016. 2657–2681, URL http://dl.acm.org/citation.cfm?id=2946645.3007028. [Google Scholar]

[R18] [18].Shalev-Shwartz Shai, Zhang Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res, 14 (1) 2013. 567–599. [Google Scholar]

[R19] [19].Shalev-Shwartz Shai, Zhang Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program, 1(155) 2016. 105–145. [Google Scholar]

[R20] [20].Shamir Ohad, Srebro Nathan, and Zhang Tong. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method, in: Proceedings of the 31th International Conference on Machine Learning, ICML, Beijing, China, 21-26 June 2014, 2014, pp. 1000–1008. [Google Scholar]

[R21] [21].Si Min, Peña Antonio J, Balaji Pavan, Takagi Masamichi, Ishikawa Yutaka. MT-MPI: Multithreaded MPI for many-core environments, in: Proceedings of the 28th ACM International Conference on Supercomputing, ACM, 2014, pp. 125–134. [Google Scholar]

[R22] [22].Takáč Martin, Bijral Avleen, Richtarik Peter, and Srebro Nati. Mini-Batch primal and dual methods for SVMs, in: Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1022–1030, 2013. [Google Scholar]

[R23] [23].Yang Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent, in: Advances in Neural Information Processing Systems, 2013, pp. 629–637. [Google Scholar]

[R24] [24].Yu Hsiang-Fu, Huang Fang-Lan, Lin Chih-Jen. Dual coordinate descent methods for logistic regression and maximum entropy models. Mach. Learn, 85 (1) 2011. 41–75. [Google Scholar]

[25] Zhang Tong. Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, p. 116.

[26] Zhang Huan, Hsieh Cho-Jui. Fixing the convergence problems in parallel asynchronous dual coordinate descent, in: Sixteenth IEEE International Conference on Data Mining, ICDM, IEEE, 2016, pp. 619–628.

[27] Zhang Ruiliang, Kwok James. Asynchronous distributed ADMM for consensus optimization, in: Proceedings of the 31st International Conference on Machine Learning, ICML, 2014, pp. 1701–1709.

[28] Zhang Yuchen, Xiao Lin. Communication-efficient distributed optimization of self-concordant empirical loss, in: Large-Scale and Distributed Optimization, Springer, 2018, pp. 289–341.
