Abstract
When the data are stored in a distributed manner, direct applications of traditional statistical inference procedures are often prohibitive due to communication costs and privacy concerns. This paper develops and investigates two Communication-Efficient Accurate Statistical Estimators (CEASE), implemented through iterative algorithms for distributed optimization. In each iteration, node machines carry out computation in parallel and communicate with the central processor, which then broadcasts aggregated information to the node machines for new updates. The algorithms adapt to the similarity among the loss functions on the node machines and converge rapidly when each node machine has a large enough sample size. Moreover, they do not require good initialization and enjoy linear convergence guarantees under general conditions. The contraction rate of the optimization errors is presented explicitly, with its dependence on the local sample size unveiled. In addition, the improved statistical accuracy per iteration is derived. By regarding the proposed method as a multi-step statistical estimator, we show that statistical efficiency can be achieved in finitely many steps in typical statistical applications. We also give conditions under which the one-step CEASE estimator is statistically efficient. Extensive numerical experiments on both synthetic and real data validate the theoretical results and demonstrate the superior performance of our algorithms.
Keywords: Distributed statistical estimation, Communication efficiency, Multi-round algorithms, Penalized likelihood
1. Introduction
Statistical inference in the modern era faces tremendous challenges in computation and storage. The exceedingly large size of data often makes it impossible to store all of them on a single machine. Moreover, many applications have individual agents (e.g. local governments, research labs, hospitals, smart phones) collecting data independently. Communication is prohibitively expensive due to limited bandwidth, and direct data sharing raises concerns about privacy and loss of ownership. These constraints make it necessary to develop methodologies for distributed systems, solving statistical problems with divide-and-conquer procedures and communicating only certain summary statistics.
Distributed statistical inference has received considerable attention recently, covering a wide spectrum of topics including M-estimation (Zhang et al., 2013; Chen and Xie, 2014; Shamir et al., 2014; Rosenblatt and Nadler, 2016; Wang et al., 2017a; Lee et al., 2017b; Battey et al., 2018; Wang et al., 2018; Shi et al., 2018; Banerjee et al., 2019), principal component analysis (Fan et al., 2019; Garber et al., 2017), nonparametric regression (Shang and Cheng, 2017; Szabó and Van Zanten, 2019; Han et al., 2018), quantile regression (Volgushev et al., 2019; Chen et al., 2021), bootstrap (Kleiner et al., 2014), confidence intervals (Jordan et al., 2019; Chen et al., 2021), Bayesian methods (Wang and Dunson, 2013; Jordan et al., 2019), etc. In the commonly-used setting, the overall dataset is partitioned and stored on m node machines connected to a central processor. Most of the approaches only require one round of communication: the node machines work in parallel and send their results to the central processor, which then aggregates the information to get a final result. As typical examples, Zhang et al. (2013) average the M-estimators on node machines; Battey et al. (2018) average debiased estimators; and Fan et al. (2019) average subspaces via eigen-decomposition. While these one-shot methods are communication-efficient, they only work with a small number of node machines (e.g. , where N is the total sample size) and require a large sample size on each machine, as their theories heavily rely on asymptotic expansions of estimators. Such conditions are easily violated in practice.
Multi-round procedures, which alternate between local computations and global aggregations, come as a remedy. It is possible to achieve optimal statistical precision after a few rounds of communication, under broader settings than those for one-shot procedures. Shamir et al. (2014) propose a Distributed Approximate NEwton (DANE) algorithm where, in each iteration, each node machine minimizes a modified loss function based on its own samples and the gradient information from all other machines obtained through communication. However, for non-quadratic losses, the analysis in Shamir et al. (2014) does not imply any advantage of DANE in terms of communication over a distributed implementation of gradient descent. Other approximate Newton algorithms include Zhang and Xiao (2015), Wang et al. (2018), Chen et al. (2021) and Crane and Roosta (2019). Jordan et al. (2019) develop a Communication-efficient Surrogate Likelihood (CSL) framework for estimation and inference in regular parametric models, penalized regression, and Bayesian statistics. A similar method also appears in Wang et al. (2017a). These methods no longer have restrictions on the number of machines such as .
Due to the nature of Newton-type methods, existing theories for these algorithms heavily rely on good initialization or even a self-concordance assumption on the loss functions. They essentially focus on improving an initial estimator that is already consistent but not efficient, an idea that coincides with the classical one-step estimator (Bickel, 1975). Such initialization itself requires additional effort and assumptions. Moreover, current results still require each machine to have sufficiently many samples so that the loss functions on different machines are similar to each other. These issues make such methods unreliable in practice.
Aside from distributed statistical inference, there is also a vast literature on distributed optimization. The ADMM (Boyd et al., 2011) is a celebrated example among the numerous algorithms that handle deterministic optimization problems with minimal structural assumptions. Yet its convergence can be quite slow, and it cannot fully utilize the similarity among the loss functions on node machines.
In this paper, we develop and study two Communication-Efficient Accurate Statistical Estimators (CEASE) based on multi-round algorithms for distributed statistical estimation. Our new algorithms extend the DANE algorithm (Shamir et al., 2014) to regularized empirical risk minimization. Moreover, we provide sharp convergence guarantees for general scenarios, even if the local loss functions are dissimilar and regularization is nonsmooth.
We assume that all the m node machines have the same sample size n. Each has a regularized empirical risk function fk+g defined by the samples stored there, and the goal is to compute the minimizer of the overall regularized risk function to statistical precision. When n is sufficiently large, the rates of convergence of our algorithms are better than or comparable to those of existing methods designed for this large-sample regime. For moderate or small n, our algorithms are still guaranteed to converge linearly even without good initialization, while other statistical methods fail. In addition, our algorithms take advantage of the similarity among the local loss functions and thus improve over general-purpose algorithms like ADMM. They interpolate between distributed algorithms for statistical estimation and those for general deterministic problems. Theoretical findings are verified by extensive numerical experiments. From a technical point of view, our algorithms use the proximal point algorithm (Rockafellar, 1976) as the backbone and obtain inexact updates in a distributed manner. This turns out to be crucial for proving convergence under general conditions. Our techniques are potentially useful for studying other distributed algorithms.
The rest of this paper is organized as follows. Section 2 introduces the algorithms. Section 3 presents deterministic convergence results. Section 4 provides guarantees in statistical problems. Section 5 shows numerical results on both synthetic and real data. Section 6 concludes the paper and discusses possible future directions.
Here we list the notations used throughout the paper. We denote by [n] the set {1, 2, ⋯, n}. We write an = O(bn) or an ≲ bn if there exists a constant C > 0 such that an ≤ Cbn holds for sufficiently large n; and an ≍ bn if an = O(bn) and bn = O(an). Given and r > 0, we define and . For a convex function h on , we let ∂h(x) be its sub-differential set at , and be the set of its minimizers if . We use ∥·∥2 to denote the ℓ2 norm of a vector or operator norm of a matrix. For two sequences of random variables and where Yn≥0, we write if for any ε > 0 there exists C > 0 such that for sufficiently large n. We use to refer to the sub-Gaussian norm of random variable X, and to denote the sub-Gaussian norm of random vector X.
2. The CEASE algorithm
2.1. Problem setup
Let be an unknown probability distribution over some sample space . For any parameter , define its population risk based on a loss function . In parametric inference problems, ℓ is often chosen as the negative log-likelihood function of some parametric family. Under mild conditions, F is well-defined and has a unique minimizer θ*. A ubiquitous problem in statistics and machine learning is to estimate θ* given i.i.d. samples from , and the minimizer of the empirical risk becomes a natural candidate. To achieve desirable precision in high-dimensional problems, it is often necessary to incorporate prior knowledge of θ*. A principled approach is the regularized empirical risk minimization
minθ∈ℝp { f(θ) + g(θ) },    (2.1)
with f the empirical risk over all N samples,
where g(θ) is a deterministic penalty function. Common choices for g(θ) include the ℓ2 penalty (Hoerl and Kennard, 1970), the ℓ1 penalty λ∥θ∥1 (Tibshirani, 1996), and a family of folded concave penalty functions ∥pλ(∣θ∣)∥1 such as SCAD (Fan and Li, 2001) and MCP (Zhang, 2010), where λ > 0 is a regularization parameter. Throughout the paper, we assume that both ℓ and g are convex in θ, and ℓ is twice continuously differentiable in θ. We allow g to be nonsmooth (e.g. the ℓ1 penalty).
Consider the distributed setting where the N samples are stored on m machines connected to a central processor. Denote by the index set of samples on the kth machine and . For simplicity, we assume that are disjoint, N is a multiple of m, and for all k ∈ [m]. Then (2.1) can be rewritten as
minθ∈ℝp { (1/m) Σk=1m fk(θ) + g(θ) },    (2.2)
where fk denotes the empirical risk computed from the samples on the kth machine.
Each machine k only has access to its local data and hence local loss function fk and the penalty g. We aim to solve (2.2) in a distributed manner with both statistical efficiency and communication-efficiency.
2.2. Adaptive gradient enhancements and distributed algorithms in large-sample regimes
In the large-sample regime, we drop the regularization term for now and consider the empirical risk minimization problem for estimating θ*. In some problems, direct minimization of f is costly, while it is easy to obtain a rough estimate that is close to θ* but not as accurate as the global minimizer. Bickel (1975) proposes the one-step estimator based on a local quadratic approximation and shows that it is as efficient as θ if the initial estimator is accurate enough. Iterating this further results in multiple-step estimators that improve the optimization error, and hence the statistical error, when the initial estimator is not good enough (Robinson, 1988). This inspires us to refine an existing estimator using some proxy of f.
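For concreteness, the one-step estimator is a single Newton-type refinement of an initial estimate; the display below restates this textbook formula in our own notation, with θ~ denoting the initial estimator (it is not a verbatim excerpt of the paper).

```latex
% One-step estimator (Bickel, 1975) and its multi-step iteration (Robinson, 1988),
% written for the empirical risk f with initial estimator \tilde{\theta}.
\theta^{(1)} = \tilde{\theta} - \bigl[\nabla^2 f(\tilde{\theta})\bigr]^{-1} \nabla f(\tilde{\theta}),
\qquad
\theta^{(s+1)} = \theta^{(s)} - \bigl[\nabla^2 f(\theta^{(s)})\bigr]^{-1} \nabla f(\theta^{(s)}), \quad s \ge 1 .
```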
In the distributed environment, starting from an initial estimator , the gradient vector can easily be communicated. Construct a linear function , the first-order Taylor expansion of f around . The objective function to be minimized can be written as
Since the linear function f(1)(θ) can easily be communicated to each node machine whereas R(·) cannot, the latter is naturally replaced by its subsampled version at node k:
where fk(θ) is the loss function based on the data at node k. With this replacement, the target of optimization at node k becomes f(1)(θ) + Rk(θ), which equals
up to an additive constant. This function will be called the gradient-enhanced loss function, in which the gradient based on the local data is replaced by the global one. This function has a very nice fixed point at the global minimum θ: the minimizer of the gradient-enhanced loss function constructed at θ is still θ. This can easily be seen by verifying that its gradient at θ is zero.
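As a sanity check of this fixed-point property, here is a minimal numerical sketch on toy quadratic losses; the setup and variable names are our own and are not code from the paper.

```python
import numpy as np

# Toy quadratic losses f_k(theta) = 0.5 * theta' A_k theta - b_k' theta on m machines.
rng = np.random.default_rng(0)
p, m = 5, 4
A = [rng.standard_normal((p, p)) for _ in range(m)]
A = [0.5 * (Ak + Ak.T) + 10 * np.eye(p) for Ak in A]         # symmetric positive definite
b = [rng.standard_normal(p) for _ in range(m)]
A_bar, b_bar = sum(A) / m, sum(b) / m

theta_hat = np.linalg.solve(A_bar, b_bar)                    # global minimizer of f = mean of f_k

grad_fk = lambda k, theta: A[k] @ theta - b[k]
grad_f = lambda theta: sum(grad_fk(k, theta) for k in range(m)) / m

# Minimizer of the gradient-enhanced loss on machine k, constructed at theta_hat:
#   argmin_theta { f_k(theta) - <grad f_k(theta_hat) - grad f(theta_hat), theta> }.
# Its first-order condition is A_k theta - b_k - grad f_k(theta_hat) + grad f(theta_hat) = 0.
k = 0
theta_k = np.linalg.solve(A[k], b[k] + grad_fk(k, theta_hat) - grad_f(theta_hat))
print(np.allclose(theta_k, theta_hat))                       # True: theta_hat is a fixed point
```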
The idea of using such an adaptive gradient-enhanced function has been proposed in Shamir et al. (2014) and Jordan et al. (2019), though the motivations are different. Jordan et al. (2019) develop a Communication-efficient Surrogate Likelihood (CSL) method that uses the gradient-enhanced loss function on the first machine, takes the minimizer on that machine as the new estimate, and iterates these steps until convergence. In the presence of a regularizer g in (2.1), one simply adds g to the gradient-enhanced loss; see Algorithm 1 below.
Algorithm 1 CSL (Jordan et al., 2019)
Input: Initial value θ0, number of iterations T.
For t = 0,1,2,⋯,T–1:
• Each machine evaluates ∇fk(θt) and sends it to the 1st machine;
• The 1st machine computes ∇f(θt) = (1/m) Σk=1m ∇fk(θt) and
θt+1 = argminθ{f1(θ)+g(θ)–⟨∇f1(θt)–∇f(θt),θ⟩}
and broadcasts θt+1 to the other machines.
Output: θT.
Note that in Algorithm 1, only the first machine solves optimization problems and the others just evaluate gradients; these machines are idle while the first one is working hard. To fully utilize the computing power of all machines and accelerate convergence, all the machines can optimize their corresponding gradient-enhanced loss functions in parallel, and the central processor then aggregates the results. This is motivated by the Distributed Approximate NEwton (DANE) algorithm (Shamir et al., 2014). Algorithm 2 describes the procedure in detail. Intuitively, the averaging step requires little computation but helps reduce the variance of the estimators on the node machines and enhance the accuracy.
Algorithm 2 Distributed estimation using gradient-enhanced loss
Input: Initial value θ0, number of iterations T.
For t = 0,1,2,⋯,T–1:
• Each machine evaluates ∇fk(θt) and sends it to the central processor;
• The central processor computes ∇f(θt) = (1/m) Σk=1m ∇fk(θt) and broadcasts it to the machines;
• Each machine computes
θt,k = argminθ{fk(θ)+g(θ)–⟨∇fk(θt)–∇f(θt),θ⟩}
and sends it to the central processor;
• The central processor computes θt+1 = (1/m) Σk=1m θt,k and broadcasts it to the machines.
Output: θT.
We now illustrate Algorithm 2 in the context of linear regression. Given samples {(xi, yi)}i∈[N], the kth machine defines a quadratic loss function
Here, and . The overall loss function is , where . Then the update of Algorithm 2 in one iteration is
(2.3)
(2.4)
Intuitively, this is a form of contraction towards the global minimizer θ. As for the logistic regression, we can also write out the corresponding enhanced losses and minimize them using Newton’s method. Due to space limitations, we refer to Appendix C for details.
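To make the contraction explicit, here is a short simulation sketch of Algorithm 2 for distributed least squares without a penalty; the simulated setup and notation are our own. For the quadratic loss, each local update has the closed form θt,k = θt – Σ̂k–1∇f(θt) with Σ̂k the local Gram matrix, and the averaging step pulls θt towards the centralized least-squares solution.

```python
import numpy as np

# Sketch of Algorithm 2 for unpenalized distributed least squares (toy setup).
rng = np.random.default_rng(1)
N, p, m = 2000, 10, 8
n = N // m
theta_star = rng.standard_normal(p)
X = rng.standard_normal((N, p))
y = X @ theta_star + rng.standard_normal(N)
Xk, yk = np.split(X, m), np.split(y, m)

theta_hat = np.linalg.solve(X.T @ X / N, X.T @ y / N)          # centralized solution

theta_t = np.zeros(p)
for _ in range(5):
    # global gradient, obtained by averaging the local gradients
    grad_f = np.mean([Xk[k].T @ (Xk[k] @ theta_t - yk[k]) / n for k in range(m)], axis=0)
    # closed-form local updates theta_{t,k} = theta_t - Sigma_k^{-1} grad f(theta_t)
    theta_locals = [theta_t - np.linalg.solve(Xk[k].T @ Xk[k] / n, grad_f) for k in range(m)]
    theta_t = np.mean(theta_locals, axis=0)                    # averaging on the central processor
    print(np.linalg.norm(theta_t - theta_hat))                 # optimization error shrinks geometrically
```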
2.3. The CEASE Algorithm in general regimes
Algorithms 1 and 2 are built upon large-sample regimes, with sufficiently strong convexity of the local loss functions and small discrepancy among them. This requires the local sample size n to be large enough, which may not be the case in practice. Even worse, the required local sample size depends on structural parameters, making such a condition unverifiable. In fact, our numerical experiments confirm the instability of Algorithms 1 and 2 even for moderate n. A naive remedy is to add a strictly convex quadratic regularization q(θ). While this can make the algorithm converge rapidly, the nonadaptive nature of q(θ) leads to a wrong target. Instead of using a fixed q, we adjust it according to the current solution. The idea stems from the proximal point algorithm (Rockafellar, 1976).
Definition 2.1. For any convex function h on ℝp, define the proximal mapping proxh(x) = argminu∈ℝp { h(u) + ∥u–x∥22/2 }, x ∈ ℝp.
For a given α > 0, the proximal point algorithm for minimizing h iteratively computes
xt+1 = proxα−1h(xt) = argminx { h(x) + (α/2)∥x–xt∥22 },
starting from some initial value x0. Each iteration is a strongly convex optimization problem that shrinks towards the current value xt. Under mild conditions, the sequence {xt} converges linearly to some minimizer of h (Rockafellar, 1976).
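As a concrete illustration (a toy quadratic example of our own, not taken from the paper), the proximal step has a closed form and the iteration converges linearly:

```python
import numpy as np

# Proximal point iterations for a strongly convex quadratic h(x) = 0.5 * x' A x - b' x.
# The step x_{t+1} = argmin_x { h(x) + (alpha/2) ||x - x_t||^2 } equals (A + alpha I)^{-1} (b + alpha x_t).
rng = np.random.default_rng(2)
p, alpha = 5, 1.0
A = rng.standard_normal((p, p))
A = A @ A.T + np.eye(p)                        # symmetric positive definite
b = rng.standard_normal(p)
x_star = np.linalg.solve(A, b)                 # minimizer of h

x = np.zeros(p)
for _ in range(10):
    x = np.linalg.solve(A + alpha * np.eye(p), b + alpha * x)
    print(np.linalg.norm(x - x_star))          # distance to the minimizer shrinks by a fixed factor
```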
Now we take h = f + g and write the proximal point iteration for our problem (2.2):
θt+1 = proxα−1(f+g)(θt) = argminθ∈ℝp { f(θ) + g(θ) + (α/2)∥θ–θt∥22 }.    (2.5)
Each iteration (2.5) is a distributed optimization problem, whose objective function is not available to the node machines. But it can be solved by Algorithms 1 and 2. Specifically, suppose we have already obtained θt and aim for θt+1 in (2.5). Letting , Algorithm 2 starting from θ0 = θt produces iterations over s = 0,1,⋯
When α+ρ0>δ, this sequence converges Q-linearly* to θt+1. On the other hand, there is no need to solve (2.5) exactly, as proxα−1(f+g)(θt) is merely an intermediate quantity for computing θ. We therefore only run one iteration of Algorithm 2 and use the resulting approximate solution as θt+1. This considerably simplifies the algorithm, reducing double loops to a single loop, and enhances the statistical interpretation of the method as a multi-step estimator. However, it makes the technical analysis more challenging. Similarly, we may also use one step of Algorithm 1 to compute the inexact proximal update.
The above discussions lead us to propose two Communication-Efficient Accurate Statistical Estimators (CEASE) in Algorithms 3 and 4, which use the proximal point algorithm as the backbone and obtain inexact updates in a distributed manner. They are regularized versions of Algorithms 1 and 2, with an additional proximal term in the objective functions. That term reduces the relative differences among the local loss functions on individual machines, and is crucial for convergence when the local loss functions are not similar enough. In Appendix A we introduce a variant of Algorithm 4 which also stabilizes Algorithm 2.
Algorithm 3 Communication-Efficient Accurate Statistical Estimators (CEASE)
Input: Initial value θ0, regularizer α ≥ 0, number of iterations T.
For t = 0,1,2,⋯,T–1:
• Each machine evaluates ∇fk(θt) and sends it to the 1st machine;
• The 1st machine computes ∇f(θt) = (1/m) Σk=1m ∇fk(θt) and
θt+1 = argminθ{f1(θ)+g(θ)–⟨∇f1(θt)–∇f(θt),θ⟩+(α/2)∥θ–θt∥22},
and broadcasts θt+1 to the other machines.
Output: θT.
Algorithm 4 CEASE with averaging
Input: Initial value θ0, regularizer α ≥ 0, number of iterations T.
For t = 0,1,2,⋯,T–1:
• Each machine evaluates ∇fk(θt) and sends it to the central processor;
• The central processor computes ∇f(θt) = (1/m) Σk=1m ∇fk(θt) and broadcasts it to the machines;
• Each machine computes
θt,k = argminθ{fk(θ)+g(θ)–⟨∇fk(θt)–∇f(θt),θ⟩+(α/2)∥θ–θt∥22}
and sends it to the central processor;
• The central processor computes θt+1 = (1/m) Σk=1m θt,k and broadcasts it to the machines.
Output: θT.
In each iteration, Algorithm 3 has one round of communication and one optimization problem to solve. Although Algorithm 4 has two rounds of communication per iteration, only one round involves parallel optimization and the other is simply averaging. We will compare their theoretical guarantees as well as practical performances in the sequel.
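To make the procedure concrete, below is a minimal sketch of one run of Algorithm 4 for unpenalized logistic regression (g = 0) on simulated data. The setup, variable names, and the use of a generic quasi-Newton solver from scipy for the local subproblems are our own choices, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated logistic-regression data split across m machines (assumed toy setup).
rng = np.random.default_rng(3)
N, p, m, alpha = 4000, 5, 8, 0.1
n = N // m
theta_star = rng.standard_normal(p)
X = rng.standard_normal((N, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_star)))
Xk, yk = np.split(X, m), np.split(y, m)

def local_grad(k, theta):
    # gradient of f_k, the average negative log-likelihood on machine k
    return Xk[k].T @ (1.0 / (1.0 + np.exp(-Xk[k] @ theta)) - yk[k]) / n

def local_objective(theta, k, theta_t, grad_global):
    z = Xk[k] @ theta
    fk = np.mean(np.logaddexp(0.0, z) - yk[k] * z)                 # f_k(theta)
    shift = local_grad(k, theta_t) - grad_global                   # grad f_k(theta_t) - grad f(theta_t)
    return fk - shift @ theta + 0.5 * alpha * np.sum((theta - theta_t) ** 2)

# centralized estimator, computed here only for reference
theta_full = minimize(lambda th: np.mean(np.logaddexp(0.0, X @ th) - y * (X @ th)),
                      np.zeros(p)).x

theta_t = np.zeros(p)
for _ in range(5):
    grad_global = np.mean([local_grad(k, theta_t) for k in range(m)], axis=0)
    theta_locals = [minimize(local_objective, theta_t, args=(k, theta_t, grad_global)).x
                    for k in range(m)]
    theta_t = np.mean(theta_locals, axis=0)                        # averaging step
    print(np.linalg.norm(theta_t - theta_full))                    # approaches the centralized fit
```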
Algorithm 4 is an extension of the DANE algorithm in Shamir et al. (2014) to regularized empirical risk minimization. While DANE is originally motivated by mirror descent, we view it as a distributed implementation of the proximal point algorithm. The new perspective helps us obtain stronger convergence guarantees. Ideas from the proximal point algorithm have appeared in the literature of distributed stochastic optimization for different purposes such as accelerating first-order algorithms (Lee et al., 2017a) and regularizing sizes of updates (Wang et al., 2017b).
3. Deterministic analysis
We first present in Section 3.1 the deterministic (almost sure) results for Algorithms 3 and 4 based on high-level structural assumptions. As special cases of these algorithms with α = 0, Algorithms 1 and 2 will be analyzed in Section 3.2.
3.1. Deterministic analysis of the CEASE algorithm
Definition 3.1. Let h : ℝp → ℝ be a convex function, Ω ⊆ ℝp be a convex set, and ρ ≥ 0. h is ρ-strongly convex in Ω if h(y) ≥ h(x) + ⟨g, y–x⟩ + (ρ/2)∥y–x∥22, ∀x, y ∈ Ω and g ∈ ∂h(x).
Assumption 3.1 (Strong convexity). f + g has a unique minimizer θ, and is ρ-strongly convex in B(θ, R) for some R > 0 and ρ > 0.
Assumption 3.2 (Homogeneity). ∥∇2fk(θ′)–∇2f(θ′)∥2 ≤ δ, ∀k ∈ [m], θ′ ∈ B(θ, R).
We will refer to δ as a homogeneity parameter. Based on both assumptions, we define
(3.1)
A simple but useful fact is max{ρ–δ, 0} ≤ ρ0 ≤ ρ. In most interesting problems, the population risk F is smooth and strongly convex on any compact set. When are i.i.d. and the total sample size N is large, the empirical risk f concentrates around F and inherits nice properties from the latter, making Assumption 3.1 hold easily.
On the other hand, since the empirical risk functions are i.i.d. stochastic approximations of the population risk F, they should not be too far away from their average f provided that n is not too small. Assumption 3.2 is a natural way of characterizing this similarity. It is a generalization of the concept “δ-related functions” for quadratic losses in Arjevani and Shamir (2015). With high probability, it holds with reasonably small δ and large R under general conditions. Large n implies small homogeneity parameter δ and thus similar . Assumption 3.2 always holds with .
The following additional assumption on smoothness of the Hessian matrix of f + g is not necessary for contraction, but it helps us obtain a much stronger result on the contraction rate of Algorithm 2, justifying the power of the simple averaging step.
Assumption 3.3 (Smoothness of Hessian). g is twice continuously differentiable in B(θ, R), and there exists M ≥ 0 such that ∥[∇2f(θ′) + ∇2g(θ′)]−[∇2f(θ″)+∇2g(θ″)]∥2 ≤ M∥θ′–θ″∥2, ∀θ′, θ″ ∈ B(θ, R).
Theorem 3.1 gives contraction guarantees for Algorithms 3 and 4. It is deterministic and non-asymptotic by nature.
Theorem 3.1. Let Assumptions 3.1 and 3.2 hold. Consider the multi-step estimators generated by Algorithm 3 or 4. Suppose that θ0 ∈ B(θ, R/2) and [δ/(ρ0 + α)]2 < ρ/(ρ+2α).
-
If Assumption 3.3 also holds, then for Algorithm 4 we have
(3.3)
where we define ;
Both multiplicative factors in (3.2) and (3.3) are strictly less than 1.
In the contraction factor in (3.2), the two summands and come from bounding the inexact proximal update ∥θt+1–proxα−1(f+g)(θt)∥2 and the residual ∥proxα−1(f+g)(θt)–θ∥2, respectively. Similar results hold for (3.3). The condition [δ/(ρ0 + α)]2 < ρ/(ρ+2α) ensures that both contraction factors are less than 1. Note that (3.3) requires Assumption 3.3, which forces g to be smooth.
Theorem 3.1 shows the Q-linear convergence of the sequence generated by both Algorithms 3 and 4 under quite general settings. The contraction rate depends explicitly on the structural parameters and the choice of α. The local loss functions only need to be convex and smooth, and the convex penalty g is allowed to be non-smooth, e.g. the ℓ1 norm. In contrast, most algorithms for distributed statistical estimation are designed only for smooth problems, and many of them are rigorously studied only when the loss functions are quadratic or self-concordant (Shamir et al., 2014; Zhang and Xiao, 2015; Wang et al., 2017b). This is another important aspect of our contributions.
We immediately see from Theorem 3.1 that Algorithms 3 and 4 converge linearly as long as [δ/(ρ0 + α)]2 < ρ/(ρ+2α), which is guaranteed to hold by choosing sufficiently large α. By contrast, we will show in Section 3.2 that the convergence of Algorithms 1 and 2 (corresponding to α = 0) hinges on the homogeneity assumption ρ0 > δ in Theorem 3.2, i.e. the local loss functions must be similar enough. In the statistical setting, this requires the local sample size n to be large. Therefore, proper regularization provides a safety net for the algorithms in general regimes with potentially insufficient local sample size. Corollary 3.1 below gives a guideline for choosing α so that Algorithms 3 and 4 converge in general.
Corollary 3.1. Let Assumptions 3.1 and 3.2 hold, θ0 ∈ B(θ, R/2), and be the iterates of Algorithm 3 or 4. With any α ≥ 4δ2 / ρ, both algorithms converge with contraction factors in (3.2) and (3.3) bounded by .
On the other hand, consider the case where the local loss functions have a small relative difference δ/ρ. In this case, Theorem 3.2 states that the contraction factors of the unregularized versions (α = 0) of Algorithms 3 and 4 are of order δ/ρ and (δ/ρ)2, respectively, which are smaller than the contraction factors with α > 0. The following corollary characterizes an upper bound on α so that the contraction factors remain at these small orders.
Corollary 3.2. Let Assumptions 3.1 and 3.2 hold, θ0 ∈ B(θ, R/2), and suppose α ≤ Cδ2 / ρ for some constant C. There exist constants C1 and C2 such that the following hold when δ/ρ is sufficiently small:
- If Assumption 3.3 also holds and ∥θt – θ∥2≤ ρ/M, then for Algorithm 4
Note that the second result above only holds under Assumption 3.3, which requires a smooth regularization g. Corollary 3.2 reveals that by choosing α ≍ δ2/ρ, the contraction factors are essentially the same as those of the unregularized (α = 0) algorithms when δ/ρ is small. Combining Corollaries 3.1 and 3.2, we use α ≍ δ2/ρ as a default choice for Algorithms 3 and 4 to be both fast and robust. They are reliable in general cases (Corollary 3.1) and efficient in nice cases (Corollary 3.2 and Theorem 3.1 with α = 0).
Algorithms 3 and 4 attain communication efficiency by utilizing similarity among local loss functions: The contraction factors in Corollary 3.2 go to zero if δ/ρ does. In fact, both algorithms achieve ε-accuracy within rounds of communication. In contrast, the distributed accelerated gradient descent requires rounds of communication to achieve ε-accuracy (Shamir et al., 2014), with κ0 being the condition number of (f + g), which does not take advantage of sample size n. As long as , Algorithms 3 and 4 communicate less than the distributed accelerated gradient descent. And again, our general results for Algorithms 3 and 4 also apply to the case with nonsmooth penalty functions while those for distributed accelerated gradient descent do not.
Moreover, if (f + g) is smooth and θt is reasonably close to θ, Corollary 3.2 shows that each iteration of Algorithm 4 is roughly equivalent to two iterations of Algorithm 3, although the former only has one round of optimization. The averaging step in Algorithm 4 reduces the error as much as the optimization step, while taking much less time. In this case, Algorithm 4 is preferable, and our numerical experiments also confirm this.
For unregularized empirical risk minimization, i.e. g = 0 in (2.2), Algorithm 4 reduces to an extension, or a useful special case, of the DANE algorithm (Shamir et al., 2014). In this case, Theorem 3.1 and its corollaries refine the analysis of DANE (Shamir et al., 2014) in several aspects. On the one hand, our analysis handles both smooth and nonsmooth problems, while in Shamir et al. (2014) the theoretical analysis beyond the quadratic loss requires extreme choices of tuning parameters and does not show any advantage over a distributed implementation of gradient descent. On the other hand, as mentioned in Section 2, we derive Algorithm 4 from the proximal point algorithm with a new perspective, which leads to a sharp convergence analysis along with suggestions on choosing the tuning parameter α. As a by-product, we close a gap in the theory of DANE in non-quadratic settings. Our analysis techniques are potentially useful for other distributed optimization algorithms, especially when the loss is not quadratic.
3.2. Deterministic analysis in large-sample regimes
In this section, we restrict ourselves to large-sample regimes where the local sample size n is sufficiently large so that ρ0 > δ ≥ 0, with ρ0 the strong convexity parameter defined in (3.1). The following theorem gives deterministic results for Algorithms 1 and 2.
Theorem 3.2. Let Assumptions 3.1 and 3.2 hold, and ρ0 > δ ≥ 0. Consider the iterates produced by Algorithm 1 or 2, with θ0 ∈ B(θ, R). Then
In addition, if Assumption 3.3 also holds, then for Algorithm 2 we have
Note that the last inequality requires Assumption 3.3 and thus a smooth regularization g. The first part of Theorem 3.2 is a refinement of the analysis in Jordan et al. (2019), since we allow the initial estimator to be inaccurate and we have more explicit rates of contraction of optimization errors. This will be further demonstrated in Section 4.2.
The second part points out benefits of the averaging step, which is a novel result. Similar to the results on Algorithm 4, Theorem 3.2 shows that when θt is close to θ, with an additional standard assumption on Hessian smoothness, the averaging step alone in Algorithm 2 is almost as powerful as an optimization step in terms of contraction: The contracting constant will eventually be . With negligible computational cost, averaging significantly improves upon individual solutions by doubling the speed of convergence.
4. Statistical analysis
We further analyze the statistical properties of the above algorithms under a generalized linear model. Essentially, both the CSL method and the CEASE algorithms are T-step estimators starting from the initial estimator θ0. The questions here are the effect of iterations in the multi-step estimators and the role of the initial estimator. We start with the statistical analysis of Algorithms 3 and 4 in Section 4.1, and then study Algorithms 1 and 2 in Section 4.2. In Section 4.3, we provide practical guidance for implementing the CEASE algorithms based on these analyses.
4.1. Multi-step estimators in general regimes
The deterministic analysis in Section 3.2 applies to a wide range of statistical models. Here we consider the generalized linear model with canonical link, where our samples are i.i.d. pairs of covariates and responses and the conditional density of yi given xi is given by
For simplicity, we set the dispersion parameter to 1, as we do not consider the issue of over-dispersion; b(·) is a known convex function, and c is a known function such that h is a valid probability density function. The negative log-likelihood of the whole data is an affine transformation of with
It is easy to verify that
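For reference, the standard canonical-link expressions take the following form; this block restates textbook GLM facts in our own notation, and in particular the index set ℐk for the samples on machine k is our symbol rather than the paper's.

```latex
% Canonical-link GLM with dispersion 1: conditional density, local negative
% log-likelihood (up to an affine transformation), and its gradient and Hessian.
\begin{align*}
h(y_i \mid x_i) &= c(y_i)\,\exp\!\bigl\{ y_i\, x_i^\top \theta^* - b(x_i^\top \theta^*) \bigr\}, \\
f_k(\theta) &= \frac{1}{n} \sum_{i \in \mathcal{I}_k} \bigl\{ b(x_i^\top \theta) - y_i\, x_i^\top \theta \bigr\}, \\
\nabla f_k(\theta) &= \frac{1}{n} \sum_{i \in \mathcal{I}_k} \bigl\{ b'(x_i^\top \theta) - y_i \bigr\}\, x_i, \qquad
\nabla^2 f_k(\theta) = \frac{1}{n} \sum_{i \in \mathcal{I}_k} b''(x_i^\top \theta)\, x_i x_i^\top .
\end{align*}
```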
Assume that , where are i.i.d. random covariate vectors with zero mean and covariance matrix Σ. Suppose there exist universal positive constants A1, A2 and A3 such that . Let , g be a deterministic penalty function, and be the population risk function. Below we impose some standard regularity assumptions.
Assumption 4.1.
{xi}i∈[N] are i.i.d. sub-Gaussian random vectors.
For all x ∈ ℝ, |b″(x)| and |b‴(x)| are bounded by some constant.
∥θ*∥2 is bounded by some constant.
As in Assumptions 3.1 and 3.3, the following general assumptions are also needed for our analysis. Here R is some positive quantity that satisfies R < A4pA5 for some universal constants A4 and A5.
Assumption 4.2. There exists a universal constant ρ > 0 such that (F + g) is ρ-strongly convex in B(θ*, 2R).
The following smoothness assumption is only needed for a part of our theory; it is used to show that the averaging step in Algorithm 4 can significantly enhance the accuracy.
Assumption 4.3. g is twice continuously differentiable in B(θ*, 2R), and there exists a universal constant M ≥ 0 such that ∥[∇2F(θ′) + ∇2g(θ′)]–[∇2F(θ″) + ∇2g(θ″)]∥2 ≤ M∥θ′–θ″∥2, ∀θ′, θ″ ∈ B(θ*, 2R).
Under the model assumptions above, we can explicitly determine the rate of δ in Assumption 3.2. In particular, we will show in Lemma E.5 in Appendix E that
provided that n ≥ cp for an arbitrary positive constant c. Therefore, with high probability, . Omitting the logarithmic terms, we see that the contraction factor is approximately , where κ ≜∥ Σ ∥2/ρ can be viewed as a condition number. This rate is more explicit on p and κ than that in Jordan et al. (2019), where finite p and κ are assumed. In addition, with a smooth regularization, Algorithm 4 benefits from the averaging step in that it improves the contraction rate to approximately κ2p/n.
Let θt be the t-th iterate of one of the proposed algorithms. It is clear that the statistical error of the estimator θt is bounded by its optimization error plus the statistical error of the global minimizer θ:
∥θt – θ*∥2 ≤ ∥θt – θ∥2 + ∥θ – θ*∥2.
The second term is well studied in statistics and is of order under mild conditions. In the following theorem, we show that for the first term (i.e. the optimization error), with a proper choice of α, each iteration of Algorithm 3 or 4 brings θt closer to the global minimum θ by a factor depending on the local sample size. Thus, after finitely many steps, the optimization error becomes negligible in comparison with the statistical error (assuming N is of order (n/p)a for a finite a, as in typical applications), and the distributed multi-step estimator works as well as the global minimizer computed as if the data were aggregated on the central server.
Theorem 4.1. Suppose that Assumptions 4.1 and 4.2 hold, and that with high probability the initial value θ0 ∈ B(θ, R/2). Let η = κ2 (log N)p/n and κ = ∥Σ∥2/ρ. For any c1, c2 > 0, there exists C > 0 such that the following hold with high probability:
-
If η is sufficiently small and α ≥ c2ρη, then for both algorithms
in addition, if Assumption 4.3 also holds, then for Algorithm 4 we have
where .
In contrast to the fixed contraction rate derived by Shamir et al. (2014), Theorem 4.1 explains the significant benefit of a large local sample size even in the presence of a non-smooth penalty: the optimization error shrinks by a factor that converges to zero explicitly in n. As a brief illustration, consider the case with smooth loss functions, sufficient local sample size and no regularization. Let θ0 be the average of the individual estimators on the node machines. By Corollary 2 in Zhang et al. (2013), this simple divide-and-conquer estimator has accuracy . Using the explicit expression of η, we can easily deduce from Theorem 4.1 (b) that the one-step estimator θ1 obtained by Algorithm 3 behaves the same as the global minimizer θ if the local sample size satisfies n3 ≫ N(κ2 p log N)(p+κ2 log p). In this case, the local optimization in Algorithm 3 can further be replaced by the explicit one-step estimator as in Bickel (1975) and Jordan et al. (2019), since the initial estimator lies in a consistent neighborhood. More generally, the t-step estimator θt has negligible optimization error under the even weaker local sample size requirement nt+2 ≫ N(κ2 p log N)t (p+κ2 log p). A similar remark applies to Algorithm 4.
Similar to the deterministic results, the averaging step is about as effective as the optimization step when g is smooth and θt is sufficiently close to θ, in that after a finite number t0 of iterations. See a simplified example in Section 4.2 with α = 0.
As for the initialization, the condition θ0 ∈ B(θ, R/2) is mild, since θ is usually a consistent estimate and ∥θ*∥2 is bounded (Assumption 4.1). In contrast with Jordan et al. (2019), we allow an inaccurate initial value such as θ0 = 0 and give more explicit rates of contraction even when p and κ diverge. On the other hand, the accuracy of the initial estimator θ0 does help reduce the number of iterations.
Combining the results to be presented in Theorem 4.2, we will see that by choosing α ≍ ρη, Algorithms 3 and 4 inherit all the merits of Algorithms 1 and 2 in the large-n regime: fast linear contraction at rate , and, for Algorithm 4, an even faster rate of η = κ2p(log N)/n towards θ when the loss and the penalty functions are smooth. These facts also guarantee that Algorithms 3 and 4 reach statistical efficiency in iterations. On the other hand, compared to Algorithms 1 and 2, Algorithms 3 and 4 overcome the difficulties caused by a small local sample size n: as long as n/p is bounded from below by some constant (which is reasonable for many big-data problems of interest), shrinkage of the optimization error is guaranteed. Moreover, while it is hard to check whether n is sufficiently large in practice, a proper choice of α always guarantees linear convergence, and the contraction rates adapt to the sample size n. In this way, Algorithms 3 and 4 resolve the main issues of their vanilla versions.
We can obtain stronger results in the specific case of distributed linear regression, where the contraction rate has nearly no dependence on the condition number κ. Due to space constraints, we put all the details in Appendix B.
4.2. Multi-step estimators in large-sample regimes
We now present the contraction of optimization error of Algorithms 1 and 2.
Theorem 4.2. Suppose that Assumptions 4.1 and 4.2 hold, and with probability tending to one, θ0 ∈B(θ, R) for some R>∥θ–θ*∥2. For Algorithms 1 and 2, we have
where η = κ2p(log N)/n. In addition, let Assumption 4.3 also hold. There exists some constant C such that for Algorithm 2 we have
(4.1)
where .
The strengthened result (4.1) requires Assumption 4.3 and thus a smooth g. Theorem 4.2 shows that when n is sufficiently large, Algorithms 1 and 2 behave similarly to Algorithms 3 and 4: faster convergence with larger n, mild restrictions on initialization, and averaging speeds up contraction for smooth losses. However, there is no convergence guarantee in general regimes. Section 5 further shows that with insufficient local sample size, the practical performance of Algorithms 1 and 2 is less satisfactory.
Finally, to see how the averaging step reduces the statistical error (i.e. the distance between the estimator and θ*), we continue to look at the linear regression example mentioned at the end of Section 2.2. For simplicity, assume {xi}i∈[N] are i.i.d. standard normal random vectors. (2.3) and (2.4) can be expressed as
If further n ≫ p, then the two contraction factors satisfy
(4.2)
(4.3)
When Algorithm 1 is applied to this problem, the expression changes slightly to
The smaller magnitude in (4.3) is due to the averaging.
Suppose that we initialize the algorithm using the one-shot average , where θk is the least squares solution on the kth machine. When , we have . Zhang et al. (2013) assert that . Thus, the initial statistical error is . By the contraction properties in (4.2) and (4.3), the optimization errors are
(33)
We can see that when (or equivalently, p3/4N1/4 ≪ n ≪ p3/2N1/3), the optimization error is negligible for θ1, but not for θ1,k. When n is smaller, even more iterations are needed. A refined analysis of distributed least squares is given in Appendix B.
4.3. Guidance on practice
We now provide some general guidance on how to implement the CEASE algorithm in practice. First, we recommend choosing an initialization depending on the magnitude of the local sample size n. In particular, when n is not very large compared to the dimension p, a zero initialization would be more robust. On the other hand, given a moderate or large n, the one-shot average estimator will lead to extremely fast convergence.
Second, according to Theorem 4.1, it suffices to take α of the order ρκ2p log N/n. In practice, setting α to be a small multiple of p/n seems suitable on many occasions.
Finally, as already shown, both Algorithms 3 and 4 reach statistical efficiency in iterations. In both our simulations and real data examples, the CEASE algorithms with a properly chosen α converge to the centralized estimator within 10 iterations. With a moderate n, a warm start further boosts the convergence speed.
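As a compact summary of this guidance, here is a small helper sketch. The constant 0.15 follows the choice used in the experiments of Section 5, while the threshold for switching between the two initializations is our own illustrative choice rather than a recommendation from the paper.

```python
import numpy as np

def cease_defaults(n, p, theta_oneshot=None):
    """Practical defaults for running CEASE, following the guidance above.

    alpha is set to a small multiple of p/n (0.15 * p/n, as in the experiments);
    the one-shot average is used as a warm start only when the local sample size
    is comfortably larger than the dimension (the factor 10 is illustrative only).
    """
    alpha = 0.15 * p / n
    if theta_oneshot is not None and n >= 10 * p:
        theta0 = np.asarray(theta_oneshot, dtype=float)
    else:
        theta0 = np.zeros(p)
    return alpha, theta0
```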
5. Numerical experiments
5.1. Synthetic data
We first conduct distributed logistic regression to illustrate the effect of local sample size and initialization on convergence. We keep the total sample size N = 10000 and the dimensionality p = 101 fixed, and generate the i.i.d. data as follows: with ui ~ N(0p–1, Σ) and ; where is a random vector with norm 3 whose direction is chosen uniformly at random from the sphere. We use the natural logarithm of the estimation error ∥θt–θ*∥2 to measure the performance of different algorithms, including multiple versions of the CEASE algorithms, GIANT (Wang et al., 2018), ADMM (Boyd et al., 2011) and accelerated gradient descent (Nesterov, 1983).
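The following sketch mirrors this data-generating process; since the covariance Σ and the exact covariate construction are not fully reproduced here, identity covariance and an intercept coordinate are placeholder assumptions of ours.

```python
import numpy as np

# Sketch of the synthetic logistic-regression data described above (assumptions noted in the text).
rng = np.random.default_rng(0)
N, p = 10000, 101
direction = rng.standard_normal(p)
theta_star = 3.0 * direction / np.linalg.norm(direction)    # random direction with norm 3
u = rng.standard_normal((N, p - 1))                          # u_i ~ N(0, Sigma); Sigma = I assumed here
x = np.hstack([np.ones((N, 1)), u])                          # intercept coordinate assumed, so dim(x_i) = p
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x @ theta_star)))   # Bernoulli responses from the logistic model
```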
Figure 1 shows how the estimation errors evolve with iterations. The curves show the average values over 100 independent runs, and the error bands correspond to one standard deviation. The regimes “large n”, “moderate n” and “small n” refer to (n,m) = (2000, 5), (1000, 10) and (250, 40); “zero initialization” and “good initialization” refer to θ0 = 0 (bottom panel) and (top panel), respectively. Here is the one-shot distributed estimator (Zhang et al., 2013) that averages the individual estimators on the node machines. According to Figure 1, the standard deviation of each iterate is around 0.1. In the “large n, good initialization” regime, all of the iterates are unsurprisingly very close to the optimal solution, and their error bands would cover up the curves; we therefore omit the bands in that case for clarity.
Fig. 1.
Impacts of local sample size and initialization on convergence. The x-axis and y-axis are the number of iterations and log ∥θt–θ*∥2. The dashed lines show the error of the minimizer of the overall loss function. The top and bottom panels use and 0 for initialization, respectively. CEASE(a) and CEASE(0) refer to Algorithm 4 with α = 0.15p/n and 0; CEASE-single(a) and CEASE-single(0) refer to Algorithm 3 with α = 0.15p/n and 0, respectively. In particular, CEASE-single(0) is equivalent to the CSL algorithm in Jordan et al. (2019).
With proper regularization, the two CEASE algorithms are the only ones that converge rapidly in all scenarios. The purely deterministic methods ADMM (Boyd et al., 2011) and accelerated gradient descent (Nesterov, 1983) are also reliable but slow. Other distributed algorithms like unregularized CEASE and GIANT (Wang et al., 2018) easily fail when the local sample size is small or the initialization is uninformative. In addition, the CEASE with averaging (Algorithm 4) is superior to the one without averaging (Algorithm 3). For example, when (n,m) = (1000,10), the averaged CEASE with α = 0 converges while the one without averaging does not. Hence the averaging step leads to better performance.
We also test the efficacy of our algorithms on distributed ℓ1-regularized logistic regression, where the penalty g is nonsmooth (see Appendix D for details). To summarize, our simulations demonstrate several important properties of the CEASE algorithms:
In all scenarios, the CEASE Algorithms converge rapidly, usually within several steps, which is consistent with our theory;
The CEASE Algorithms efficiently utilize statistical structures and similarities among local losses, and benefit from the averaging step with smooth loss functions;
The CEASE Algorithms are also able to handle the most general situations (e.g. small local sample size, uninformative initialization) with convergence guarantees.
5.2. Real data
As a real data example, we choose the Fashion-MNIST dataset (Xiao et al., 2017) as a testbed for comparison of algorithms. The whole dataset consists of 70000 grayscale images of fashion products in 10 classes, each of which has 6000 training samples and 1000 testing samples. We choose the 7th and 9th classes (Sneakers and Ankle boots) and the goal is to train a classifier that distinguishes them. Each image has 28 × 28 = 784 pixels, represented by a feature vector in [0, 1]784. The number of training (or testing) samples is 6000 × 2 = 12000 (or 1000 × 2 = 2000). We randomly partition the training set and conduct logistic regression in a distributed manner. The performance metric is the classification error on the testing set. Figure 2 shows the average performance of the CEASE algorithms, ADMM, GIANT and AGD based on 100 independent runs, together with error bars showing one standard deviation. Here “large n”, “moderate n” and “small n” refer to (n,m) = (1200, 10), (480, 25) and (240, 50), respectively. All of the iterations are initialized with the one-shot average (Zhang et al., 2013). The experiments on this real data example also support our theoretical findings.
Fig. 2.
Fashion-MNIST dataset. The x-axis and y-axis are the number of iterations and the testing error. The dashed lines show the error of the classifier based on all of the training samples. All of the iterations are initialized with the one-shot average . CEASE(a) and CEASE(0) refer to Algorithm 4 with α = 0.15p/n and 0; CEASE-single(a) and CEASE-single(0) refer to Algorithm 3 with α = 0.15p/n and 0, respectively. In particular, CEASE-single(0) is equivalent to the CSL algorithm in Jordan et al. (2019). GIANT and CEASE-single(0) do not converge to the optimal solution.
6. Discussions
We have developed two CEASE distributed estimators (Algorithms 3 and 4) for statistical estimation, with theoretical guarantees and superior performance on real data. Several new directions are worth exploring. First, while we assumed exact computation for simplicity, finer analysis should allow for inexact updates in practice. Second, we hope to extend the algorithms to decentralized and asynchronous settings. Third, distributed versions of confidence regions and hypothesis tests are of great importance, and our point estimation strategies may serve as a starting point. Finally, it will be interesting to explore non-convex statistical optimization problems such as mixture models and deep learning. We believe that the idea of gradient-enhanced loss function still plays an important role.
Acknowledgments
We gratefully acknowledge NSF grants DMS-1662139, DMS-1712591, DMS-2053832, DMS-2052926, NIH grant 2R01-GM072611-15, and ONR grant N00014-19-1-2120. We acknowledge computing resources from Columbia University’s Shared Research Computing Facility project, which is supported by NIH Research Facility Improvement Grant 1G20-RR030893-01, and associated funds from the New York State Empire State Development, Division of Science Technology and Innovation (NYSTAR) Contract C090171, both awarded April 15, 2010.
Footnotes
SUPPLEMENTARY MATERIAL
Supplementary material: The file “supplementary.pdf” contains more details and proofs of the results in this paper.
According to Nocedal and Wright (2006), a sequence {xn} in ℝp is said to converge Q-linearly to x* if there exists r ∈ (0, 1) such that ∥xn+1–x*∥2 ≤ r∥xn–x*∥2 for n sufficiently large.
Contributor Information
Jianqing Fan, Department of ORFE, Princeton University.
Yongyi Guo, Department of ORFE, Princeton University.
Kaizheng Wang, Department of IEOR, Columbia University.
References
- Arjevani Y and Shamir O (2015). Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems, pages 1756–1764.
- Banerjee M, Durot C, and Sen B (2019). Divide and conquer in nonstandard problems and the super-efficiency phenomenon. The Annals of Statistics, 47(2):720–757.
- Battey H, Fan J, Liu H, Lu J, and Zhu Z (2018). Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3):1352–1382.
- Bickel PJ (1975). One-step Huber estimates in the linear model. Journal of the American Statistical Association, 70(350):428–434.
- Boyd S, Parikh N, Chu E, Peleato B, and Eckstein J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122.
- Chen X, Liu W, and Zhang Y (2021). First-order Newton-type estimator for distributed estimation and inference. Journal of the American Statistical Association, pages 1–17.
- Chen X and Xie M.-g. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, pages 1655–1684.
- Crane R and Roosta F (2019). DINGO: Distributed Newton-type method for gradient-norm optimization. In Advances in Neural Information Processing Systems, volume 32.
- Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
- Fan J, Wang D, Wang K, and Zhu Z (2019). Distributed estimation of principal eigenspaces. The Annals of Statistics, 47(6):3009.
- Garber D, Shamir O, and Srebro N (2017). Communication-efficient algorithms for distributed stochastic principal component analysis. In International Conference on Machine Learning, volume 70, pages 1203–1212.
- Han Y, Mukherjee P, Ozgur A, and Weissman T (2018). Distributed statistical estimation of high-dimensional and nonparametric distributions. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 506–510. IEEE.
- Hoerl AE and Kennard RW (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
- Jordan MI, Lee JD, and Yang Y (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681.
- Kleiner A, Talwalkar A, Sarkar P, and Jordan MI (2014). A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):795–816.
- Lee JD, Lin Q, Ma T, and Yang T (2017a). Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research, 18(1):4404–4446.
- Lee JD, Liu Q, Sun Y, and Taylor JE (2017b). Communication-efficient sparse regression. The Journal of Machine Learning Research, 18(5):1–30.
- Nesterov YE (1983). A method for solving the convex programming problem with convergence rate O(1/k2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.
- Nocedal J and Wright SJ (2006). Numerical Optimization (Second Edition). Springer.
- Robinson PM (1988). The stochastic difference between econometric statistics. Econometrica: Journal of the Econometric Society, pages 531–548.
- Rockafellar RT (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898.
- Rosenblatt JD and Nadler B (2016). On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379–404.
- Shamir O, Srebro N, and Zhang T (2014). Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000–1008.
- Shang Z and Cheng G (2017). Computational limits of a distributed algorithm for smoothing spline. The Journal of Machine Learning Research, 18(1):3809–3845.
- Shi C, Lu W, and Song R (2018). A massive data framework for M-estimators with cubic-rate. Journal of the American Statistical Association, pages 1–12.
- Szabó B and Van Zanten H (2019). An asymptotic analysis of distributed nonparametric methods. The Journal of Machine Learning Research, 20(87).
- Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
- Volgushev S, Chao S-K, and Cheng G (2019). Distributed inference for quantile regression processes. The Annals of Statistics, 47(3):1634–1662.
- Wang J, Kolar M, Srebro N, and Zhang T (2017a). Efficient distributed learning with sparsity. In International Conference on Machine Learning, pages 3636–3645.
- Wang J, Wang W, and Srebro N (2017b). Memory and communication efficient distributed stochastic optimization with minibatch prox. In Conference on Learning Theory, pages 1882–1919.
- Wang S, Roosta-Khorasani F, Xu P, and Mahoney MW (2018). GIANT: Globally improved approximate Newton method for distributed optimization. In Advances in Neural Information Processing Systems, pages 2338–2348.
- Wang X and Dunson DB (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
- Xiao H, Rasul K, and Vollgraf R (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.
- Zhang Y, Duchi JC, and Wainwright MJ (2013). Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research, 14:3321–3363.
- Zhang Y and Xiao L (2015). DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370.