Author manuscript; available in PMC: 2024 Aug 5.
Published in final edited form as: J Mach Learn Res. 2023;24:262.

Minimax Estimation for Personalized Federated Learning: An Alternative between FedAvg and Local Training?

Shuxiao Chen, Qinqing Zheng, Qi Long, Weijie J Su
PMCID: PMC11299893  NIHMSID: NIHMS1961790  PMID: 39105110

Abstract

A widely recognized difficulty in federated learning arises from the statistical heterogeneity among clients: local datasets often originate from distinct yet not entirely unrelated probability distributions, and personalization is, therefore, necessary to achieve optimal results from each individual’s perspective. In this paper, we show how the excess risks of personalized federated learning using a smooth, strongly convex loss depend on data heterogeneity from a minimax point of view, with a focus on the FedAvg algorithm (McMahan et al., 2017) and pure local training (i.e., clients solve empirical risk minimization problems on their local datasets without any communication). Our main result reveals an approximate alternative between these two baseline algorithms for federated learning: the former algorithm is minimax rate optimal over a collection of instances when data heterogeneity is small, whereas the latter is minimax rate optimal when data heterogeneity is large, and the threshold is sharp up to a constant.

As an implication, our results show that from a worst-case point of view, a dichotomous strategy that makes a choice between the two baseline algorithms is rate-optimal. Another implication is that the popular strategy of FedAvg followed by local fine-tuning is also minimax optimal under additional regularity conditions. Our analysis relies on a new notion of algorithmic stability that takes into account the nature of federated learning.

Keywords: Empirical Risk Minimization, Federated Learning, Personalization, Data Heterogeneity, Minimax Rates, Algorithmic Stability

1. Introduction

As one of the most important ingredients driving the success of machine learning, data are being generated and subsequently stored in an increasingly decentralized fashion in many real-world applications. For example, mobile devices will in a single day collect an unprecedented amount of data from users. These data commonly contain sensitive information such as web search histories, online shopping records, and health information, and thus are often not available to service providers (Poushter, 2016). This decentralized nature of (sensitive) data poses substantial challenges to many machine learning tasks.

To address this issue, McMahan et al. (2017) proposed a new learning paradigm, which they termed federated learning, for collaboratively training machine learning models on data that are locally possessed by multiple clients with the coordination of the central server (e.g., service provider), without having direct access to the local datasets. In its simplest form, federated learning considers a pool of $m$ clients, where the $i$-th client has a local dataset $S_i$ of size $n_i$, consisting of i.i.d. samples $\{z_j^{(i)} : j \in [n_i]\}$ (denote $[n] := \{1, 2, \ldots, n\}$) from some unknown distribution $\mathcal{D}_i$. Letting $\ell(w, z)$ be a loss function, where $w$ denotes the model parameter, the optimal local model for the $i$-th client is given by

$$w_\star^{(i)} \in \operatorname*{argmin}_{w} \; \mathbb{E}_{Z_i \sim \mathcal{D}_i}\, \ell(w, Z_i). \tag{1}$$

From the client-wise perspective, any data-dependent estimator $\widehat{w}^{(i)}(S)$, with $S = \{S_i\}_{i=1}^m$ denoting the collection of all samples, can be evaluated based on its individualized excess risk:

$$\mathrm{IER}_i := \mathbb{E}_{Z_i \sim \mathcal{D}_i}\big[\ell(\widehat{w}^{(i)}, Z_i) - \ell(w_\star^{(i)}, Z_i)\big],$$

where the expectation is taken over a fresh sample $Z_i \sim \mathcal{D}_i$. At a high level, federated learning aims to obtain possibly different trained models for each client such that the individualized excess risks are low (see, e.g., Kairouz et al. 2019).

From a statistical viewpoint, perhaps the most crucial factor in determining the effectiveness of federated learning is data heterogeneity. When the data distribution $\mathcal{D}_i$ is (approximately) homogeneous across different clients, presumably a single global model would lead to small $\mathrm{IER}_i$ for all $i$. In this regime, indeed, McMahan et al. (2017) proposed the federated averaging algorithm (FedAvg, see Algorithm 1), which can be regarded as an instance of local stochastic gradient descent (SGD) for solving (Mangasarian and Solodov, 1993; Stich, 2019)

$$\min_{w} \; \frac{1}{N} \sum_{i \in [m]} n_i L_i(w, S_i), \tag{2}$$

where $L_i(w, S_i) := \sum_{j \in [n_i]} \ell(w, z_j^{(i)}) / n_i$ is the empirical risk minimization (ERM) objective of the $i$-th client and $N = n_1 + \cdots + n_m$ denotes the total number of training samples. Translating Algorithm 1 into words, FedAvg in effect learns a shared global model using gradients from each client and outputs a single model as an estimate of $w_\star^{(i)}$ for all clients. When the distributions $\mathcal{D}_i$ coincide with each other, FedAvg with a strongly convex loss achieves a weighted average excess risk of $\mathcal{O}(1/N)$, which is minimax optimal up to a constant factor (Shalev-Shwartz et al., 2009; Agarwal et al., 2012); see the formal statement in Theorem 6.

However, it is an entirely different story in the presence of data heterogeneity. FedAvg has been recognized to give inferior performance when there is a significant departure from complete homogeneity (see, e.g., Bonawitz et al. 2019). To better understand this point, consider the extreme case where the data distributions $\mathcal{D}_i$ are entirely unrelated. This roughly amounts to saying that the model parameters $w_\star^{(i)}$ can be arbitrarily different from each other. In such a “completely heterogeneous” scenario, the objective function (2) simply has no clear interpretation, and any single global model (for example, the output of FedAvg) would lead to unbounded risks for most, if not all, clients. As a matter of fact, it is not difficult to see that the optimal training strategy for federated learning in this regime is arguably PureLocalTraining, which lets each client separately run SGD to minimize its own local ERM objective

$$\min_{w^{(i)}} \; L_i(w^{(i)}, S_i) \tag{3}$$

without any communication. Indeed, PureLocalTraining is minimax rate optimal in the completely heterogeneous regime, just as FedAvg in the completely homogeneous regime (see Theorem 4).
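As a minimal sketch of the contrast between objectives (2) and (3), consider scalar data with the squared loss $\ell(w, z) = (w - z)^2/2$, an illustrative assumption under which both objectives have closed-form minimizers. The function names and the toy data below are hypothetical, and the code computes the minimizers directly rather than running FedAvg or SGD iterations.

```python
# Illustrative assumption: scalar data with squared loss l(w, z) = (w - z)^2 / 2,
# for which objectives (2) and (3) have closed-form minimizers.

def local_erm(samples):
    """Minimizer of the local objective (3): the client's sample mean."""
    return sum(samples) / len(samples)

def global_erm(datasets):
    """Minimizer of the pooled objective (2): the mean of all samples."""
    total = sum(len(s) for s in datasets)
    return sum(sum(s) for s in datasets) / total

# Hypothetical toy data for m = 3 clients with different local means.
datasets = [[1.0, 1.2, 0.8], [3.0, 3.4], [2.0]]
w_global = global_erm(datasets)               # one shared model for all clients
w_local = [local_erm(s) for s in datasets]    # one model per client
```

In this toy setting the shared model is the sample-size-weighted average of the local models, so it is a good estimate for every client only when the local means are close to each other.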

[Graphic nihms-1961790-f0001.jpg not reproduced here; the surrounding text refers to it as Algorithm 1 (FedAvg).]

The level of data heterogeneity in practical federated learning problems is apparently neither complete homogeneity nor complete heterogeneity. Thus, the foregoing discussion raises a pressing question of what would happen if we are in the wide middle ground of the two extremes. This underlines the essence of personalized federated learning, which seeks to develop algorithms that perform well over a wide spectrum of data heterogeneity. Despite a venerable line of work on personalized federated learning (see, e.g., Kulkarni et al. 2020), the literature remains relatively silent on how the fundamental limits of personalized federated learning depend on data heterogeneity, as opposed to two extreme cases where both the minimax optimal rates and algorithms are known.

1.1. Main Contributions

The present paper takes a step toward understanding the statistical limits of personalized federated learning by establishing the minimax rates of convergence for both individualized excess risks and their weighted average with smooth strongly convex losses. We briefly summarize our main contributions below.

  1. We prove that if the client-wise sample sizes are relatively balanced, then there exists a problem instance on which the $\mathrm{IER}_i$’s of any algorithm are lower bounded by
    $$\begin{cases} \Omega(1/N + R^2) & \text{if } R^2 = \mathcal{O}(m/N), \\ \Omega(m/N) & \text{if } R^2 = \Omega(m/N), \end{cases} \tag{4}$$
    where $R$ is the minimum quantity satisfying $\min_{w \in \mathcal{W}} \sum_{i \in [m]} n_i \|w_\star^{(i)} - w\|^2 / N \le R^2$, i.e., it measures the maximum level of heterogeneity among clients (here $\|\cdot\|$ throughout the paper denotes the Euclidean norm). Meanwhile, we show that the $\mathrm{IER}_i$’s of FedAvg are upper bounded by $\mathcal{O}(1/N + R^2)$, whereas the guarantee for PureLocalTraining is $\mathcal{O}(m/N)$, regardless of the specific value of $R$. Moreover, we also establish similar upper and lower bounds for a weighted average of the $\mathrm{IER}_i$’s under a weaker condition.
  2. A closer look at the above-mentioned bounds reveals a perhaps surprising phenomenon: for a given collection of problem instances with a specified maximum level of heterogeneity, exactly one of FedAvg or PureLocalTraining is minimax optimal.

  3. The established minimax results suggest that the naïve dichotomous strategy of (1) running FedAvg when $R^2 = \mathcal{O}(m/N)$, and (2) running PureLocalTraining when $R^2 = \Omega(m/N)$, attains the lower bound (4). Moreover, for supervised problems, this dichotomous strategy can be implemented without knowing $R$ by (1) running both FedAvg and PureLocalTraining, (2) evaluating the test errors of the two algorithms in a distributed fashion, and (3) deploying the algorithm with a lower test error. We emphasize that the notion of optimality under our consideration overlooks constant factors. In practice, a better personalization result could be achieved by more sophisticated algorithms.

  4. As a side product, we provide a novel analysis of FedProx, a popular algorithm for personalized federated learning that constrains the learned local models to be close via $\ell_2$ regularization (Li et al., 2018). In particular, we show that its $\mathrm{IER}_i$’s are of order $\mathcal{O}\big(\frac{1}{N/m} \wedge \frac{R}{\sqrt{N/m}} + \frac{\sqrt{m}}{N}\big)$, and a weighted average of the $\mathrm{IER}_i$’s satisfies a tighter $\mathcal{O}\big(\frac{1}{N/m} \wedge \frac{R}{\sqrt{N/m}} + \frac{1}{N}\big)$ bound, where $a \wedge b = \min\{a, b\}$ for two real numbers $a$ and $b$.

  5. On the technical side, our upper bound analysis is based on a generalized notion of algorithmic stability (Bousquet and Elisseeff, 2002), which we term federated stability and which may be of independent interest. Briefly speaking, an algorithm $\mathcal{A}(S) = \{\widehat{w}^{(i)}(S)\}$ has federated stability $\{\gamma_i\}$ if for any $i \in [m]$, the loss function evaluated at $\widehat{w}^{(i)}(S)$ can only change by an additive term of $\mathcal{O}(\gamma_i)$ if we perturb a single record of $S_i$, while keeping the rest of the datasets $\{S_{i'} : i' \ne i\}$ fixed. Similar ideas have appeared in Maurer (2005) and have been recently applied to multi-task learning (Wang et al., 2018). However, their notion of perturbation is based on the deletion of the whole client-wise dataset, whereas our notion of federated stability operates at the “record-level” and is more fine-grained. On the other hand, our construction of the lower bound is based on a generalization of Assouad’s lemma (Assouad, 1983) (see also Yu 1997), which enables us to handle multiple heterogeneous datasets.
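The model-selection procedure described in the third contribution (run both baselines, compare test errors, deploy the better one) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the weighting of clients, and the squared loss used in the usage line are hypothetical choices, and the candidate models are taken as given inputs rather than trained here.

```python
def weighted_test_error(models, holdouts, weights, loss):
    """Weighted average of per-client held-out losses; models[i] is client i's model."""
    total = 0.0
    for w, holdout, p in zip(models, holdouts, weights):
        total += p * sum(loss(w, z) for z in holdout) / len(holdout)
    return total

def choose_strategy(w_fedavg, w_local, holdouts, weights, loss):
    """Return the name of the candidate with the lower weighted held-out loss."""
    # FedAvg deploys the same shared model on every client.
    err_fa = weighted_test_error([w_fedavg] * len(holdouts), holdouts, weights, loss)
    # PureLocalTraining deploys each client's own model.
    err_plt = weighted_test_error(w_local, holdouts, weights, loss)
    return "FedAvg" if err_fa <= err_plt else "PureLocalTraining"

# Hypothetical loss for scalar models, used only for illustration.
squared = lambda w, z: 0.5 * (w - z) ** 2
```

For example, `choose_strategy(2.6, [0.1, 5.1], [[0.0, 0.2], [5.0, 5.2]], [0.5, 0.5], squared)` selects PureLocalTraining, because the two clients' held-out data are far apart and a single shared model fits neither.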

1.2. Related Work

Ever since the proposal of federated learning by McMahan et al. (2017), recent years have witnessed a rapidly growing line of work that is concerned with various aspects of FedAvg and its variants (see, e.g., Khaled et al. 2019; Haddadpour and Mahdavi 2019; Li et al. 2020b; Bayoumi et al. 2020; Malinovsky et al. 2020; Li and Richtárik 2020; Woodworth et al. 2020; Yuan and Ma 2020; Zheng et al. 2021).

In the context of personalized federated learning, there have been significant algorithmic developments in recent years. While the idea of using $\ell_2$ regularization to constrain the learned models to be similar has appeared in early works on multi-task learning (Evgeniou and Pontil, 2004), its applicability to personalized federated learning was only recently demonstrated by Li et al. (2018), where the FedProx algorithm was introduced. Similar regularization-based methods have been proposed and analyzed from the scope of convex optimization in Hanzely and Richtárik (2020); Dinh et al. (2020), and Hanzely et al. (2020). In particular, Hanzely et al. (2020) showed that an accelerated variant of FedProx is optimal in terms of communication complexity and the local oracle complexity. There is also a line of work using model-agnostic meta learning (Finn et al., 2017) to achieve personalization (Jiang et al., 2019; Fallah et al., 2020). Other strategies have been proposed (see, e.g., Arivazhagan et al. 2019; Li and Wang 2019; Mansour et al. 2020; Yu et al. 2020), and we refer readers to Kulkarni et al. (2020) for a comprehensive survey. We briefly remark here that all the papers mentioned above only consider the optimization properties of their proposed algorithms, while we focus on statistical properties of personalized federated learning.

Compared to the optimization understanding, our statistical understanding (in terms of sample complexity) of federated learning is still limited. Deng et al. (2020) proposed an algorithm for personalized federated learning with learning-theoretic guarantees. However, it is unclear how their bound scales with the heterogeneity among clients.

More generally, exploiting the information “shared among multiple learners” is a theme that constantly appears in other fields of machine learning such as multi-task learning (Caruana, 1997), meta learning (Baxter, 2000), and transfer learning (Pan and Yang, 2009), from which we borrow a lot of intuitions (see, e.g., Ben-David et al. 2006; Ben-David and Borbely 2008; Ben-David et al. 2010; Maurer et al. 2016; Cai and Wei 2019; Hanneke and Kpotufe 2019, 2020; Du et al. 2020; Tripuraneni et al. 2020a, b; Kalan et al. 2020; Shui et al. 2020; Li et al. 2020a; Zhang et al. 2020; Jose and Simeone 2021).

More closely related to our work, a series of works by Denevi et al. (2018, 2019); Balcan et al. (2019), and Khodak et al. (2019) assumes the optimal local models lie in a small sub-parameter-space, and establishes “heterogeneity-aware” bounds on a weighted average of individualized excess risks. However, we would like to point out that they operate under the online learning setup, where the datasets are assumed to come in streams, and this is in sharp contrast to the federated learning setup, where the datasets are decentralized. Our notion of heterogeneity is also related to the hierarchical Bayesian model considered in Bai et al. (2020); Lucas et al. (2020); Konobeev et al. (2020), and Chen et al. (2020).

1.3. Paper Organization

The rest of this paper is organized as follows. In Section 2, we give an exposition of the problem setup and main assumptions. Section 3 presents our main results with proof sketches. We conclude this paper with a discussion of open problems in Section 4. For brevity, detailed proofs are deferred to the appendix.

2. Problem Setup

In this section, we detail some preliminaries to prepare the readers for our main results.

Notation.

We introduce the notation we are going to use throughout this paper. For two real numbers $a, b$, we let $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. For two non-negative sequences $a_n, b_n$, we denote $a_n \lesssim b_n$ (resp. $a_n \gtrsim b_n$) if $a_n \le C b_n$ (resp. $a_n \ge C b_n$) for some constant $C > 0$ when $n$ is sufficiently large. We use $a_n \asymp b_n$ to indicate that $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ hold simultaneously. We also use $a_n = \mathcal{O}(b_n)$, whose meaning is the same as $a_n \lesssim b_n$, and $a_n = \Omega(b_n)$, whose meaning is the same as $a_n \gtrsim b_n$. For two probability distributions $\mathcal{D}_1$ and $\mathcal{D}_2$, we use $\mathcal{D}_1 \otimes \mathcal{D}_2$ to denote their joint distribution under independence. We use $\mathcal{W}$ to denote the parameter space and $\mathcal{Z}$ to denote the sample space. Finally, we let $\mathcal{P}_{\mathcal{W}}(x) := \operatorname*{argmin}_{y \in \mathcal{W}} \|x - y\|$ denote the operator that projects $x$ onto $\mathcal{W}$ in Euclidean distance.

Evaluation Metrics.

The presentation of our main results relies on how to evaluate the performance of a federated learning algorithm. To this end, we consider the following two evaluation metrics.

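Definition 1 (Individualized excess risk) Consider an algorithm $\mathcal{A}$ that outputs $\mathcal{A}(S) = \{\widehat{w}^{(i)}(S)\}_{i=1}^m$. For the $i$-th client, its individualized excess risk (IER) is defined as

$$\mathrm{IER}_i(\mathcal{A}) := \mathbb{E}_{Z_i \sim \mathcal{D}_i}\big[\ell(\widehat{w}^{(i)}(S), Z_i) - \ell(w_\star^{(i)}, Z_i)\big], \tag{5}$$

where $Z_i \sim \mathcal{D}_i$ is a fresh data point independent of $S$.

Definition 2 (p-average excess risk) Consider an algorithm $\mathcal{A}$ that outputs $\mathcal{A}(S) = \{\widehat{w}^{(i)}(S)\}$. For a vector $p = (p_1, \ldots, p_m)$ lying in the $m$-dimensional probability simplex (i.e., all $p_i$’s are non-negative and they sum to one), we define the $p$-average excess risk ($\mathrm{AER}_p$) of $\mathcal{A}$ to be

$$\mathrm{AER}_p(\mathcal{A}) := \sum_{i \in [m]} p_i \, \mathrm{IER}_i(\mathcal{A}). \tag{6}$$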

In words, IER measures the performance of the algorithm from the client-wise perspective, whereas AER evaluates the performance of the algorithm from the system-wide perspective.

Intuitively speaking, the weight vector $p$ in (6) can be regarded as the importance weight on each client and controls “how many resources are allocated to each client”. For example, setting $p_i = 1/m$ enforces “fair allocation”, so that each client is treated uniformly, regardless of sample sizes. As another example, setting $p_i = n_i/N$ (recall that $N = \sum_{i \in [m]} n_i$ is the total sample size) means that the central server pays more attention to clients with larger sample sizes, which, to a certain extent, incentivizes the clients to contribute more data.

Notably, while a uniform upper bound on all $\mathrm{IER}_i$’s can be carried over to the same bound on $\mathrm{AER}_p$, a bound on the $\mathrm{AER}_p$ alone in general does not imply a tight bound on each $\mathrm{IER}_i$, other than the trivial bound $\mathrm{IER}_i \le \mathrm{AER}_p / p_i$. Such a subtlety is a distinguishing feature of personalized federated learning in the following sense: under homogeneity, it suffices to estimate a single shared global model, and thus $\mathrm{AER}_p$ and all of the $\mathrm{IER}_i$’s are mathematically equivalent.
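The relationship between the two metrics can be illustrated with a few lines of code. The sketch below uses made-up IER values and sample sizes (both hypothetical) to compute $\mathrm{AER}_p$ under the two weightings discussed above and to check the trivial bound $\mathrm{IER}_i \le \mathrm{AER}_p / p_i$.

```python
def aer(p, ier):
    """p-average excess risk (6): a convex combination of the client-wise IERs."""
    assert abs(sum(p) - 1.0) < 1e-12 and all(pi >= 0 for pi in p)
    return sum(pi * ri for pi, ri in zip(p, ier))

# Hypothetical sample sizes and individualized excess risks for m = 3 clients.
n = [10, 30, 60]
ier = [0.30, 0.10, 0.05]

p_uniform = [1 / len(n)] * len(n)        # "fair allocation": p_i = 1/m
p_size = [ni / sum(n) for ni in n]       # sample-size weighting: p_i = n_i/N

aer_uniform = aer(p_uniform, ier)
aer_size = aer(p_size, ier)
```

Here the small-sample client has the largest IER, so the sample-size weighting reports a smaller average than the uniform weighting, even though no individual client's risk has changed.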

Regularity Conditions.

In this paper, we restrict ourselves to bounded, smooth, and strongly convex loss functions. Such assumptions are common in the federated learning literature (see, e.g., Li et al. 2020b; Hanzely et al. 2020) and cover many unsupervised learning problems such as mean estimation in exponential families and supervised learning problems such as generalized linear models.

Assumption A (Regularity conditions) Suppose the following conditions hold:

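  (a) Compact and convex domain. The parameter space $\mathcal{W}$ is a compact convex subset of $\mathbb{R}^d$ with diameter $D := \sup_{w, w' \in \mathcal{W}} \|w - w'\| < \infty$;

  (b) Smoothness and strong convexity. For any $i \in [m]$, the loss function $\ell(\cdot, z)$ is $\beta$-smooth for almost every $z$ in the support of $\mathcal{D}_i$, and the $i$-th ERM objective $L_i(\cdot, S_i)$ is almost surely $\mu$-strongly convex on the convex domain $\mathcal{W} \subseteq \mathbb{R}^d$. We also assume that there exists a universal constant $\|\ell\|_\infty$ such that $0 \le \ell(\cdot, z) \le \|\ell\|_\infty$ for almost every $z$ in the support of $\mathcal{D}_i$;

  (c) Bounded gradient variance at optimum. There exists a positive constant $\sigma$ such that for any $i \in [m]$, we have $\mathbb{E}_{Z_i \sim \mathcal{D}_i} \|\nabla \ell(w_\star^{(i)}, Z_i)\|^2 \le \sigma^2$.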

Heterogeneity Conditions.

To quantify the level of heterogeneity among clients, we start by introducing the notion of an average global model. Assuming a strongly convex loss, the optimal local models (1) are uniquely defined. Thus, we can define the average global model as

$$w_p^{(\mathrm{global})} = \sum_{i \in [m]} p_i w_\star^{(i)}. \tag{7}$$

We remark that the average global model defined in (7) should not be interpreted as the “optimal global model”. Rather, it is more suitable to think of $w_p^{(\mathrm{global})}$ as a point in the parameter space to which every local model is close. Indeed, one can readily check that the average global model is the minimizer of $\sum_{i \in [m]} p_i \|w_\star^{(i)} - w\|^2$ over $w \in \mathbb{R}^d$.
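The minimizing property of the average global model follows from the identity $\sum_i p_i \|w_\star^{(i)} - w\|^2 = \sum_i p_i \|w_\star^{(i)} - w_p^{(\mathrm{global})}\|^2 + \|w - w_p^{(\mathrm{global})}\|^2$. A small numerical check, with hypothetical local models and weights, is sketched below.

```python
def weighted_mean(p, ws):
    """Average global model (7): the p-weighted mean of the local models."""
    d = len(ws[0])
    return [sum(pi * w[k] for pi, w in zip(p, ws)) for k in range(d)]

def weighted_sq_dist(p, ws, w):
    """Objective sum_i p_i * ||w_i - w||^2 evaluated at a candidate point w."""
    return sum(pi * sum((a - b) ** 2 for a, b in zip(wi, w))
               for pi, wi in zip(p, ws))

# Hypothetical optimal local models in R^2 and weights on the simplex.
w_star = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
p = [0.25, 0.25, 0.5]

w_glob = weighted_mean(p, w_star)
best = weighted_sq_dist(p, w_star, w_glob)
# Any other point pays an extra ||w - w_glob||^2 on top of the minimum.
other = weighted_sq_dist(p, w_star, [0.0, 0.0])
```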

We are now ready to quantify the level of client-wise heterogeneity as follows.

Assumption B (Level of heterogeneity) There exists a positive constant $R$ such that
  (a) either $\sum_{i \in [m]} p_i \|w_\star^{(i)} - w_p^{(\mathrm{global})}\|^2 \le R^2$,

  (b) or $\|w_\star^{(i)} - w_p^{(\mathrm{global})}\|^2 \le R^2$ for all $i \in [m]$.

Our study of the $\mathrm{AER}_p$ and the $\mathrm{IER}_i$’s will be based on Part (a) and Part (b) of Assumption B, respectively. Intuitively, the quantity $R$ encodes one’s belief on “how heterogeneous” the clients can be.
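To make the two parts of Assumption B concrete, the sketch below computes, for a given set of optimal local models, the smallest $R^2$ for which Part (a) and Part (b) hold. The models and sample sizes are hypothetical; in practice the $w_\star^{(i)}$ are unknown, so this is only an illustration of the definitions.

```python
def smallest_r2(p, w_star):
    """Return (R^2 for Part (a), R^2 for Part (b)) of Assumption B."""
    d = len(w_star[0])
    # Average global model (7).
    w_glob = [sum(pi * w[k] for pi, w in zip(p, w_star)) for k in range(d)]
    # Squared Euclidean distance of each local model to the global model.
    sq = [sum((a - b) ** 2 for a, b in zip(w, w_glob)) for w in w_star]
    r2_average = sum(pi * s for pi, s in zip(p, sq))   # Part (a): weighted average
    r2_uniform = max(sq)                               # Part (b): worst client
    return r2_average, r2_uniform

# Hypothetical sample sizes and local models; weights are p_i = n_i / N.
n = [1, 1, 2]
p = [ni / sum(n) for ni in n]
w_star = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
r2_a, r2_b = smallest_r2(p, w_star)
```

Part (b) is the stronger requirement: the weighted average in Part (a) never exceeds the maximum in Part (b).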

3. Main Results

3.1. Analyses of Two Baseline Algorithms

In this subsection, we characterize the performance of PureLocalTraining and FedAvg under the heterogeneity conditions imposed by Assumption B.

3.1.1. Warm Up: Uniform Stability and Analysis of PureLocalTraining

The analysis of PureLocalTraining is based on the classical notion of uniform stability, proposed by Bousquet and Elisseeff (2002).

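Definition 3 (Uniform stability) Consider an algorithm $\mathcal{A}$ that takes a single dataset $S = \{z_j\}_{j=1}^n$ of size $n$ as input and outputs a single model: $\mathcal{A}(S) = \widehat{w}(S)$. We say $\mathcal{A}$ is $\gamma$-uniformly stable if for any dataset $S$, any $j \in [n]$, and any $z_j' \in \mathcal{Z}$, we have

$$\big\|\ell(\widehat{w}(S), \cdot) - \ell(\widehat{w}(S^{\setminus j}), \cdot)\big\|_\infty \le \gamma,$$

where $S^{\setminus j}$ is the dataset formed by replacing $z_j$ with $z_j'$:

$$S^{\setminus j} = \{z_1, \ldots, z_{j-1}, z_j', z_{j+1}, \ldots, z_n\}.$$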

The main implication of uniformly stable algorithms is that “stable algorithms do not overfit”: if 𝒜 is γ-uniformly stable, then its generalization error is upper bounded by a constant multiple of γ. Thus, one can dissect the analysis of 𝒜 into two separate parts: (1) bounding its optimization error; (2) bounding its stability term.
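As a rough numerical illustration of uniform stability, the sketch below replaces one record of a dataset and measures how much the loss of the ERM solution changes. It assumes scalar data in $[0, 1]$ with squared loss, so that the ERM solution is the sample mean; the grid of test points is only a proxy for the supremum over the sample space, and a single dataset is checked rather than all of them.

```python
def erm_mean(samples):
    """ERM under squared loss on scalar data: the sample mean."""
    return sum(samples) / len(samples)

def loss(w, z):
    return 0.5 * (w - z) ** 2

def stability_gap(samples, j, z_new, test_points):
    """Largest change in loss over test_points after replacing record j by z_new."""
    perturbed = samples[:j] + [z_new] + samples[j + 1:]
    w, w_pert = erm_mean(samples), erm_mean(perturbed)
    return max(abs(loss(w, z) - loss(w_pert, z)) for z in test_points)

grid = [k / 100 for k in range(101)]   # proxy for the sample space Z = [0, 1]
# Replace one record 0.0 by the most distant value 1.0 for several sample sizes.
gaps = {n: stability_gap([0.0] * n, 0, 1.0, grid) for n in (10, 100, 1000)}
```

In this example the gap shrinks at the rate $1/n$, which is the scaling that the stability bounds used in this section rely on.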

Under our working assumptions, SGD with properly chosen step sizes is guaranteed to converge to the global minimum of (3) (see, e.g., Rakhlin et al. (2011)). Note that the bounds for the approximate minimizers only involve an extra additive term representing the optimization error, and this term will be negligible if we run SGD until convergence since our focus is sample complexity. Thus, we conduct the analysis for the global minimizer of (3). The performance of PureLocalTraining is given by the following theorem.

Theorem 4 (Performance of PureLocalTraining) Let Assumption A(b) hold and assume $n_i \ge 4\beta/\mu$ for all $i \in [m]$. Then the algorithm $\mathcal{A}_{\mathrm{PLT}}$, which outputs the minimizer of (3), satisfies

$$\mathbb{E}_S\big[\mathrm{IER}_i(\mathcal{A}_{\mathrm{PLT}})\big] \lesssim \frac{\beta \|\ell\|_\infty}{\mu n_i}$$

for all $i = 1, \ldots, m$.

Proof The proof is a direct consequence of standard results on uniform stability of strongly convex ERM (see, e.g., Section 5 of Shalev-Shwartz et al. (2009) and Section 13 of Shalev-Shwartz and Ben-David (2014)), which assert that under the current assumptions, the minimizer of (3) is $\mathcal{O}\big(\frac{\beta \|\ell\|_\infty}{\mu n_i}\big)$-uniformly stable. We omit the details. ■

By definition, for any weight vector $p$, the $\mathrm{AER}_p$ of PureLocalTraining also admits the same upper bound as in Theorem 4.

3.1.2. Federated Stability and Analysis of FedAvg

We consider the following weighted version of (2):

$$\min_{w \in \mathcal{W}} \; \sum_{i \in [m]} p_i L_i(w, S_i). \tag{8}$$

The FedAvg algorithm (Algorithm 1) also seamlessly generalizes. The above optimization formulation is in fact covered by the general theory of Li et al. (2020b), where they showed that FedAvg is guaranteed to converge to the global optimum under a suitable hyperparameter choice, even in the presence of heterogeneity (but the convergence is slower). Thus, in the following discussion, we again consider the global minimizer of (8).

It turns out that a tight analysis of FedAvg requires a more fine-grained notion of uniform stability, which we present below.

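Definition 5 (Federated stability) An algorithm $\mathcal{A}$ that outputs $\mathcal{A}(S) = \{\widehat{w}^{(i)}(S)\}$ has federated stability $\{\gamma_i\}_{i=1}^m$ if for every $S \sim \bigotimes_{i} \mathcal{D}_i^{\otimes n_i}$ and for any $i \in [m]$, $j_i \in [n_i]$, $z_{i,j_i}' \in \mathcal{Z}$, we have

$$\big\|\ell(\widehat{w}^{(i)}(S), \cdot) - \ell(\widehat{w}^{(i)}(S^{\setminus (i, j_i)}), \cdot)\big\|_\infty \le \gamma_i.$$

Above, $S^{\setminus (i, j_i)}$ is the dataset formed by replacing $z_{j_i}^{(i)}$ in the $i$-th dataset with $z_{i,j_i}'$:

$$S^{\setminus (i, j_i)} = \{S_1, \ldots, S_{i-1}, S_i^{\setminus j_i}, S_{i+1}, \ldots, S_m\},$$
$$S_i^{\setminus j_i} = \{z_1^{(i)}, \ldots, z_{j_i - 1}^{(i)}, z_{i,j_i}', z_{j_i + 1}^{(i)}, \ldots, z_{n_i}^{(i)}\}.$$

Compared to the conventional uniform stability in Definition 3, federated stability provides a finer control by allowing distinct stability measures $\gamma_i$ for different clients. Moreover, the classical statement that “stable algorithms do not overfit” still holds, in the sense that the average (resp. individualized) generalization error can be upper bounded by $\mathcal{O}\big(\sum_{i \in [m]} n_i \gamma_i / N\big)$ (resp. $\mathcal{O}(\gamma_i)$), plus a term scaling with the level of heterogeneity $R$. This again enables us to separate the analysis of $\mathcal{A}$ into two parts (namely bounding the optimization error and bounding the stability), as is the case with the conventional uniform stability.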

The notion of federated stability has other implications when restricted to the FedProx algorithm, and we refer the readers to Section 3.4 for details.

We are now ready to state the theorem that characterizes the performance of FedAvg.

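Theorem 6 (Performance of FedAvg) Let Assumption A(b, c) hold and assume $n_i \ge 4\beta p_i/\mu$ for all $i \in [m]$. Suppose the FedAvg algorithm $\mathcal{A}_{\mathrm{FA}}$ outputs the minimizer of (8). Then under Assumption B(a), we have

$$\mathbb{E}_S\big[\mathrm{AER}_p(\mathcal{A}_{\mathrm{FA}})\big] \lesssim \frac{\beta \|\ell\|_\infty}{\mu} \sum_{i \in [m]} \frac{p_i^2}{n_i} + \beta R^2, \tag{9}$$

and under Assumption B(b), we have

$$\mathbb{E}_S\big[\mathrm{IER}_i(\mathcal{A}_{\mathrm{FA}})\big] \lesssim \frac{\beta \sigma^2}{\mu^2} \sum_{i' \in [m]} \frac{p_{i'}^2}{n_{i'}} + \frac{\beta^3}{\mu^2} R^2. \tag{10}$$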

Proof The proof of (9) is, roughly speaking, based on the fact that the global minimizer of (8) has federated stability $\gamma_i \lesssim \frac{\beta \|\ell\|_\infty p_i}{\mu n_i}$, and thus the first term on the right-hand side of (9) corresponds to the average federated stability $\sum_{i \in [m]} p_i \gamma_i$. The second term $\beta R^2$ on the right-hand side of (9) reflects the presence of heterogeneity. For Equation (10), we were not able to obtain a federated stability based proof, and our current proof is based on an adaptation of the arguments in Theorem 7 of Foster et al. (2019), which explains why the dependence on $(\sigma, \beta, \mu)$ is different (and slightly worse) compared to Equation (9). In particular, the bound (10) has inverse quadratic dependence on $\mu$, whereas the bound (9) only has $1/\mu$ dependence. The $1/\mu$ dependence comes from the fact that the federated stability term has such dependence, and the $1/\mu^2$ dependence comes from the fact that the $\ell_2$ estimation error has such dependence. We refer the readers to Appendix C.1 for details. ■

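Note that both bounds in the above theorem are minimized by choosing $p_i = n_i/N$. With this choice of $p$, the two bounds read

$$\mathbb{E}_S\big[\mathrm{AER}_p(\mathcal{A}_{\mathrm{FA}})\big] \lesssim \frac{\beta \|\ell\|_\infty}{\mu N} + \beta R^2, \qquad \mathbb{E}_S\big[\mathrm{IER}_i(\mathcal{A}_{\mathrm{FA}})\big] \lesssim \frac{\beta \sigma^2}{\mu^2 N} + \frac{\beta^3 R^2}{\mu^2}. \tag{11}$$

This makes sense, since this choice of weight corresponds to the ERM objective under complete homogeneity. This observation also suggests that ensuring “fair resource allocation” (i.e., setting $p_i = 1/m$) can lead to statistical inefficiency, especially when the sample sizes are imbalanced.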
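The effect of the weight vector on the sample-size term $\sum_{i} p_i^2 / n_i$ in Theorem 6 can be checked numerically. The sketch below, with hypothetical imbalanced sample sizes, compares the choice $p_i = n_i/N$, for which the term equals exactly $1/N$, against the uniform choice $p_i = 1/m$.

```python
def sample_size_term(p, n):
    """The factor sum_i p_i^2 / n_i that multiplies the first term of (9) and (10)."""
    return sum(pi ** 2 / ni for pi, ni in zip(p, n))

# Hypothetical, deliberately imbalanced sample sizes.
n = [5, 20, 75]
N = sum(n)

p_size = [ni / N for ni in n]          # p_i = n_i / N
p_uniform = [1 / len(n)] * len(n)      # p_i = 1 / m

term_size = sample_size_term(p_size, n)
term_uniform = sample_size_term(p_uniform, n)
```

With these sample sizes the uniform weighting inflates the term by roughly a factor of three, because the client with only five samples receives the same weight as the client with seventy-five.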

We conclude this subsection by noting that though the compactness assumption (Assumption A(a)) is not needed in Theorem 6, it is usually needed in the analysis of the optimization error of FedAvg and PureLocalTraining (see, e.g., Rakhlin et al. (2011); Li et al. (2020b)).

3.2. Lower Bounds

In this subsection, we present our construction of lower bounds, which characterize the information-theoretic limit of personalized federated learning. Throughout this section, we restrict our attention to the case where $p_i = n_i/N$ for any $i \in [m]$.

Our construction starts by considering a special class of problem instances: logistic regression. In logistic regression, given the collection of regression coefficients $\{w_\star^{(i)}\} \subseteq \mathcal{W}$ where $\mathcal{W}$ has a diameter $D$, the data distributions $\mathcal{D}_i$’s are supported on $\mathbb{R}^d \times \{\pm 1\}$ and specified by a two-step procedure as follows:

  1. Generate a feature vector $x$, whose coordinates are i.i.d. copies from some distribution $P_X$ on $\mathbb{R}$, which is assumed to have mean zero and is almost surely bounded by some absolute constant $c_X$;

  2. Generate the binary label $y \in \{\pm 1\}$, which is a biased Rademacher random variable with head probability $\big(1 + \exp(-x^\top w_\star^{(i)})\big)^{-1}$.

The loss function is naturally chosen to be the negative log-likelihood function, which takes the following form:

$$\ell(w, z) = \ell(w, x, y) = \log\big(1 + e^{-y x^\top w}\big).$$
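The two-step data-generating procedure and the negative log-likelihood loss can be sketched as follows. The feature distribution is taken to be uniform on $[-c_X, c_X]$, which is one mean-zero bounded choice made here for illustration; the function names, the seed, and the example coefficients are likewise hypothetical.

```python
import math
import random

def sample_logistic(w_star, n, c_x=1.0, seed=0):
    """Draw n pairs (x, y) from the logistic model with coefficient vector w_star."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Step 1: i.i.d. mean-zero coordinates bounded by c_x (uniform is an assumption).
        x = [rng.uniform(-c_x, c_x) for _ in w_star]
        # Step 2: label +1 with probability 1 / (1 + exp(-<x, w_star>)).
        margin = sum(a * b for a, b in zip(x, w_star))
        prob_plus = 1.0 / (1.0 + math.exp(-margin))
        y = 1 if rng.random() < prob_plus else -1
        data.append((x, y))
    return data

def logistic_loss(w, x, y):
    """Negative log-likelihood log(1 + exp(-y * <x, w>))."""
    return math.log1p(math.exp(-y * sum(a * b for a, b in zip(x, w))))

data = sample_logistic([1.0, -2.0], n=50)
```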

The following lemma says that Assumption A holds for the aforementioned logistic regression models.

Lemma 7 (Logistic regressions are valid problem instances) The logistic regression problem described above is a class of problem instances that satisfies Assumption A with $\|\ell\|_\infty = c_X D \sqrt{d}$ and $\sigma^2 = \beta = c_X^2 d / 4$. Moreover, if $m \lesssim (N/m)^c$ for some $c \ge 0$ and $N/m \ge C d$ for some $C > 1$, then there exists some event $\mathcal{E}$ which only depends on the features $\{x_j^{(i)} : i \in [m], j \in [n_i]\}$ and happens with probability at least $1 - e^{-\mathcal{O}(N/m)}$, such that on this event, the strong convexity constant in Assumption A satisfies

$$\mu \asymp \mu_0 = \Big(\exp\big(c_X D \sqrt{d}/2\big) + \exp\big(-c_X D \sqrt{d}/2\big)\Big)^{-2}. \tag{12}$$

Proof The compactness of the domain and the boundedness of the loss function hold by construction. To verify the remaining parts of Assumption A, with some algebra one finds that

$$\nabla^2 \ell(w, x, y) = x x^\top \cdot \frac{\exp(y x^\top w)}{\big(1 + \exp(y x^\top w)\big)^2} \preceq \frac{1}{4} x x^\top, \tag{13}$$

where $\preceq$ is the Loewner order and the inequality holds because $x/(1+x)^2 = 1/(x^{-1/2} + x^{1/2})^2 \le 1/4$ for $x > 0$. Since the population gradient has mean zero at optimum, the gradient variance at optimum can be upper bounded by the trace of the expected Hessian matrix, which, by the above display, is further upper bounded by $c_X^2 d / 4$. Thus, we can take $\sigma^2 = c_X^2 d / 4$ in Part (c). Another message of the above display is that we can set the smoothness constant in Part (b) to be $\beta = c_X^2 d / 4$.

The only subtlety that remains is to ensure each local loss function is $\mu$-strongly convex. Note that since $x/(1+x)^2$ is increasing on $(0, 1)$ and decreasing on $(1, \infty)$, its minimum over a bounded range of margins is attained at the endpoints, so the Hessian in (13) dominates $\mu_0 x x^\top$ in Loewner order, where $\mu_0$ is the right-hand side of (12). Thus, the local population losses $\mathbb{E}_{(x,y) \sim \mathcal{D}_i}[\ell(\cdot, x, y)]$ are all $\mu_0$-strongly convex.

Now, note that

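$$\nabla^2 L_i(w^{(i)}, S_i) = \frac{1}{n_i} \sum_{j \in [n_i]} x_j^{(i)} x_j^{(i)\top} \cdot \frac{\exp\big(y_j^{(i)} \langle x_j^{(i)}, w^{(i)} \rangle\big)}{\Big(1 + \exp\big(y_j^{(i)} \langle x_j^{(i)}, w^{(i)} \rangle\big)\Big)^2} \succeq \mu_0 \cdot \frac{1}{n_i} \sum_{j \in [n_i]} x_j^{(i)} x_j^{(i)\top}.$$

Invoking Theorem 5.39 of Vershynin (2010) along with a union bound over all clients, we conclude that for any $i \in [m]$, the minimum eigenvalue of $\sum_{j \in [n_i]} x_j^{(i)} x_j^{(i)\top}$ is lower bounded by a constant multiple of $n_i - \sqrt{d\, n_i}$ (this is the definition of the event $\mathcal{E}$) with probability at least $1 - m e^{-\mathcal{O}(n_i)} \ge 1 - e^{-\mathcal{O}(N/m)}$, and the proof is concluded. ■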
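The scalar facts behind the proof of Lemma 7 can be checked numerically. The sketch below evaluates the Hessian weight $e^t/(1+e^t)^2$ in its symmetric form and confirms that, over margins $|t| \le T$, it is at most $1/4$ and at least its value at the endpoints. The margin bound `T` is a free parameter here; taking $T = c_X D \sqrt{d}$ for the logistic model is an assumption based on bounding $|x^\top w|$ by Cauchy-Schwarz.

```python
import math

def hessian_weight(t):
    """exp(t) / (1 + exp(t))^2, written in the symmetric form 1 / (e^{t/2} + e^{-t/2})^2."""
    return 1.0 / (math.exp(t / 2) + math.exp(-t / 2)) ** 2

def curvature_floor(margin_bound):
    """Smallest Hessian weight over margins |t| <= margin_bound (attained at the ends)."""
    return hessian_weight(margin_bound)

T = 3.0                                                # hypothetical margin bound
grid = [-T + 2 * T * k / 200 for k in range(201)]      # margins in [-T, T]
weights = [hessian_weight(t) for t in grid]
```

The floor decays exponentially in the margin bound, which is why the strong convexity constant is treated as being of constant order only when the margin bound is.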

Note that in the proof of the above lemma, we have established the $\mu_0 \asymp \mu$-strong convexity of the client-wise population losses. Hence, lower bounding the excess risks reduces to lower bounding the $\ell_2$ estimation errors $\|\widehat{w}^{(i)} - w_\star^{(i)}\|^2$ of the estimators $\widehat{w}^{(i)}$ for $w_\star^{(i)}$. Such a reduction allows us to use powerful tools from information theory.

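To this end, we introduce two parameter spaces, corresponding to Part (a) and (b) of Assumption B. Recalling $w_p^{(\mathrm{global})} = \sum_{i \in [m]} p_i w_\star^{(i)}$, we define

$$\mathcal{P}_1 := \Big\{ \{w_\star^{(i)}\}_{i=1}^m \subseteq \mathcal{W} : \sum_{i \in [m]} p_i \|w_\star^{(i)} - w_p^{(\mathrm{global})}\|^2 \le R^2 \Big\},$$
$$\mathcal{P}_2 := \Big\{ \{w_\star^{(i)}\}_{i=1}^m \subseteq \mathcal{W} : \|w_\star^{(i)} - w_p^{(\mathrm{global})}\|^2 \le R^2 \;\; \forall i \in [m] \Big\}.$$

Note that $\mathcal{P}_1$ and $\mathcal{P}_2$ index all possible values of $\{w_\star^{(i)}\}$ that can arise in the logistic regression models under Assumption B (a) and (b), respectively.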

With the notations introduced so far, we are ready to state the main result of this subsection.

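Theorem 8 (Minimax lower bounds for estimation errors) Consider the logistic regression model described above. Suppose $n_i \asymp n_{i'}$ for any $i \ne i' \in [m]$ and assume $p_i = n_i/N$ for any $i \in [m]$. Then we have

$$\inf_{\{\widehat{w}^{(i)}\}} \sup_{\{w_\star^{(i)}\} \in \mathcal{P}_1} \frac{1}{N} \sum_{i \in [m]} n_i \, \mathbb{E}_S \|\widehat{w}^{(i)} - w_\star^{(i)}\|^2 \gtrsim \frac{d}{N/m} \wedge R^2 + \frac{d}{N}, \tag{14}$$
$$\inf_{\{\widehat{w}^{(i)}\}} \sup_{\{w_\star^{(i)}\} \in \mathcal{P}_2} \mathbb{E}_S \|\widehat{w}^{(i)} - w_\star^{(i)}\|^2 \gtrsim \frac{d}{n_i} \wedge R^2 + \frac{d}{N} \tag{15}$$

for all $i \in [m]$, where the infimum is taken over all possible $\widehat{w}^{(i)}$’s that are measurable functions of the data $S$.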

Proof See Appendix A. ■

Note that both lower bounds in Theorem 8 are a superposition of two terms, and they correspond to two distinct steps in the proof.

The first step in our proof is to argue that the lower bound under complete homogeneity is in fact a valid lower bound under our working assumptions, which gives the Ω(d/N) term. This is reasonable, since estimation under complete homogeneity is, in many senses, an “easier” problem. The proof of the Ω(d/N) term is based on the classical Assouad’s method (Assouad, 1983).

The second step is to use a generalized version of Assouad’s method that allows us to deal with multiple heterogeneous datasets. In particular, we need to carefully choose the prior distributions over the parameter space based on the level of heterogeneity, which ultimately leads to the $\Omega\big(\frac{d}{N/m} \wedge R^2\big)$ term. Recall that in the vanilla version of Assouad’s method where there is only one parameter, say $w_\star$, one can lower bound the minimax risk by the Bayes risk, and the prior distribution is usually chosen to be $w_\star = \delta v$, where $v$ follows a uniform distribution over all $d$-dimensional binary vectors and $\delta$ is chosen so that the resulting hypothesis testing problem has large type-I plus type-II error. In our case where there are $m$ parameters $\{w_\star^{(i)}\}$, we need to consider a different prior of the following form:

$$w_\star^{(i)} = \delta_i v^{(i)},$$

where the $v^{(i)}$ are i.i.d. samples from the uniform distribution over all $d$-dimensional binary vectors, and the $\delta_i$’s are scalars that need to be carefully chosen to make the resulting hypothesis testing problem hard.

The following result is an immediate corollary of Theorem 8.

Corollary 9 (Minimax lower bounds for excess errors) Assume there exist constants $C, C' > 0$, $c \ge 0$ such that $n_i \ge C\beta$ for all $i \in [m]$ and $m \le C' (N/m)^c$. Moreover, assume $n_i \asymp n_{i'}$ for any $i \ne i' \in [m]$ and $p_i = n_i/N$ for any $i \in [m]$. Then there exists an absolute constant $c'$ such that the following two statements hold:

1. There exists a problem instance such that Assumptions A and B(a) are satisfied with probability at least $1 - e^{-c' N/m}$. Call this high probability event $\mathcal{E}$. On this problem instance, any randomized algorithm $\mathcal{A}$ must suffer

$$\mathbb{E}_{\mathcal{A}, S}\big[\mathrm{AER}_p(\mathcal{A}) \cdot \mathbf{1}_{\mathcal{E}}\big] \gtrsim \mu \Big( \frac{\beta}{N/m} \wedge R^2 + \frac{\beta}{N} \Big); \tag{16}$$

2. For any $i \in [m]$, there exists a problem instance such that Assumptions A and B(b) are satisfied with probability at least $1 - e^{-c' N/m}$. Call this high probability event $\mathcal{E}_i$. On this problem instance, any randomized algorithm $\mathcal{A}$ must suffer

$$\mathbb{E}_{\mathcal{A}, S}\big[\mathrm{IER}_i(\mathcal{A}) \cdot \mathbf{1}_{\mathcal{E}_i}\big] \gtrsim \mu \Big( \frac{\beta}{n_i} \wedge R^2 + \frac{\beta}{N} \Big). \tag{17}$$

In the two displays above, the expectation is taken over the randomness in both the algorithm $\mathcal{A}$ and the sample $S$.

Proof Along with Lemma 7 and Theorem 8, this corollary follows by the fact that the smoothness constant $\beta$ is of the same order as $d$ and the population losses are all $\mu_0 \asymp \mu$-strongly convex. ■

3.3. Implications of the Main Results

The upper bounds in Section 3.1 and the lower bounds in Section 3.2 together reveal several intriguing phenomena regarding personalized FL, which we detail in this subsection.

Focusing on the dependence on the sample sizes and assuming the client-wise sample sizes are balanced (i.e., $n_i \asymp N/m$), the heterogeneity measure $R$ enters the lower and upper bounds in a dichotomous fashion:

  • If $R^2 \lesssim m/N$, then both lower bounds become $\Omega(R^2 + 1/N)$, and this lower bound can be attained by FedAvg up to factors that do not depend on the sample sizes;

  • If $R^2 \gtrsim m/N$, then both lower bounds become $\Omega(m/N)$. They agree with the minimax rate as if we were under complete heterogeneity and can be achieved by PureLocalTraining.

Now, let us consider the following naïve dichotomous strategy: if $R^2 \le \frac{\|\ell\|_\infty}{\mu} \cdot \frac{m}{N}$, then output $\mathcal{A} = \mathcal{A}_{\mathrm{FA}}$; otherwise, output $\mathcal{A} = \mathcal{A}_{\mathrm{PLT}}$. That is, we switch between the two baseline algorithms at the threshold of $R^2 \asymp m/N$. Then under the assumptions in Theorems 4 and 6, one can readily check that this dichotomous strategy satisfies the following AER guarantee:

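$$\mathbb{E}_S\big[\mathrm{AER}_p(\mathcal{A})\big] \lesssim \beta\Big(\frac{\|\ell\|_\infty}{\mu N/m} \wedge R^2\Big) + \frac{\beta\|\ell\|_\infty}{\mu}\sum_{i\in[m]}\frac{p_i^2}{n_i}. \tag{18}$$

If in addition, $n_i \asymp n_{i'}$ for any $i \ne i' \in [m]$, then it also satisfies the following IER guarantee:

$$\mathbb{E}_S\big[\mathrm{IER}_i(\mathcal{A})\big] \lesssim \frac{\beta^3}{\mu^2}\Big(\frac{\|\ell\|_\infty}{\mu n_i} \wedge R^2\Big) + \frac{\beta\sigma^2}{\mu^2}\sum_{i\in[m]}\frac{p_i^2}{n_i}. \tag{19}$$

When $p_i = n_i/N$, the two displays above simplify to

$$\mathbb{E}_S\big[\mathrm{AER}_p(\mathcal{A})\big] \lesssim \beta\Big(\frac{\|\ell\|_\infty}{\mu N/m} \wedge R^2\Big) + \frac{\beta\|\ell\|_\infty}{\mu N}, \qquad \mathbb{E}_S\big[\mathrm{IER}_i(\mathcal{A})\big] \lesssim \frac{\beta^3}{\mu^2}\Big(\frac{\|\ell\|_\infty}{\mu n_i} \wedge R^2\Big) + \frac{\beta\sigma^2}{\mu^2 N},$$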

which match the lower bounds in Corollary 9 up to constant factors, provided $\beta, \mu, \|\ell\|_\infty, \sigma$ are all of constant order. In other words, switching between the two algorithms at the threshold of $R^2 \asymp m/N$ gives an oracle algorithm that is minimax rate optimal.

Thus, we have shown an interesting property for personalized FL on the choice of the two baseline algorithms. In particular, consider a collection of problem instances indexed by $(R, \beta, \mu, \|\ell\|_\infty, \sigma)$ using Assumptions A and B and assume $\beta, \mu, \|\ell\|_\infty, \sigma$ are all of constant order. Now, for a fixed value of $R$, exactly one of these two algorithms is minimax optimal, where the optimality is defined over the specified collection of problem instances and with respect to both AER and IER. Moreover, the oracle dichotomous strategy that switches between the two baseline algorithms at the threshold of $R^2 \asymp m/N$ is minimax optimal.
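The switching rule can be sketched in a few lines. This is only an illustration under simplifying assumptions: all problem-dependent constants are set to one, clients are balanced, and the function names are hypothetical.

```python
def minimax_rate(R2, m, N):
    """Rate min(m/N, R^2) + 1/N, with all constants dropped (balanced clients)."""
    return min(m / N, R2) + 1.0 / N

def baseline_rates(R2, m, N):
    """Upper-bound rates of the two baselines, with all constants dropped."""
    return {"FedAvg": R2 + 1.0 / N, "PureLocalTraining": m / N}

def oracle_choice(R2, m, N):
    """Oracle dichotomous strategy: switch at the threshold R^2 = m/N."""
    return "FedAvg" if R2 <= m / N else "PureLocalTraining"
```

For example, with `m = 10` and `N = 1000` the threshold is `0.01`, so `oracle_choice(0.001, 10, 1000)` selects FedAvg while `oracle_choice(1.0, 10, 1000)` selects PureLocalTraining; in both cases the selected baseline’s rate is within a constant factor of `minimax_rate`.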

More implications of the theoretical results are described below.

Optimality of a dichotomous strategy.

From the practical side, for supervised learning problems, such a dichotomous strategy can be implemented without prior knowledge of R if test errors can be evaluated in a distributed fashion. Indeed, we can first run both FedAvg and PureLocalTraining separately, evaluate their test errors (in a distributed fashion), and deploy the one with a lower test error. Due to the upper and lower bounds proved in Sections 3.1 and 3.2, such a strategy is guaranteed to be minimax rate optimal. As a caveat, however, one should refrain from interpreting our results as saying either of the two baseline algorithms is sufficient for practical problems. From a practical viewpoint, constants that are omitted in the minimax analysis are crucial. Even for supervised problems, a better personalization result could be achieved by more sophisticated algorithms in practice. Nevertheless, our results suggest that the two baseline algorithms can at least serve as a good starting point in the search for efficient personalized algorithms.
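The selection step described above can be sketched as follows, assuming each client has already reported the test error of both trained models on its own held-out data. The function name, the weighting by client size, and the numbers in the usage note are hypothetical.

```python
import numpy as np

def select_by_test_error(errors_fedavg, errors_local, weights):
    """Pick the baseline with the smaller weighted average of per-client test errors."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    avg_fa = float(w @ np.asarray(errors_fedavg, dtype=float))
    avg_plt = float(w @ np.asarray(errors_local, dtype=float))
    name = "FedAvg" if avg_fa <= avg_plt else "PureLocalTraining"
    return name, avg_fa, avg_plt
```

Calling `select_by_test_error([0.10, 0.12, 0.11], [0.20, 0.18, 0.25], [100, 100, 100])` would deploy FedAvg, since its average error is the smaller of the two. Only scalar error summaries leave each client, so the comparison itself requires no sharing of raw data.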

For unsupervised problems where the quality of a model is hard to evaluate, implementing the dichotomous strategy requires estimating an upper bound R of the level of heterogeneity. This is an important open problem, which we leave for future work.

Optimality of FedAvg followed by local fine tuning.

Another popular baseline algorithm for personalized FL is to first run FedAvg until convergence, and then let each client run PureLocalTraining to fine tune the model. In strongly convex problems, the global optimum can be reached by gradient descent regardless of the initialization with a suitable choice of the learning rate (see, e.g., Theorem 2.1.15 of Nesterov 2018). Thus, if each client runs PureLocalTraining for long enough, the global optimum of its local loss function will eventually be reached. This fact tells us that along the whole fine tuning trajectory, there is a point at which the model gives the worst-case optimal AER and IER, and for a fixed level of heterogeneity, this point is either at the very beginning (which is FedAvg), or at the very end (which is PureLocalTraining). Although this conclusion is almost trivial from a technical point of view given our minimax results, it provides a reassuring theoretical property (of being minimax optimal) for a popular method used by practitioners.
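The trajectory argument can be illustrated on a toy problem. The sketch below assumes a hypothetical quadratic local loss and uses the zero vector as a stand-in for the FedAvg output; it is not the simulation reported in the paper.

```python
import numpy as np

def fine_tune(w_init, grad, lr, steps):
    """Plain gradient descent; returns the whole trajectory, starting at w_init."""
    path = [np.asarray(w_init, dtype=float)]
    for _ in range(steps):
        path.append(path[-1] - lr * grad(path[-1]))
    return np.stack(path)

# Hypothetical local loss 0.5 * (w - w_loc)^T A (w - w_loc) with A positive definite.
A = np.diag([1.0, 4.0])            # strong convexity 1, smoothness 4
w_loc = np.array([2.0, -1.0])      # minimizer of the local loss
w_global = np.zeros(2)             # stand-in for the FedAvg output
path = fine_tune(w_global, lambda w: A @ (w - w_loc), lr=0.25, steps=200)
dist = np.linalg.norm(path - w_loc, axis=1)   # distance to the local optimum
```

The first row of `path` is the global model and the last row is numerically the local minimizer, so the trajectory passes from one baseline to the other; stopping early yields intermediate models.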

Illustrating the minimaxity in a simulated example.

We conduct a simulation on federated logistic regression to corroborate our theoretical results and the optimality of the FedAvg followed by local fine tuning strategy. In the simulation, we set $m = 5$ and $n_i = 100$ for all $i \in [m]$, and we vary $R$ from 0 to 20 (see Appendix D for details). In the left panel of Figure 1, we plot the test accuracy (averaged over 100 rounds of simulations) of those three methods against the value of $R$. One can see that the accuracy of the fine tuning strategy roughly follows the maximum of the accuracies achieved by FedAvg and PureLocalTraining, confirming our theoretical prediction that the fine tuning strategy can indeed perform as well as the better of FedAvg and PureLocalTraining.

Figure 1:


Average classification accuracy of FedAvg, PureLocalTraining and FedAvg followed by fine tuning (left panel) as well as FedProx with different choices of λ (right panel).

Beyond the current heterogeneity assumption.

Our minimax results are established under Assumption B, which states that all optimal local models are close to a certain “centroid” (i.e., the average global model defined in (7)). If we draw a graph of clients and connect two clients if their optimal local models are similar, then the current heterogeneity assumption gives rise to a complete graph (or a star-shaped graph if we introduce another node to represent the average global model). While such an idealized graph structure enables a clean theoretical analysis, the real-world proximity patterns among clients are clearly far more sophisticated. In fact, counterexamples exist under which the minimax results do not hold in a “global” sense.

Suppose all $m$ clients exhibit a clustering structure as follows. We have $\sqrt{m}$ clients whose optimal global models serve as cluster centroids, and those centroids are very far apart. In the neighborhood of each centroid, there are $\sqrt{m}$ clients whose optimal local models are $R_c$ away from the centroid in the sense of Assumption B. Additionally, assume each client has equal sample sizes, so that $n_i = n$ for some $n$. Under this setting, the “global” heterogeneity parameter $R$ in Assumption B is very large, so our theory would suggest choosing PureLocalTraining, which gives a rate of $\mathcal{O}(1/n)$. However, this rate is clearly suboptimal. If one can successfully cluster the $m$ clients into $\sqrt{m}$ clusters (which is hopeful as the centroids are assumed to be far apart), then one can apply our theory to each cluster (i.e., run the dichotomous strategy for each cluster) and conclude that the rate for each cluster is $\mathcal{O}\big(\frac{1}{n\sqrt{m}} + \frac{1}{n} \wedge R_c^2\big) \ll \frac{1}{n}$ if $m$ diverges to infinity and $R_c \ll 1/\sqrt{n}$. The foregoing discussion reveals that in such a clustered setting, our theoretical results only make sense at the cluster level, but not at the global level.
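The cluster-level rate can be checked by a short calculation. The display below is a sketch with all constants dropped, under the reading that each cluster contains $m_c = \sqrt{m}$ clients with $n$ samples each, so that the cluster holds $N_c = n\sqrt{m}$ samples in total; the symbols $m_c$ and $N_c$ are introduced here only for illustration.

```latex
\[
\underbrace{\frac{1}{N_c/m_c} \wedge R_c^2 + \frac{1}{N_c}}_{\text{minimax rate within one cluster}}
\;=\; \frac{1}{n} \wedge R_c^2 + \frac{1}{n\sqrt{m}}
\;\le\; R_c^2 + \frac{1}{n\sqrt{m}} .
\]
% Both terms on the right are o(1/n) as soon as R_c^2 = o(1/n) and m -> infinity,
% whereas PureLocalTraining applied to all m clients only gives the rate 1/n.
```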

The behaviors of the minimax rate for more general client proximity graphs can be even more complicated, which we leave for future work.

3.4. More Implications of Federated Stability and Analysis of FedProx

In this subsection, we are concerned with the performance guarantees for FedProx. As with our earlier analysis of FedAvg, we consider a p-weighted version of FedProx, whose optimization formulation is given below:

$$\min_{w^{(\mathrm{global})} \in \mathcal{W},\ \{w^{(i)}\}_{i=1}^m \subseteq \mathcal{W}}\ \sum_{i\in[m]} p_i\Big(L_i\big(w^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|w^{(\mathrm{global})} - w^{(i)}\big\|^2\Big), \tag{20}$$

where we recall that $L_i(w, S_i) := \sum_{j\in[n_i]} \ell\big(w, z_j^{(i)}\big)/n_i$ is the ERM objective for the $i$-th client. In this subsection, we let $\big(\tilde w^{(\mathrm{global})}, \{\tilde w^{(i)}\}\big)$ be the global minimizer of the above problem. Compared to (8), which imposes a “hard” constraint $w^{(i)} = w^{(\mathrm{global})}$, and compared to (3), where there is no constraint at all, the above formulation imposes a “soft” constraint that the norm of $w^{(\mathrm{global})} - w^{(i)}$ should be small, with a hyperparameter $\lambda$ controlling the strength of this constraint.

The rationale behind the optimization formulation (20) of FedProx is clear: by setting $\lambda = 0$, the optimization formulation of PureLocalTraining (3) is recovered, and as $\lambda \to \infty$, the optimization formulation of FedAvg (8) is recovered. The hope is that by varying $\lambda \in (0, \infty)$, one can interpolate between the two extremes.
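The interpolation can be seen in closed form on a toy instance. The sketch below assumes the illustrative local losses $L_i(w) = \frac{1}{2}\|w - a_i\|^2$ with no constraint set, for which the joint minimizer of (20) follows from the first-order conditions; the function name and the vectors $a_i$ are hypothetical.

```python
import numpy as np

def fedprox_quadratic(a, p, lam):
    """Exact minimizer of (20) when L_i(w) = 0.5 * ||w - a_i||^2 (illustrative choice)."""
    a = np.asarray(a, dtype=float)
    p = np.asarray(p, dtype=float)
    # Stationarity in w_global gives the p-weighted mean of the a_i for any lam > 0.
    # For lam = 0 every w_global is optimal; the mean is returned as one valid choice.
    w_global = p @ a
    # Stationarity in w_i: (w_i - a_i) + lam * (w_i - w_global) = 0.
    w_local = (a + lam * w_global) / (1.0 + lam)
    return w_global, w_local
```

With `lam = 0` each local model equals its own `a_i` (pure local training), and as `lam` grows every local model is pulled to the common weighted mean (the FedAvg-type solution); intermediate values give convex combinations of the two.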

Applying the idea of local SGD to (20), one obtains the FedProx algorithm1, which we detail in Algorithm 2. We separate the whole algorithm into two stages as they have distinct interpretations: in Stage I, the central server aims to learn a good global model with the help of local clients, whereas in Stage II, each local client takes advantage of the global model to personalize. Alternatively, one can also interpret FedProx as an instance of the general framework of model-agnostic meta learning (Finn et al., 2017), where Stage I learns a good initialization, and Stage II trains the local models starting from this initialization.

In contrast to our analyses for FedAvg and PureLocalTraining in Section 3.1, where we largely focused on global minimizers, the analysis for FedProx will be carried out for the approximate minimizer output by Algorithm 2. The reason for this is rooted in the tradeoff between the optimization error and the generalization error. Note that given the results derived in Section 3.1, the analysis for the global minimizer $\big(\tilde w^{(\mathrm{global})}, \{\tilde w^{(i)}\}\big)$ becomes trivial: by setting $\lambda = 0$, we reduce the task to analyzing PureLocalTraining; by sending $\lambda \to \infty$, we reduce the task to analyzing FedAvg. Based on Theorems 4 and 6, one immediately concludes that there exists a choice of $\lambda$ such that the AER and IER of $\{\tilde w^{(i)}\}$ satisfy the bounds in (18) and (19), respectively. However, the foregoing discussion is purely restricted to the generalization error. When we set $\lambda = 0$ or send $\lambda \to \infty$, it is not known a priori whether the FedProx algorithm will converge to the global minimum. Worse still, the optimization error may depend on $\lambda$ in a particular way so that it becomes unbounded when $\lambda$ approaches zero or infinity. To the best of our knowledge, prior work only proved the optimization convergence of FedProx for the global model with a fixed value of $\lambda$, namely the convergence of $w_T^{(\mathrm{global})}$ to $\tilde w^{(\mathrm{global})}$ as the number of global communication rounds $T$ tends to infinity (Li et al., 2018; Dinh et al., 2020). To have a theoretical understanding of the performance of FedProx, it is crucial to (1) establish the optimization convergence for both global and local models; (2) bound the generalization error; and (3) balance the optimization error and the generalization error, both of which are functions of $\lambda$. In the following, we execute those steps with the aid of federated stability.


Implications of federated stability for FedProx.

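We have briefly mentioned the main implications of federated stability in Section 3.1.2: for an algorithm $\mathcal{A} = \{\hat w^{(i)}\}$ with federated stability $\{\gamma_i\}$, its average generalization error (resp. individualized generalization error) can be upper bounded by $\mathcal{O}\big(\sum_{i\in[m]} p_i\gamma_i\big)$ (resp. $\mathcal{O}(\gamma_i)$), plus a term scaling with the level of heterogeneity $R$. We make such a statement precise here. Let us first define the optimization error of a generic algorithm $\mathcal{A} = \big(\hat w^{(\mathrm{global})}, \{\hat w^{(i)}\}\big)$ (which tries to solve (20)) as

$$\mathrm{OPT} := \sum_{i\in[m]} p_i\Big(L_i\big(\hat w^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|\hat w^{(\mathrm{global})} - \hat w^{(i)}\big\|^2\Big) - \sum_{i\in[m]} p_i\Big(L_i\big(\tilde w^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|\tilde w^{(\mathrm{global})} - \tilde w^{(i)}\big\|^2\Big).$$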

The main implications of federated stability, when applied to the specifics of FedProx, can then be summarized in the following proposition.

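Proposition 10 (Implications of federated stability restricted to FedProx) Consider an algorithm $\mathcal{A} = \big(\hat w^{(\mathrm{global})}, \{\hat w^{(i)}\}\big)$ with federated uniform stability $\{\gamma_i\}_{i=1}^m$. Then we have

$$\mathbb{E}_{\mathcal{A},S}\big[\mathrm{AER}_p(\mathcal{A})\big] \le \mathbb{E}_{\mathcal{A},S}[\mathrm{OPT}] + 2\sum_{i\in[m]} p_i\,\mathbb{E}_{\mathcal{A},S}[\gamma_i] + \frac{\lambda}{2}\sum_{i\in[m]} p_i\big\|w_{\mathrm{avg}}^{(\mathrm{global})} - w_\star^{(i)}\big\|^2, \tag{21}$$

$$\mathbb{E}_{\mathcal{A},S}\big[\mathrm{IER}_i(\mathcal{A})\big] \le \frac{\mathbb{E}_{\mathcal{A},S}[\mathrm{OPT}]}{p_i} + 2\,\mathbb{E}_{\mathcal{A},S}[\gamma_i] + \frac{\lambda}{2}\,\mathbb{E}_{\mathcal{A},S}\big\|\hat w^{(\mathrm{global})} - w_\star^{(i)}\big\|^2 \quad \forall i \in [m]. \tag{22}$$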

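Proof The proof of (21) is based on the following basic inequality for the AER:

$$\sum_{i\in[m]} p_i\Big(L_i\big(\tilde w^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|\tilde w^{(\mathrm{global})} - \tilde w^{(i)}\big\|^2\Big) \le \sum_{i\in[m]} p_i\Big(L_i\big(w_\star^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|w_{\mathrm{avg}}^{(\mathrm{global})} - w_\star^{(i)}\big\|^2\Big), \tag{23}$$

whereas the proof of (22) is based on the following basic inequality for the IER: for any $s \in [m]$, we have

$$\sum_{i\in[m]} p_i\Big(L_i\big(\tilde w^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|\tilde w^{(\mathrm{global})} - \tilde w^{(i)}\big\|^2\Big) \le p_s\Big(L_s\big(w_\star^{(s)}, S_s\big) + \frac{\lambda}{2}\big\|\hat w^{(\mathrm{global})} - w_\star^{(s)}\big\|^2\Big) + \sum_{i\ne s} p_i\Big(L_i\big(\hat w^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|\hat w^{(\mathrm{global})} - \hat w^{(i)}\big\|^2\Big). \tag{24}$$

We refer the readers to Appendix C.2 for details. ■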

Note that both bounds in Proposition 10 involve a term that scales linearly with both $\lambda$ and the heterogeneity measure. In general, we expect the stability measures to scale inversely with $\lambda$, which opens the possibility of carefully choosing $\lambda$ to balance the stability term and the heterogeneity term.

Let us observe that the heterogeneity term of (22) is slightly different than that of (21), in that it involves the estimated global model w^(global). This suggests that achieving the IER guarantees might be intrinsically more difficult than achieving the AER guarantees.

In view of Proposition 10, we are left to bound the optimization error and the federated stability of FedProx. As discussed above, achieving the AER and IER guarantees requires somewhat different assumptions, as the latter involves characterizing the performance of the global model. So we split our discussion into two parts below.

Bounding the average excess error.

The following theorem characterizes the performance of FedProx in terms of the AER.

Theorem 11 (AER guarantees for FedProx) Let Assumptions A and B(a) hold, and assume $n_i \ge 4\beta/\mu$ for all $i \in [m]$. Choose the weight vector $p$ such that

$$\frac{p_{\max}\sum_{i\in[m]} p_i/n_i}{\sum_{i\in[m]} p_i^2/n_i} \le C_p \tag{25}$$

for some constant $C_p$, where $p_{\max} = \max_i p_i$. Consider the FedProx algorithm, $\mathcal{A}_{\mathrm{FP}}$, with the following hyperparameter configuration:

  1. In the joint training stage (i.e., 0tT-1), set
    ηt,k(i)=1(μ+λ)(k+1), ηt(global)=2(μ+λ)λμ(t+1), Kt+1C1λ21t,TC2λ(λ1)mp2i[m]pi/ni-1λ(λ1)nmax2; (26)
  2. In the final training stage (i.e., t=T), set
    ηT,k(i)=1(μ+λ)(k+1),KTC3(λ+1)2impini-1λ2maximpini2, (27)

where $C_1, C_2, C_3$ are constants depending only on $\mu, \beta, \|\ell\|_\infty, D$. Then, there exists a choice of $\lambda$ such that

E𝒜FP,SAERp𝒜FPμ1Cp+1CpβμRimpiniimpini+impi2ni. (28)

Proof See Appendix C.3. ■

A few remarks are in order. First, (25) essentially says that the weight $p$ cannot be too imbalanced, and too much imbalance in $p$ can hurt the performance in view of the multiplicative factor of $C_p$ in our bound (28). If we set $p_i = 1/m$, then $C_p$ is naturally of constant order; whereas if we set $p_i = n_i/N$, we have $C_p \asymp m n_{\max}/N$, where $n_{\max} = \max_i n_i$, which calls for relative balance of the sample sizes.

We then briefly comment on the hyperparameter choice in the above theorem. The step sizes are of the form 1/(strong convexity constant × iteration counter), and such a choice is common in strongly convex stochastic optimization problems (see, e.g., Rakhlin et al. 2011; Shamir and Zhang 2013). Such a choice, along with the smoothness of the problem, is also the key for us to bypass the need for any time-averaging operation, as is done in, for example, Dinh et al. (2020).
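The effect of this step-size schedule can be seen on a toy inner problem. The sketch below is an illustration under simplifying assumptions (a quadratic local loss, Gaussian gradient noise, and no projection step), and all names and numerical values are hypothetical; it is not Algorithm 2 itself.

```python
import numpy as np

def inner_sgd(w0, w_global, a, mu, lam, steps, noise_std, rng):
    """SGD on mu/2*||w - a||^2 + lam/2*||w - w_global||^2 with step 1/((mu+lam)(k+1))."""
    w = np.asarray(w0, dtype=float).copy()
    for k in range(steps):
        grad = mu * (w - a) + lam * (w - w_global)
        grad = grad + noise_std * rng.standard_normal(w.shape)  # unbiased gradient noise
        w = w - grad / ((mu + lam) * (k + 1))
    return w  # last iterate, no averaging of iterates

rng = np.random.default_rng(1)
mu, lam = 1.0, 0.5
a = np.array([1.0, -1.0, 2.0])
w_global = np.zeros(3)
target = (mu * a + lam * w_global) / (mu + lam)   # exact minimizer of the objective
w_last = inner_sgd(np.full(3, 5.0), w_global, a, mu, lam, 5000, 1.0, rng)
err = float(np.linalg.norm(w_last - target))
```

On this quadratic, the schedule makes the last iterate equal to the minimizer minus the running average of the scaled gradient noise, which is why the last iterate itself converges without any explicit averaging.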

In Theorem 11, the choice of the communication rounds $T$ and the final local training round $K_T$ both scale polynomially with $\lambda$, which means that the optimization convergence of FedProx is slower when the data are less heterogeneous. This phenomenon happens more generally. For example, Hanzely and Richtárik (2020) proposed a variant of SGD that optimizes (20) with $p_i = 1/m$ in $\mathcal{O}\big(\frac{L+\lambda}{\mu}\log(1/\varepsilon)\big)$-many iterations, where $L$ is the Lipschitz constant of the loss function and $\varepsilon$ is the desired accuracy level.

The constants $C_1, C_2, C_3$ in the statement of Theorem 11 can be explicitly traced in our proof. We remark that the dependence on problem-specific constants $\mu, \beta, \|\ell\|_\infty, D$ in our hyperparameter choice and on $\lambda$ may not be tight. A tight analysis of the optimization error is interesting, but less relevant for our purpose of understanding the sample complexity. So we defer such an analysis to future work2.

Bounding individualized excess errors.

The following theorem gives the IER guarantees for FedProx.

Theorem 12 (IER guarantees for FedProx) Let Assumptions A and B(b) hold. Moreover, assume that $n_i \asymp n_{i'}$ for any $i \ne i' \in [m]$ and $n_i \ge 4\beta/\mu$ for all $i \in [m]$. Let the weight vector be chosen as $p_i \asymp 1/m$ for all $i \in [m]$. Consider the FedProx algorithm, $\mathcal{A}_{\mathrm{FP}}$, with the following hyperparameter configuration:

  1. In the joint training stage (i.e., 0tT-1), set ηt,k(i),ηt(global),Kt as in (26), and set
    TC2λ(λ1)max i[m]nipi-1λ(λ1)ni;
  2. In the final training stage (i.e., t=T), set ηT,k(i) as in (27), and set
    KTC3(λ+1)2max i[m]nipi-1λ2pi2ni,

where $C_2, C_3$ are constants only depending on $\mu, \beta, \|\ell\|_\infty, D$. Then, there exists a choice of $\lambda$ such that for any $i \in [m]$, we have

E𝒜FP,SIERi𝒜FPμ+μ-1β+σ2β2+β2+σ2μ2+μD2Rni1ni+mN. (29)

Proof See Appendix C.4. ■

Compared to Theorem 11, the above theorem imposes extra assumptions that the sample sizes are relatively balanced and that $p_i \asymp 1/m$, both of which are due to the fact that we need to additionally take care of the estimation error of the global model. The hyperparameter choice slightly differs from that in Theorem 11 for the same reason. In practice, when one uses FedProx to optimize highly non-convex functions like the loss function of deep neural networks, instead of sticking to the choices made in Theorems 11 and 12, the hyperparameters are usually tuned by trial and error for best test performance.

Comparison with the lower bounds.

In order to comment about the optimality/suboptimality of FedProx, let us restrict to the case when pi=ni/N. In this case, the bound in Theorem 11 becomes

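$$\mathbb{E}_{\mathcal{A}_{\mathrm{FP}},S}\big[\mathrm{AER}_p(\mathcal{A}_{\mathrm{FP}})\big] \lesssim \Big(\mu + \frac{\beta\|\ell\|_\infty}{\mu}\Big)\Big(\frac{1}{N/m} \wedge \frac{R}{\sqrt{N/m}} + \frac{1}{N}\Big). \tag{30}$$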

Recall the lower bound in (16). Focusing on the dependence on the sample sizes and the heterogeneity measure, we have the following three cases. If $R^2 \gtrsim m/N$, then (30) becomes $\mathcal{O}(m/N)$, which matches the lower bound. Meanwhile, if $1/(mN) \lesssim R^2 \lesssim m/N$, then (30) becomes $\mathcal{O}\big(R\sqrt{m/N}\big)$, whereas the lower bound reads $\Omega(R^2 + 1/N)$, and thus (30) is suboptimal unless $R^2 \asymp m/N$. Moreover, if $R^2 \lesssim 1/(mN)$, then (30) becomes $\mathcal{O}(1/N)$, and is minimax optimal again.
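The three regimes can be checked numerically. The sketch below assumes that, with constants dropped, the upper bound (30) has the shape $\min(m/N, R\sqrt{m/N}) + 1/N$ and the lower bound (16) has the shape $\min(m/N, R^2) + 1/N$; the function names and the values of $m$ and $N$ are illustrative.

```python
import math

def fedprox_aer_upper(R, m, N):
    """Assumed shape of the upper bound (30), constants dropped."""
    return min(m / N, R * math.sqrt(m / N)) + 1.0 / N

def aer_lower(R, m, N):
    """Assumed shape of the lower bound (16), constants dropped."""
    return min(m / N, R * R) + 1.0 / N

def gap(R, m, N):
    """Ratio of the upper bound to the lower bound; order one means rate optimal."""
    return fedprox_aer_upper(R, m, N) / aer_lower(R, m, N)
```

With `m = 100` and `N = 10**6`, the ratio is of order one for `R = 1.0` (large heterogeneity) and for `R = 1e-6` (very small heterogeneity), but it exceeds 5 at `R = 1e-3`, which lies in the intermediate regime.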

A similar trichotomy holds for the IER of FedProx. Comparing the upper bound in (29) and the lower bound in (17), we still have three cases as follows. If $R^2 \gtrsim m/N$, then (29) is $\mathcal{O}(1/n_i)$, which agrees with the lower bound. Meanwhile, if $1/N \lesssim R^2 \lesssim m/N$, then (29) is $\mathcal{O}\big(R/\sqrt{n_i}\big)$, and is suboptimal compared to the $\Omega(R^2 + 1/N)$ lower bound unless $R^2 \asymp m/N$. Moreover, if $R^2 \lesssim 1/N$, then (29) is $\mathcal{O}\big(\sqrt{m}/N\big)$, and is off by a factor of order $\sqrt{m}$ compared to the $\Omega(1/N)$ lower bound.

While the bounds in Theorems 11 and 12 in general do not attain the lower bounds in Corollary 9, they are still non-trivial in the sense that they scale with the heterogeneity measure R. While there are some recent works establishing the AER guarantees for an objective similar to (20) under the online learning setup (see, e.g., Denevi et al. 2019; Balcan et al. 2019; Khodak et al. 2019), to the best of our knowledge, Theorems 11 and 12 are the first to establish both the AER and IER guarantees for (20) under the federated learning setup.

Curious readers may wonder if the suboptimality of the theoretical guarantees for FedProx (with non-zero λ) is a characteristic of this algorithm or an artifact of our proof technique. To answer this question, we conduct a simulation where we apply FedProx with different values of λ on datasets generated by federated logistic regression (see Appendix D for details). The accuracies versus different values of R are shown in the right panel of Figure 1. As expected from our theory, the performance of FedProx with λ=0 mimics that of PureLocalTraining, whereas the performance with λ=4 resembles that of FedAvg. Interestingly, FedProx with λ=0.44 performs similarly to the FedAvg followed by fine tuning strategy, which we know is minimax optimal. This observation supports the conjecture that optimally tuned FedProx is indeed minimax optimal, and the suboptimality of the bounds from Theorems 11 and 12 is likely to be an artifact of our theoretical analysis.

4. Discussion

This paper studies the statistical properties of personalized federated learning. Focusing on strongly-convex, smooth, and bounded empirical risk minimization problems, we have uncovered an intriguing phenomenon that given a specific level of heterogeneity, exactly one of FedAvg or PureLocalTraining is minimax optimal. In the course of proving this result, we obtained a novel analysis of FedProx and introduced a new notion of algorithmic stability termed federated stability, which is possibly of independent interest for analyzing generalization properties in the context of federated learning.

We close this paper by mentioning several open problems.

  • Dependence on problem-specific parameters. This paper focuses on the dependence on the sample sizes, and in our bounds, the dependence on problem-specific parameters (e.g., the smoothness and strong convexity constants) may not be optimal. This can be problematic if those parameters are not of constant order, and it will be interesting to give a refined analysis that gives optimal dependence on those parameters.

  • A refined analysis of FedProx. The upper bounds we develop for FedProx, as we have mentioned, do not match our minimax lower bounds. According to a simulated example, we suspect that this is an artifact of our analysis and a refined analysis of FedProx would be a welcome advance.

  • Estimation of the level of heterogeneity and development of adaptive algorithms. For unsupervised problems where evaluation of a model is difficult, implementation of the oracle dichotomous strategy described in Section 3.3 would require estimating the level of heterogeneity R. Even for supervised problems, estimation of R would be interesting, as it allows one to decide which algorithm to choose without model training. More generally, developing adaptive algorithms that attain the lower bound without prior information about R is an important open problem.

  • Beyond the current heterogeneity assumption. As discussed in Section 3.3, our theoretical results may not hold globally when one moves from Assumption B to more general heterogeneity assumptions. Establishing the minimax rates and designing provably optimal algorithms under those assumptions are of both theoretical and practical interest.

  • Beyond convexity. Our analysis is heavily contingent upon the strong convexity of the loss function, which, to the best of our knowledge, is not easily generalizable to the non-convex case. Meanwhile, our notion of heterogeneity, which is based on the distance of optimal local models to the convex combination of them, may not be natural for non-convex problems. It is of interest, albeit difficult, to have a theoretical investigation of personalized federated learning for non-convex problems.

Acknowledgments

This work was supported in part by NIH through R01-GM124111 and RF1-AG063481, NSF through CAREER DMS-1847415, CCF-1763314, and CCF-1934876, and an Alfred P. Sloan Research Fellowship.

A. Proof of Theorem 8: Lower Bounds

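We start by presenting a lower bound when all $w_\star^{(i)}$’s are the same.

Lemma 13 (Lower bound under homogeneity) Consider the logistic regression model with $w_\star^{(i)} = w_p^{(\mathrm{global})}$ for any $i \in [m]$. Then

$$\inf_{\hat w^{(\mathrm{global})}}\ \sup_{w_p^{(\mathrm{global})}}\ \mathbb{E}_S\big\|\hat w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|^2 \gtrsim \frac{d}{N}.$$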

Proof This is a classical result. See, e.g., Example 8.4 of Duchi (2019). ■

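Proof [Proof of (14)] We first give a lower bound based on the observation that the homogeneous case is in fact included in the parameter space $\mathcal{P}_1$. More explicitly, let us define $\mathcal{P}_0 = \big\{\{w_\star^{(i)}\} \in \mathcal{P}_1 : w_\star^{(i)} = w_p^{(\mathrm{global})}\ \forall i \in [m]\big\}$. By Lemma 13, we have

$$\inf_{\{\hat w^{(i)}\}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}_1}\ \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - w_\star^{(i)}\big\|^2 \ge \inf_{\{\hat w^{(i)}\}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}_0}\ \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - w_\star^{(i)}\big\|^2 = \inf_{\hat w^{(\mathrm{global})}}\ \sup_{w_p^{(\mathrm{global})}}\ \mathbb{E}_S\big\|\hat w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|^2 \gtrsim \frac{d}{N}. \tag{31}$$

We now use a variant of Assouad’s method (Assouad, 1983) that allows us to tackle multiple datasets. Consider the following data generating process: nature generates $V = \{v^{(i)} : i \in [m]\}$ i.i.d. from the uniform distribution on $\mathcal{V} = \{\pm 1\}^d$ and sets $w_\star^{(i)} = \delta_i v^{(i)}$ for some $\delta_i$ such that the following constraint is satisfied:

$$\sum_{i\in[m]} p_i\big\|w_\star^{(i)} - w_p^{(\mathrm{global})}\big\|^2 = \sum_{i\in[m]} p_i\Big\|\delta_i v^{(i)} - \sum_{s\in[m]} p_s\delta_s v^{(s)}\Big\|^2 \le R^2. \tag{32}$$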

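We will specify the choice of $\delta_i$’s later. Denoting $\mathbb{E}_X$ as the marginal expectation operator with respect to all the features $\{x_j^{(i)}\}$ and $\mathbb{E}_{Y|X}$ as the conditional expectation operator with respect to $\{y_j^{(i)}\} \mid \{x_j^{(i)}\}$, we can lower bound the minimax risk by the Bayes risk as follows:

$$\begin{aligned}
\inf_{\{\hat w^{(i)}\}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}}\ \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - w_\star^{(i)}\big\|^2
&\ge \inf_{\{\hat w^{(i)}\}}\ \mathbb{E}_{\{v^{(i)}\}} \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - \delta_i v^{(i)}\big\|^2 \\
&= \inf_{\{\hat v^{(i)}\} \subseteq \mathcal{V}}\ \sum_{i\in[m]} p_i\,\mathbb{E}_{V,S}\big\|\delta_i\hat v^{(i)} - \delta_i v^{(i)}\big\|^2 \\
&\ge \mathbb{E}_X \sum_{i\in[m]} p_i\delta_i^2 \inf_{\hat v^{(i)} \in \mathcal{V}} \mathbb{E}_{V,Y|X}\big\|\hat v^{(i)} - v^{(i)}\big\|^2 \\
&\ge \mathbb{E}_X \sum_{i\in[m]} p_i\delta_i^2 \sum_{k\in[d]} \inf_{\hat v_k^{(i)} \in \{\pm 1\}} \mathbb{E}_{V,Y|X}\big(\hat v_k^{(i)} - v_k^{(i)}\big)^2 \\
&\ge \mathbb{E}_X \sum_{i\in[m]} p_i\delta_i^2 \sum_{k\in[d]} \inf_{\hat v_k^{(i)} \in \{\pm 1\}} \mathbb{P}_{V,Y|X}\big(\hat v_k^{(i)} \ne v_k^{(i)}\big) \\
&= \frac{1}{2}\,\mathbb{E}_X \sum_{i\in[m]} p_i\delta_i^2 \sum_{k\in[d]} \inf_{\hat v_k^{(i)} \in \{\pm 1\}} \Big(\mathbb{P}_{i,+k}\big(\hat v_k^{(i)} = -1\big) + \mathbb{P}_{i,-k}\big(\hat v_k^{(i)} = +1\big)\Big),
\end{aligned}$$

where in the last line, we have let $\mathbb{P}_{i,\pm k}(\cdot) = \mathbb{P}_{V,Y|X}\big(\cdot \mid v_k^{(i)} = \pm 1\big)$ denote the probability measure with respect to the randomness in $(V, S)$ conditional on the features $\{x_j^{(i)}\}$ as well as the realization of $v_k^{(i)} = \pm 1$. More explicitly, we can write

$$\mathbb{P}_{i,\pm k} = \Big(\bigotimes_{s\ne i} \mathbb{P}_{v^{(s)}} \otimes \mathbb{P}_{\{y_j^{(s)}\}_{j=1}^{n_s} \mid v^{(s)}, \{x_j^{(s)}\}_{j=1}^{n_s}}\Big) \otimes \mathbb{P}_{v^{(i)} \mid v_k^{(i)} = \pm 1} \otimes \mathbb{P}_{\{y_j^{(i)}\}_{j=1}^{n_i} \mid v^{(i)}, v_k^{(i)} = \pm 1, \{x_j^{(i)}\}_{j=1}^{n_i}} = \frac{1}{2^{(m-1)d + d - 1}} \sum_{V \setminus v_k^{(i)}} \mathbb{P}_{V,i,\pm k},$$

where the symbol $\otimes$ stands for taking the product of two measures and $\mathbb{P}_{V,i,\pm k}$ corresponds to the law of all the labels $Y$ conditional on a specific realization of $V$ with $v_k^{(i)} = \pm 1$ and the features $X$. With the current notations and letting $\|P - Q\|_{\mathrm{TV}}$ be the total variation distance between two probability measures $P$ and $Q$, we can invoke the Neyman-Pearson lemma to get

$$\inf_{\{\hat w^{(i)}\}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}}\ \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - w_\star^{(i)}\big\|^2 \gtrsim \mathbb{E}_X \sum_{i\in[m]} p_i\delta_i^2 \sum_{k\in[d]} \Big(1 - \big\|\mathbb{P}_{i,+k} - \mathbb{P}_{i,-k}\big\|_{\mathrm{TV}}\Big) = d\sum_{i\in[m]} p_i\delta_i^2 - \mathbb{E}_X \sum_{i\in[m]} p_i\delta_i^2 \sum_{k\in[d]} \big\|\mathbb{P}_{i,+k} - \mathbb{P}_{i,-k}\big\|_{\mathrm{TV}}. \tag{33}$$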

We then proceed by

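$$\begin{aligned}
\sum_{i\in[m]} p_i\delta_i^2 \sum_{k\in[d]} \big\|\mathbb{P}_{i,+k} - \mathbb{P}_{i,-k}\big\|_{\mathrm{TV}}
&\le \sum_{i\in[m]} p_i\delta_i^2 \sqrt{d}\,\Big(\sum_{k\in[d]} \big\|\mathbb{P}_{i,+k} - \mathbb{P}_{i,-k}\big\|_{\mathrm{TV}}^2\Big)^{1/2} \\
&= \sum_{i\in[m]} p_i\delta_i^2 \sqrt{d}\,\Big(\sum_{k\in[d]} \Big\|\frac{1}{2^{(m-1)d+d-1}} \sum_{V \setminus v_k^{(i)}} \big(\mathbb{P}_{V,i,+k} - \mathbb{P}_{V,i,-k}\big)\Big\|_{\mathrm{TV}}^2\Big)^{1/2} \\
&\le \sum_{i\in[m]} p_i\delta_i^2 \sqrt{d}\,\Big(\sum_{k\in[d]} \frac{1}{2^{(m-1)d+d-1}} \sum_{V \setminus v_k^{(i)}} \big\|\mathbb{P}_{V,i,+k} - \mathbb{P}_{V,i,-k}\big\|_{\mathrm{TV}}^2\Big)^{1/2},
\end{aligned}$$

where the last inequality is by convexity of the total variation distance. Note that $\mathbb{P}_{V,i,\pm k}$ is the product of biased Rademacher random variables: if we let $\mathrm{Rad}(p)$ be the $\pm 1$-valued random variable with positive probability $p$, we can write

$$\mathbb{P}_{V,i,\pm k} = \bigotimes_{s\in[m]} \bigotimes_{j\in[n_s]} \mathrm{Rad}\Big(\frac{1}{1 + \exp\{-\delta_s\langle v^{(s)}, x_j^{(s)}\rangle\}}\Big), \qquad v_k^{(i)} = \pm 1.$$

Thus, by Pinsker’s inequality, we have

$$\big\|\mathbb{P}_{V,i,+k} - \mathbb{P}_{V,i,-k}\big\|_{\mathrm{TV}}^2 \le \frac{1}{2} D_{\mathrm{JS}}\big(\mathbb{P}_{V,i,+k} \,\|\, \mathbb{P}_{V,i,-k}\big) = \frac{1}{2}\sum_{s\ne i}\sum_{j\in[n_s]} 0 + \frac{1}{2}\sum_{j\in[n_i]} D_{\mathrm{JS}}\Big(\mathrm{Rad}\Big(\frac{1}{1+\exp\{-\delta_i\langle v^{(i)}, x_j^{(i)}\rangle\}}\Big) \,\Big\|\, \mathrm{Rad}\Big(\frac{1}{1+\exp\{-\delta_i\langle \tilde v^{(i)}, x_j^{(i)}\rangle\}}\Big)\Big),$$

where $D_{\mathrm{JS}}(P \,\|\, Q) = \frac{D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P)}{2}$ is the Jensen-Shannon divergence between $P$ and $Q$, and $v^{(i)}, \tilde v^{(i)}$ are two $\mathcal{V}$-valued vectors that differ only in the $k$-th coordinate. By a standard calculation, one finds that

$$D_{\mathrm{JS}}\Big(\mathrm{Rad}\Big(\frac{1}{1+\exp\{-\delta_i\langle v^{(i)}, x_j^{(i)}\rangle\}}\Big) \,\Big\|\, \mathrm{Rad}\Big(\frac{1}{1+\exp\{-\delta_i\langle \tilde v^{(i)}, x_j^{(i)}\rangle\}}\Big)\Big) \le \delta_i^2\big(v_k^{(i)} - \tilde v_k^{(i)}\big)^2\big(x_{j,k}^{(i)}\big)^2 = 4\delta_i^2\big(x_{j,k}^{(i)}\big)^2.$$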
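The “standard calculation” can be sanity-checked numerically. The sketch below assumes two logistic models with logits $a$ and $b$ (playing the role of $\delta_i\langle v^{(i)}, x_j^{(i)}\rangle$ and $\delta_i\langle \tilde v^{(i)}, x_j^{(i)}\rangle$) and evaluates the symmetrized divergence defined above; the function names are illustrative.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def sym_divergence(a, b):
    """(KL(P||Q) + KL(Q||P)) / 2 for P = Rad(sigmoid(a)) and Q = Rad(sigmoid(b))."""
    p, q = sigmoid(a), sigmoid(b)
    return 0.5 * (kl_bernoulli(p, q) + kl_bernoulli(q, p))
```

For these two-point distributions the divergence equals $\frac{1}{2}(a-b)\big(\sigma(a) - \sigma(b)\big)$ with $\sigma$ the sigmoid, which is at most $(a-b)^2$ because $\sigma$ is 1/4-Lipschitz; taking $a - b = \delta_i\big(v_k^{(i)} - \tilde v_k^{(i)}\big)x_{j,k}^{(i)}$ recovers the bound in the display.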

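This gives

$$\big\|\mathbb{P}_{V,i,+k} - \mathbb{P}_{V,i,-k}\big\|_{\mathrm{TV}}^2 \le 2\delta_i^2 \sum_{j\in[n_i]} \big(x_{j,k}^{(i)}\big)^2 \le 2\delta_i^2 c_X^2 n_i,$$

and hence

$$\sum_{i\in[m]} p_i\delta_i^2 \sum_{k\in[d]} \big\|\mathbb{P}_{i,+k} - \mathbb{P}_{i,-k}\big\|_{\mathrm{TV}} \le \sqrt{2}\,c_X \sum_{i\in[m]} p_i\delta_i^3\, d\, n_i^{1/2}.$$

Plugging the above display into (33) gives

$$\inf_{\{\hat w^{(i)}\}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}}\ \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - w_\star^{(i)}\big\|^2 \gtrsim d\Big(\sum_{i\in[m]} p_i\delta_i^2 - \sqrt{2}\,c_X \sum_{i\in[m]} p_i\delta_i^3\sqrt{n_i}\Big). \tag{34}$$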

At this point, all that is left is to choose $\delta_i$ appropriately so that (1) the above display is as tight as possible; (2) (32) is satisfied. We consider the following two cases:

  1. Assume $R^2 \ge d\sum_{i\in[m]} p_i/n_i = dm/N$. Note that we can re-write the requirement (32) to be
    $$d\sum_{i\in[m]} p_i\delta_i^2 - \Big\|\sum_{i\in[m]} p_i\delta_i v^{(i)}\Big\|^2 \le R^2.$$
    Under the current assumption, this requirement will be satisfied if we choose δi= c/ni for any c1. Under such a choice, the right-hand side of (34) becomes c2dmNc-2cX. Thus, by setting c=22cX, we get the following lower bound:
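    $$\inf_{\{\hat w^{(i)}\}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}}\ \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - w_\star^{(i)}\big\|^2 \gtrsim \frac{d}{N/m}.$$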
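  2. Assume $R^2 \le d\sum_{i\in[m]} p_i/n_i = dm/N$. Note that if we set $\delta_i \equiv \delta = cR/\sqrt{d}$ where $c \le 1$, (32) reads
    $$c^2R^2 - \Big\|\sum_{i\in[m]} p_i\delta_i v^{(i)}\Big\|^2 \le R^2,$$
    which trivially holds. Now, the right-hand side of (34) becomes
    $$c^2R^2\Big(1 - \sqrt{2}\,c\,c_X \sum_{i\in[m]} p_i R\sqrt{n_i/d}\Big).$$
    Since $p_i = n_i/N$ and $n_i \asymp N/m$, our assumption on $R$ gives
    $$\sqrt{2}\,c\,c_X \sum_{i\in[m]} p_i R\sqrt{n_i/d} \lesssim \sum_{i\in[m]} \frac{n_i}{N}\sqrt{\frac{m\,n_i}{N}} \asymp 1.$$
    This means that we can choose $c$ to be a small constant such that the following lower bound holds:
    $$\inf_{\{\hat w^{(i)}\}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}}\ \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - w_\star^{(i)}\big\|^2 \gtrsim R^2.$$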
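The re-writing of the requirement (32) used in the two cases follows from expanding the square. The derivation below is a sketch that writes $a_i = \delta_i v^{(i)}$ and $\bar a = \sum_{s\in[m]} p_s a_s$ (both shorthands are introduced here only for illustration) and uses $\sum_{i\in[m]} p_i = 1$ together with $\|v^{(i)}\|^2 = d$.

```latex
\begin{align*}
\sum_{i\in[m]} p_i \,\| a_i - \bar a \|^2
  &= \sum_{i\in[m]} p_i \|a_i\|^2 \;-\; 2\Big\langle \bar a,\ \sum_{i\in[m]} p_i a_i \Big\rangle \;+\; \|\bar a\|^2 \\
  &= \sum_{i\in[m]} p_i \|a_i\|^2 \;-\; \|\bar a\|^2 \\
  &= d \sum_{i\in[m]} p_i \delta_i^2 \;-\; \Big\| \sum_{i\in[m]} p_i \delta_i v^{(i)} \Big\|^2 .
\end{align*}
% In particular the left-hand side never exceeds d * sum_i p_i delta_i^2,
% so (32) holds whenever d * sum_i p_i delta_i^2 <= R^2.
```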

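Summarizing the above two cases, we arrive at

$$\inf_{\{\hat w^{(i)}\}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}}\ \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - w_\star^{(i)}\big\|^2 \gtrsim \frac{d}{N/m} \wedge R^2.$$

Combining the above bound with (31), we get

$$\inf_{\{\hat w^{(i)}\}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}}\ \sum_{i\in[m]} p_i\,\mathbb{E}_S\big\|\hat w^{(i)} - w_\star^{(i)}\big\|^2 \gtrsim \frac{d}{N/m} \wedge R^2 + \frac{d}{N},$$

which is the desired result. ■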

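Proof [Proof of (15)] The proof is similar to the proof of (14), and we only provide a sketch here. Without loss of generality we consider the first client. By the same arguments as in the proof of (14), the left-hand side of (15) is lower bounded by a constant multiple of $d/N$. Now, by considering the same prior distribution on $\mathcal{P}$ as in the proof of (14), we get

$$\inf_{\hat w^{(1)}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}}\ \mathbb{E}_S\big\|\hat w^{(1)} - w_\star^{(1)}\big\|^2 \gtrsim d\,\delta_1^2\big(1 - \delta_1\sqrt{n_1}\big),$$

where the $\delta_i$’s should obey the following inequality:

$$\Big\|\delta_i v^{(i)} - \sum_{s\in[m]} p_s\delta_s v^{(s)}\Big\|^2 \le R^2.$$

Choosing $\delta_i \asymp 1/\sqrt{n_i}$ when $R \ge \sqrt{dm/N}$ and $\delta_i \asymp R/\sqrt{d}$ otherwise, we arrive at

$$\inf_{\hat w^{(1)}}\ \sup_{\{w_\star^{(i)}\} \in \mathcal{P}}\ \mathbb{E}_S\big\|\hat w^{(1)} - w_\star^{(1)}\big\|^2 \gtrsim \frac{d}{n_1} \wedge R^2,$$

and the proof is concluded. ■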

B. Optimization Convergence of FedProx

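This section concerns the optimization convergence of FedProx. We first introduce some notations. Let $w_{t,k}^{(i)}$ be the output of the $k$-th step of Algorithm 2 when the initial local model is given by $w_t^{(i)} \equiv w_{t,0}^{(i)} \equiv w_{t-1,K}^{(i)}$, let $I_{t,k}^{(i)}$ be the corresponding minibatch taken, and denote the initial global model by $w_t^{(\mathrm{global})}$. Let $\mathcal{F}_{t,k}$ be the sigma algebra generated by the randomness of Algorithm 2 up to $w_{t,k}^{(i)}$, namely the randomness in $\big\{\mathcal{C}_\tau, \{I_{\tau,l}^{(i)} : i \in \mathcal{C}_\tau, 0 \le l \le K_\tau - 1\}\big\}_{\tau=0}^{t-1}$, $\mathcal{C}_t$, and $\{I_{t,l}^{(i)} : i \in \mathcal{C}_t, 0 \le l \le k-1\}$. For notational convenience we let $\mathcal{C}_T = [m]$ (i.e., all clients are involved in local training in Stage II of Algorithm 2). Then the sequence $\{w_{t,k}^{(i)}\}$ is adapted to the following filtration:

$$\mathcal{F}_{0,0} \subseteq \mathcal{F}_{0,1} \subseteq \cdots \subseteq \mathcal{F}_{0,K} \subseteq \mathcal{F}_{1,0} \subseteq \mathcal{F}_{1,1} \subseteq \cdots \subseteq \mathcal{F}_{1,K} \subseteq \cdots \subseteq \mathcal{F}_{T,K}.$$

We write the optimization problem (20) as

$$\min_{w^{(\mathrm{global})} \in \mathcal{W}}\ \sum_{i\in[m]} p_i F_i\big(w^{(\mathrm{global})}, S_i\big), \tag{35}$$

where

$$F_i\big(w^{(\mathrm{global})}, S_i\big) = \min_{w^{(i)} \in \mathcal{W}} \Big\{L_i\big(w^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|w^{(\mathrm{global})} - w^{(i)}\big\|^2\Big\}. \tag{36}$$

To simplify notations, we introduce the proximal operator

$$\mathrm{Prox}_{L_i/\lambda}\big(w^{(\mathrm{global})}\big) = \mathrm{Prox}_{L_i/\lambda}\big(w^{(\mathrm{global})}, S_i\big) = \operatorname*{argmin}_{w^{(i)} \in \mathcal{W}} \Big\{L_i\big(w^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|w^{(\mathrm{global})} - w^{(i)}\big\|^2\Big\}. \tag{37}$$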

The high-level idea of this proof is to regard $\lambda\sum_{i\in\mathcal{C}_t}\big(w_t^{(\mathrm{global})} - w_{t+1}^{(i)}\big)/B^{(\mathrm{global})}$ as a biased stochastic gradient of $\frac{1}{n}\sum_{i\in[m]} F_i\big(w_t^{(\mathrm{global})}, S_i\big)$. This idea has appeared in various places (see, e.g., the proof of Proposition 5 in Denevi et al. (2019) and the proof of Theorem 1 in Dinh et al. (2020)). However, the implementation of this idea in our case is more complicated than in the above-mentioned works in that (1) we are not in an online learning setup (compared to Denevi et al. (2019)); (2) we do not need to assume all clients are training at every round (compared to Dinh et al. (2020)); and (3) we use local SGD for the inner loop (instead of assuming the inner loop can be solved with arbitrary precision as assumed in Dinh et al. (2020)), so the gradient norm depends on $\lambda$, and could in principle be arbitrarily large, which causes extra complications.
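The reason $\lambda\big(w^{(\mathrm{global})} - w^{(i)}\big)$ behaves like a gradient is the standard envelope identity: when the inner problem is solved exactly, the gradient of $F_i$ at $w^{(\mathrm{global})}$ equals $\lambda$ times the gap between $w^{(\mathrm{global})}$ and the proximal point. The sketch below checks this numerically under illustrative assumptions: a quadratic local loss with curvature `MU`, no constraint set, and hypothetical function names.

```python
import numpy as np

MU = 2.0  # hypothetical curvature of the local loss MU/2 * ||w - a||^2

def prox(w_global, a, lam):
    """argmin_w  MU/2 * ||w - a||^2 + lam/2 * ||w_global - w||^2  (closed form)."""
    return (MU * a + lam * w_global) / (MU + lam)

def envelope(w_global, a, lam):
    """F_i(w_global): the minimal value of the objective above."""
    w = prox(w_global, a, lam)
    return 0.5 * MU * np.sum((w - a) ** 2) + 0.5 * lam * np.sum((w_global - w) ** 2)

def envelope_grad(w_global, a, lam):
    """Gradient of F_i, written as lam * (w_global - prox(w_global))."""
    return lam * (w_global - prox(w_global, a, lam))
```

When the inner loop is only solved approximately, as with local SGD, the approximate proximal point replaces the exact one and the resulting gradient estimate becomes biased.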

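Lemma 14 (Convergence of the inner loop) Let Assumption A(a, b) hold. Choose $\eta_{t,k}^{(i)} = \frac{1}{(\mu+\lambda)(k+1)}$. Then for any $k \ge 0$, we have

$$\mathbb{E}\Big[\big\|w_{t,k}^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\big\|^2 \,\Big|\, \mathcal{F}_{t,0},\ i \in \mathcal{C}_t\Big] \le \frac{8\beta^2D^2}{\mu^2(k+1)}.$$

Proof See Appendix B.1. ■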

Lemma 15 (Convergence of the outer loop) Let the assumptions in Lemma 14 hold. Choose $\eta_t^{(\mathrm{global})} = \frac{2(\mu+\lambda)}{\lambda\mu(t+1)}$ and assume
Kτ+14τ+20λ2β2D2μ2β2D22λλ2D2 0τt-1. (38)

Then for any t0, we have

E𝒜FPwt(global)-w˜(global)212(λ+μ)2mp2β2D22λλ2D2λ2μ2t+1, (39)

where the expectation is taken over the randomness in Algorithm 2.

Proof See Appendix B.2.■

Proposition 16 (Optimization error of 𝒜FP) Under the assumptions of Lemma 14 and 15, for any dataset S~i𝒟ini, we have

E𝒜FPOPT4(β+λ)β2D2μ2KT+1+6(λ+μ)2mp2β2D22λλ2D2λμ2(t+1)

Proof By definition we have

E𝒜FPOPTE𝒜FPi[m]piLiwT+1(i),Si+λ2wT(global )-wT+1(i)2-i[m]piFiw˜(global ),Si( a) i[m]pi(β+λ)2E𝒜FPwT+1(i)-ProxLi/λwT(global )2+E𝒜FPi[m]piFiwT(global) ,Si-i[m]piFiw˜(global) ,Si( b) 4(β+λ)β2D2μ2KT+1+λ2E𝒜FPwT(global) -w˜(global) 2 (c) 4(β+λ)β2D2μ2KT+1+6(λ+μ)2mp2β2D22λλ2D2λμ2(t+1).

where (a) is by smoothness of Li, (b) is by Lemma 14 and λ-smoothness of i[m]piFi (which holds by Lemma 17), and (c) is by Lemma 15.■

B.1. Proof of Lemma 14: Convergence of the Inner Loop

The proof is an adaptation of the proof of Lemma 1 in Rakhlin et al. (2011). However, we need to deal with the extra complication that the hyperparameter λ can in principle be arbitrarily large. We start by noting that

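$$\begin{aligned}
\big\|w_{t,k+1}^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\big\|^2
&= \Big\|\mathcal{P}_{\mathcal{W}}\Big[w_{t,k}^{(i)} - \frac{\eta_{t,k}^{(i)}}{B^{(i)}}\sum_{j\in I_{t,k}^{(i)}}\Big(\nabla\ell\big(w_{t,k}^{(i)}, z_j^{(i)}\big) + \lambda\big(w_{t,k}^{(i)} - w_t^{(\mathrm{global})}\big)\Big)\Big] - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\Big\|^2 \\
&\le \Big\|w_{t,k}^{(i)} - \frac{\eta_{t,k}^{(i)}}{B^{(i)}}\sum_{j\in I_{t,k}^{(i)}}\Big(\nabla\ell\big(w_{t,k}^{(i)}, z_j^{(i)}\big) + \lambda\big(w_{t,k}^{(i)} - w_t^{(\mathrm{global})}\big)\Big) - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\Big\|^2 \\
&= \big\|w_{t,k}^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\big\|^2 + \Big\|\frac{\eta_{t,k}^{(i)}}{B^{(i)}}\sum_{j\in I_{t,k}^{(i)}}\Big(\nabla\ell\big(w_{t,k}^{(i)}, z_j^{(i)}\big) + \lambda\big(w_{t,k}^{(i)} - w_t^{(\mathrm{global})}\big)\Big)\Big\|^2 \\
&\quad - 2\Big\langle w_{t,k}^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big),\ \frac{\eta_{t,k}^{(i)}}{B^{(i)}}\sum_{j\in I_{t,k}^{(i)}}\Big(\nabla\ell\big(w_{t,k}^{(i)}, z_j^{(i)}\big) + \lambda\big(w_{t,k}^{(i)} - w_t^{(\mathrm{global})}\big)\Big)\Big\rangle,
\end{aligned}$$

where the inequality is because $\mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big) \in \mathcal{W}$ and $\mathcal{P}_{\mathcal{W}}$ is non-expansive. Now by strong convexity and unbiasedness of the stochastic gradients, we have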

E[wt,k(i)ProxLi/λ(wt(global )),1B(i)jIt,k(i)((wt,k(i),zj(i))+λ(wt,k(i)wt(global )))|t,k,i𝒞t](Li(wt,k(i),Si)+λ2wt,k(i)w(global )2)(Li(ProxLi/λ(wt(global )),Si)+λ2ProxLi/λ(wt(global ))wt(global )2)+12(μi+λnmni)wt,k(i)ProxLi/λ(wt(global ))2(μ+λ)wt,k(i)ProxLi/λ(wt(global) )2.

On the other hand, applying Lemma 19 gives

Eηt,k(i)B(i)jIt,k(i)wt,k(i),zj(i)+λwt,k(i)-wt(global )2 t,k,i𝒞t=ηt,k(i)2ni/B(i)-1nini-1jniwt,k(i),zj(i)-wt,k(i),z(i)-2+1nijniwt,k(i),zj(i)+λwt,k(i)-wt(global )22ηt,k(i)2β2D2ni/B(i)-1ni-1+β+λnmni2wt,k(i)-ProxLi/λwt(global )2,

where in the second line we let wt,k(i),z.(i)-jniwt,k(i),zj(i)/ni, and in the last line is by the β-smoothness of (,z). Thus, we get

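$$\mathbb{E}\Big[\big\|w_{t,k+1}^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\big\|^2 \,\Big|\, \mathcal{F}_{t,k},\, i\in\mathcal{C}_t\Big] \le \Big(1 - 2\eta_{t,k}^{(i)}(\mu+\lambda) + \big(\eta_{t,k}^{(i)}\big)^2(\beta+\lambda)^2\Big)\big\|w_{t,k}^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\big\|^2 + 2\big(\eta_{t,k}^{(i)}\big)^2\beta^2D^2\,\frac{n_i/B^{(i)} - 1}{n_i - 1}. \tag{40}$$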

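We then proceed by induction. Note that if $k+1 \le 8\beta^2/\mu^2$, then we have the following trivial bound:

$$\mathbb{E}\Big[\big\|w_{t,k}^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\big\|^2 \,\Big|\, \mathcal{F}_{t,0},\, i\in\mathcal{C}_t\Big] \le D^2 \le \frac{8\beta^2D^2}{\mu^2(k+1)}, \tag{41}$$

where the first inequality is by $w_{t,k}^{(i)}, \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big) \in \mathcal{W}$ and the second inequality is by our assumption on $k$. Thus, it suffices to show

$$\mathbb{E}\Big[\big\|w_{t,k+1}^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\big\|^2 \,\Big|\, \mathcal{F}_{t,0},\, i\in\mathcal{C}_t\Big] \le \frac{8\beta^2D^2}{\mu^2(k+2)} \tag{42}$$

based on the inductive hypothesis (41) and $k+1 \ge 8\beta^2/\mu^2$. By the recursive relationship (40) and taking expectation, we have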

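$$\mathbb{E}\Big[\big\|w_{t,k+1}^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\big\|^2 \,\Big|\, \mathcal{F}_{t,0},\, i\in\mathcal{C}_t\Big] \le \Big(1 - 2\eta_{t,k}^{(i)}(\mu+\lambda) + \big(\eta_{t,k}^{(i)}\big)^2(\beta+\lambda)^2\Big)\frac{8\beta^2D^2}{\mu^2(k+1)} + 2\big(\eta_{t,k}^{(i)}\big)^2\beta^2D^2\,\frac{n_i/B^{(i)} - 1}{n_i - 1}.$$

Hence (42) is satisfied if

$$\frac{8\beta^2D^2}{\mu^2}\Big[\frac{1}{k+2} - \frac{1}{k+1} + \frac{2\eta_{t,k}^{(i)}}{k+1}(\mu+\lambda) - \frac{\big(\eta_{t,k}^{(i)}\big)^2}{k+1}(\beta+\lambda)^2\Big] \le -2\big(\eta_{t,k}^{(i)}\big)^2\beta^2D^2\,\frac{n_i/B^{(i)} - 1}{n_i - 1} + \frac{16\beta^2D^2}{\mu^2}\Big[\frac{1}{k+2} - \frac{1}{k+1} + \frac{2\eta_{t,k}^{(i)}}{k+1}(\mu+\lambda) - \frac{\big(\eta_{t,k}^{(i)}\big)^2}{k+1}(\beta+\lambda)^2\Big],$$

that is, if

$$\frac{8\beta^2D^2}{\mu^2}\Big[\frac{1}{k+2} - \frac{1}{k+1} + \frac{2\eta_{t,k}^{(i)}}{k+1}(\mu+\lambda) - \frac{\big(\eta_{t,k}^{(i)}\big)^2}{k+1}(\beta+\lambda)^2\Big] \ge 2\big(\eta_{t,k}^{(i)}\big)^2\beta^2D^2\,\frac{n_i/B^{(i)} - 1}{n_i - 1}.$$

By our choice of $\eta_{t,k}^{(i)}$, the above display is equivalent to

$$\frac{8\beta^2D^2}{\mu^2}\Big[-\frac{1}{(k+1)(k+2)} + \frac{2}{(k+1)^2} - \frac{1}{(k+1)^3}\Big(\frac{\beta+\lambda}{\mu+\lambda}\Big)^2\Big] \ge \frac{2\beta^2D^2}{(\mu+\lambda)^2(k+1)^2}\cdot\frac{n_i/B^{(i)} - 1}{n_i - 1},$$

which is further equivalent to

$$\frac{8\beta^2D^2}{\mu^2}\Big[-\frac{k+1}{k+2} + 2 - \frac{1}{k+1}\Big(\frac{\beta+\lambda}{\mu+\lambda}\Big)^2\Big] \ge \frac{2\beta^2D^2}{(\mu+\lambda)^2}\cdot\frac{n_i/B^{(i)} - 1}{n_i - 1}.$$

We now claim that

$$\frac{1}{k+1}\Big(\frac{\beta+\lambda}{\mu+\lambda}\Big)^2 \le \frac{1}{2}.$$

Indeed, since $k+1 \ge 8\beta^2/\mu^2$, (1) if $\lambda \le \beta$, then the left-hand side above is at most $\frac{4\beta^2}{\mu^2(k+1)} \le \frac12$; and (2) if $\lambda \ge \beta$, the left-hand side above is at most $\frac{4}{k+1} \le \frac{\mu^2}{2\beta^2} \le \frac12$. By the above claim, and since $-\frac{k+1}{k+2} \ge -1$, (42) would hold if

$$\frac{4\beta^2D^2}{\mu^2} \ge \frac{2\beta^2D^2}{(\mu+\lambda)^2}\cdot\frac{n_i/B^{(i)} - 1}{n_i - 1}.$$

We finish the proof by noting that the right-hand side above is bounded above by $2\beta^2D^2/\mu^2$.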

B.2. Proof of Lemma 15: Convergence of the Outer Loop

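By construction we have

$$\begin{aligned}
\big\|w_{t+1}^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2
&= \Big\|w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})} - \frac{\lambda m\,\eta_t^{(\mathrm{global})}}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} p_i\big(w_t^{(\mathrm{global})} - w_{t+1}^{(i)}\big)\Big\|^2 \\
&= \big\|w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 + \Big\|\frac{\lambda m\,\eta_t^{(\mathrm{global})}}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} p_i\big(w_t^{(\mathrm{global})} - w_{t+1}^{(i)}\big)\Big\|^2 - 2\Big\langle w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})},\ \frac{\lambda m\,\eta_t^{(\mathrm{global})}}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} p_i\big(w_t^{(\mathrm{global})} - w_{t+1}^{(i)}\big)\Big\rangle \\
&\le \big\|w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 - \underbrace{2\Big\langle w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})},\ \frac{\lambda m\,\eta_t^{(\mathrm{global})}}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} p_i\Big(w_t^{(\mathrm{global})} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\Big)\Big\rangle}_{\mathrm{I}} \\
&\quad + \underbrace{2\Big\|\frac{\lambda m\,\eta_t^{(\mathrm{global})}}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} p_i\Big(w_t^{(\mathrm{global})} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\Big)\Big\|^2}_{\mathrm{II}} + \underbrace{2\Big\|\frac{\lambda m\,\eta_t^{(\mathrm{global})}}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} p_i\Big(\mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big) - w_{t+1}^{(i)}\Big)\Big\|^2}_{\mathrm{III}} \\
&\quad - \underbrace{2\Big\langle w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})},\ \frac{\lambda m\,\eta_t^{(\mathrm{global})}}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} p_i\Big(\mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big) - w_{t+1}^{(i)}\Big)\Big\rangle}_{\mathrm{IV}}.
\end{aligned}$$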

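We first consider Term I. Note that $\frac{\lambda m}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} p_i\Big(w_t^{(\mathrm{global})} - \mathrm{Prox}_{L_i/\lambda}\big(w_t^{(\mathrm{global})}\big)\Big)$ is an unbiased stochastic gradient of $\sum_i p_i F_i$, which is $\mu_F = \lambda\mu/(\lambda+\mu)$-strongly convex. Thus, we have

$$\mathbb{E}\big[\mathrm{I} \,\big|\, \mathcal{F}_{t-1,K_{t-1}}\big] = 2\eta_t^{(\mathrm{global})}\Big\langle w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})},\ \sum_{i\in[m]} p_i \nabla F_i\big(w_t^{(\mathrm{global})}, S_i\big)\Big\rangle \ge 2\eta_t^{(\mathrm{global})}\mu_F\big\|w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2.$$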

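Now for Term II, we have

$$\begin{aligned}
\mathbb{E}\big[\mathrm{II} \,\big|\, \mathcal{F}_{t-1,K_{t-1}}\big]
&\le 2\big(\eta_t^{(\mathrm{global})}\big)^2\,\mathbb{E}\Big[\Big(\frac{1}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} m p_i\Big)^2 \,\Big|\, \mathcal{F}_{t-1,K_{t-1}}\Big]\max_{i\in[m]}\big\|\nabla F_i\big(w_t^{(\mathrm{global})}, S_i\big)\big\|^2 \\
&\le 2\big(\eta_t^{(\mathrm{global})}\big)^2\,\mathbb{E}\Big[\Big(\frac{1}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} m p_i\Big)^2 \,\Big|\, \mathcal{F}_{t-1,K_{t-1}}\Big]\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big) \\
&\le 2\big(\eta_t^{(\mathrm{global})}\big)^2\Big[\frac{1}{m}\sum_{i\in[m]}(m p_i - 1)^2 + 1\Big]\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big) \\
&= 2\big(\eta_t^{(\mathrm{global})}\big)^2\, m\|p\|^2\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big),
\end{aligned}$$

where the second line is by Lemma 18 and the third line is by Lemma 19. For Term III, we invoke Lemma 14 to get

$$\mathbb{E}\big[\mathrm{III} \,\big|\, \mathcal{F}_{t-1,K_{t-1}}\big] \le 2\lambda^2\big(\eta_t^{(\mathrm{global})}\big)^2\,\frac{8\beta^2D^2}{\mu^2(K_t+1)}\,\mathbb{E}\Big[\Big(\frac{1}{B^{(\mathrm{global})}}\sum_{i\in\mathcal{C}_t} m p_i\Big)^2 \,\Big|\, \mathcal{F}_{t-1,K_{t-1}}\Big] \le \frac{16\lambda^2\big(\eta_t^{(\mathrm{global})}\big)^2\beta^2D^2\, m\|p\|^2}{\mu^2(K_t+1)},$$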

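where the last line is again by Lemma 19. For Term IV, we invoke Young’s inequality for products to get

$$\begin{aligned}
\mathbb{E}\big[-\mathrm{IV} \,\big|\, \mathcal{F}_{t-1,K_{t-1}}\big]
&\le \eta_t^{(\mathrm{global})}\mu_F\big\|w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 + \big(\eta_t^{(\mathrm{global})}\mu_F\big)^{-1}\,\mathbb{E}\Big[\frac{\mathrm{III}}{2} \,\Big|\, \mathcal{F}_{t-1,K_{t-1}}\Big] \\
&\le \eta_t^{(\mathrm{global})}\mu_F\big\|w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 + \big(\eta_t^{(\mathrm{global})}\mu_F\big)^{-1}\,\frac{8\lambda^2\big(\eta_t^{(\mathrm{global})}\big)^2\beta^2D^2\, m\|p\|^2}{\mu^2(K_t+1)}.
\end{aligned}$$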

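Summarizing the above bounds on the four terms, we arrive at

$$\mathbb{E}\Big[\big\|w_{t+1}^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 \,\Big|\, \mathcal{F}_{t-1,K_{t-1}}\Big] \le \big(1 - \eta_t^{(\mathrm{global})}\mu_F\big)\big\|w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 + 2\underbrace{\big(\eta_t^{(\mathrm{global})}\big)^2\, m\|p\|^2\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big)}_{\mathrm{V}} + \underbrace{\frac{\lambda^2\big(\eta_t^{(\mathrm{global})}\big)^2\beta^2D^2\, m\|p\|^2}{\mu^2(K_t+1)}\Big(16 + \frac{8}{\eta_t^{(\mathrm{global})}\mu_F}\Big)}_{\mathrm{VI}}.$$

We claim that $\mathrm{VI} \le \mathrm{V}$. Indeed, with our choice of $\eta_t^{(\mathrm{global})} = \frac{2}{\mu_F(t+1)}$, we have $16 + 8/\big(\eta_t^{(\mathrm{global})}\mu_F\big) = 20 + 4t$, so this claim is equivalent to

$$\frac{20+4t}{\mu^2(K_t+1)} \le \frac{1}{\lambda^2} \wedge \frac{2\|\ell\|_\infty}{\lambda\beta^2D^2} \wedge \frac{1}{\beta^2},$$

which is exactly (38). Thus, we have

$$\mathbb{E}\Big[\big\|w_{t+1}^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 \,\Big|\, \mathcal{F}_{t-1,K_{t-1}}\Big] \le \big(1 - \eta_t^{(\mathrm{global})}\mu_F\big)\big\|w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 + 3\mathrm{V} = \Big(1 - \frac{2}{t+1}\Big)\big\|w_t^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 + \frac{12\, m\|p\|^2\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big)}{\mu_F^2(t+1)^2}. \tag{43}$$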

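We then proceed by induction. For the base case, we invoke the strong convexity of $\sum_i p_i F_i$ and Lemma 18 to get

$$\frac{\mu_F^2}{4}\big\|w_0^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2 \le \Big\|\sum_{i\in[m]} p_i \nabla F_i\big(w_0^{(\mathrm{global})}, S_i\big)\Big\|^2 \le \beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2.$$

Along with the fact that $1 = \big(\sum_{i\in[m]} p_i\big)^2 \le m\|p\|^2$, we conclude that (39) is true for $t=0$. Now assume (39) holds for any $0 \le t \le \tau$. For $t = \tau+1$, using (43) and the inductive hypothesis, we have

$$\begin{aligned}
\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\big\|w_{\tau+1}^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\big\|^2
&\le \Big(1 - \frac{2}{\tau+1}\Big)\frac{12\, m\|p\|^2\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big)}{(\tau+1)\mu_F^2} + \frac{12\, m\|p\|^2\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big)}{(\tau+1)^2\mu_F^2} \\
&= \Big(\frac{1}{\tau+1} - \frac{1}{(\tau+1)^2}\Big)\frac{12\, m\|p\|^2\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big)}{\mu_F^2} \\
&\le \frac{12\, m\|p\|^2\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big)}{(\tau+2)\mu_F^2},
\end{aligned}$$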

which is the desired result.

B.3. Auxiliary lemmas

Lemma 17 (Convexity and smoothness of $F_i$) Under Assumption A(b), each $F_i$ is $\lambda$-smooth and $\frac{\mu\lambda}{\mu+\lambda}$-strongly convex.

Proof The smoothness is a standard fact about the Moreau envelope. The strongly convex constant of Fi follows from Theorem 2.2 of Lemaréchal and Sagastizábal (1997). ■
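To make Lemma 17 concrete, here is a small numerical sketch for a quadratic loss $L(w) = \frac12 w^\top A w$ whose eigenvalues lie in $[\mu, \beta]$. For this special case the Moreau envelope $F(w) = \min_v \{L(v) + \frac{\lambda}{2}\|w - v\|^2\}$ is again quadratic, so its curvature can be read off directly. The quadratic model, the matrix `A`, and all constants are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu, beta, lam = 4, 0.5, 3.0, 2.0

# Random symmetric A with eigenvalues in [mu, beta]: L(w) = 0.5 * w^T A w.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(np.linspace(mu, beta, d)) @ Q.T

def prox(w):
    # argmin_v 0.5 v^T A v + (lam/2)||w - v||^2  =  lam (A + lam I)^{-1} w
    return lam * np.linalg.solve(A + lam * np.eye(d), w)

# Hessian of the Moreau envelope of this quadratic: lam * A (A + lam I)^{-1}.
H_F = lam * A @ np.linalg.inv(A + lam * np.eye(d))
H_F = 0.5 * (H_F + H_F.T)                 # symmetrize against round-off
env_eigs = np.linalg.eigvalsh(H_F)

# Gradient identity used in Lemma 18: grad F(w) = lam * (w - Prox(w)).
w = rng.standard_normal(d)
grad_F = H_F @ w

print(env_eigs.min(), mu * lam / (mu + lam))   # smallest curvature vs. strong convexity constant
print(env_eigs.max(), lam)                     # largest curvature vs. smoothness constant
```

In this example the envelope's eigenvalues are $\lambda a/(a+\lambda)$ for each eigenvalue $a$ of $A$, which is why they fall between $\mu\lambda/(\mu+\lambda)$ and $\lambda$.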

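Lemma 18 (A priori gradient norm bound) Under Assumption A(a, b), for any $w \in \mathcal{W}$ and $i \in [m]$, we have

$$\big\|\nabla F_i(w, S_i)\big\|^2 \le \beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2.$$

Proof Since $\nabla F_i(w, S_i) = \lambda\big(w - \mathrm{Prox}_{L_i/\lambda}(w)\big)$, its norm is trivially bounded by $\lambda D$. Now, since $\mathrm{Prox}_{L_i/\lambda}(w)$ achieves a lower objective value than $w$ for the objective function $L_i(\cdot, S_i) + \frac{\lambda}{2}\|w - \cdot\|^2$, we have

$$\frac{\lambda}{2}\big\|w - \mathrm{Prox}_{L_i/\lambda}(w)\big\|^2 \le L_i(w, S_i) - L_i\big(\mathrm{Prox}_{L_i/\lambda}(w), S_i\big),$$

and hence $\|\nabla F_i(w, S_i)\|^2 \le 2\lambda\|\ell\|_\infty$. Finally, by the first-order condition, we have

$$\nabla L_i\big(\mathrm{Prox}_{L_i/\lambda}(w), S_i\big) + \lambda\big(\mathrm{Prox}_{L_i/\lambda}(w) - w\big) = 0.$$

Hence, we get $\|\nabla F_i(w, S_i)\| = \big\|\nabla L_i\big(\mathrm{Prox}_{L_i/\lambda}(w), S_i\big)\big\| \le \beta D$. ■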

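Lemma 19 (Variance of minibatch sampling) Let $\mathcal{B} \subseteq [n]$ be a randomly sampled batch with batch size $B$ and let $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$ be an arbitrary set of vectors. Then

$$\mathbb{E}\Big\|\frac{1}{B}\sum_{i\in\mathcal{B}} x_i\Big\|^2 = \frac{n/B - 1}{n(n-1)}\sum_{i\in[n]}\|x_i - \bar x\|^2 + \|\bar x\|^2 \le \frac{1}{n}\sum_{i\in[n]}\|x_i - \bar x\|^2 + \|\bar x\|^2,$$

where $\bar x := \sum_{i\in[n]} x_i/n$.

Proof Since $\mathbb{E}\big[\sum_{i\in\mathcal{B}} x_i/B\big] = \bar x$, we have

$$\begin{aligned}
\mathbb{E}\Big\|\frac{1}{B}\sum_{i\in\mathcal{B}} x_i\Big\|^2
&= \mathbb{E}\Big\|\frac{1}{B}\sum_{i\in\mathcal{B}}(x_i - \bar x)\Big\|^2 + \|\bar x\|^2 \\
&= \frac{1}{B^2}\,\mathbb{E}\Big[\sum_{i\in[n]} \mathbf{1}\{i\in\mathcal{B}\}\|x_i - \bar x\|^2 + 2\sum_{i<j}\mathbf{1}\{i,j\in\mathcal{B}\}\langle x_i - \bar x, x_j - \bar x\rangle\Big] + \|\bar x\|^2 \\
&= \frac{1}{B^2}\Big[\frac{B}{n}\sum_{i\in[n]}\|x_i - \bar x\|^2 + \frac{2B(B-1)}{n(n-1)}\sum_{i<j}\langle x_i - \bar x, x_j - \bar x\rangle\Big] + \|\bar x\|^2,
\end{aligned}$$

where the last line is by $\mathbb{P}(i\in\mathcal{B}) = B/n$ and $\mathbb{P}(i,j\in\mathcal{B}) = B(B-1)n^{-1}(n-1)^{-1}$ for any $i \ne j$. Now, since $\sum_{i\in[n]}\|x_i - \bar x\|^2 + 2\sum_{i<j}\langle x_i - \bar x, x_j - \bar x\rangle = 0$, we arrive at

$$\mathbb{E}\Big\|\frac{1}{B}\sum_{i\in\mathcal{B}} x_i\Big\|^2 = \frac{1}{B^2}\Big(\frac{B}{n} - \frac{B(B-1)}{n(n-1)}\Big)\sum_{i\in[n]}\|x_i - \bar x\|^2 + \|\bar x\|^2 = \frac{n/B - 1}{n(n-1)}\sum_{i\in[n]}\|x_i - \bar x\|^2 + \|\bar x\|^2,$$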

which is the desired result. ■
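The identity in Lemma 19 can be checked exactly on a small example by enumerating every batch of size $B$ drawn without replacement. The sketch below does this; the sizes `n`, `B`, `d` and the random vectors are arbitrary choices made for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, B, d = 6, 3, 2
x = rng.standard_normal((n, d))
x_bar = x.mean(axis=0)

# Left-hand side: exact expectation over all size-B batches (uniform, without replacement).
batches = list(itertools.combinations(range(n), B))
lhs = np.mean([np.sum(x[list(b)].mean(axis=0) ** 2) for b in batches])

# Right-hand side of the identity, and the looser upper bound stated in the lemma.
spread = np.sum((x - x_bar) ** 2)
rhs = (n / B - 1) / (n * (n - 1)) * spread + np.sum(x_bar ** 2)
upper = spread / n + np.sum(x_bar ** 2)

print(lhs, rhs, upper)
```

The first two printed values should agree up to floating-point error, and the third should be at least as large.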

C. Proofs of Upper Bounds

C.1. Proof of Theorem 6

In this proof, we let $\hat w^{(\mathrm{global})}$ be the global minimizer of (8), and we write $w_{\mathrm{avg}}^{(\mathrm{global},p)} \equiv w_p^{(\mathrm{global})}$ when there is no ambiguity.

Proof [Proof of (9)] We have

0=- i[m]piLiw^(global )(S),Si+i[m]piLiw^(global )(S),Si- i[m]piLiw^(global )(S),Si+i[m]piLiw(1),Si=- i[m]pinijniw^(global )S(i,j),zj(i)-w(1),zj(i)+i[m]pinijniw^(global )S(i,j),zj(i)-w^(global )(S),zj(i),

where $S^{(i,j)}$ stands for the dataset formed by replacing $z_j^{(i)}$ by another $z_{i,j} \sim \mathcal{D}_i$, which is independent of everything else. Taking expectations on both sides, we get

0-i[m]piES,Zi~𝒟iw^(global )(S),Zi-w(1),Zi+i[m]pinijniES,zi,jw^(global )S(i,j),zj(i)-w^(global )(S),zj(i)=- i[m]piES,Zi~𝒟iw^(global )(S),Zi-w(i),Zi-i[m]piEZi~𝒟iw(i),Zi-w(1),Zi+i[m]pinijniES,zi,jw^(global )S(i,j),zj(i)-w^(global )(S),zj(i).

Noting that $w_\star^{(i)}$ is the argmin of $\mathbb{E}_{Z_i\sim\mathcal{D}_i}\ell(\cdot, Z_i)$ and invoking the $\beta$-smoothness assumption, we get

ESAERpw^(global )βi[m]piw(1)-w(i)2+i[m]pinijniES,zi,jw^(global )S(i,j),zj(i)-w^(global )(S),zj(i)2βw(1)-wp(global )2+2βi[m]piw(i)-wp(global )+i[m]pinijniES,zi,jw^(global )S(i,j),zj(i)-w^(global )(S),zj(i).

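Taking a weighted average, we arrive at

$$\mathbb{E}_S\big[\mathrm{AER}_p\big(\hat w^{(\mathrm{global})}\big)\big] \le 4\beta R^2 + \sum_{i\in[m]}\frac{p_i}{n_i}\sum_{j\in[n_i]}\mathbb{E}_{S, z_{i,j}}\Big[\ell\big(\hat w^{(\mathrm{global})}(S^{(i,j)}), z_j^{(i)}\big) - \ell\big(\hat w^{(\mathrm{global})}(S), z_j^{(i)}\big)\Big]. \tag{44}$$

To bound the second term on the right-hand side above, we bound the federated stability of $\hat w^{(\mathrm{global})}$. Without loss of generality we consider the first client. By the $\mu$-strong convexity of $L_1$, for any $j_1 \in [n_1]$ we have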

μ2w^(global) (S)-w^(global) S1,j12i[m]piLiw^(global) S1,j1,Si-Liw^(global) (S),Si=i1piLiw^(global )S1,j1,Si+p1L1w^(global) S1,j1,S1j1-i1piLiw^(global) (S),Si+p1L1w^(global) (S),S1j1+p1L1w^(global )S1,j1,S1-L1w^(global) S1,j1,S1j1+p1L1w^(global) (S),S1j1-L1w^(global) (S),S1p1L1w^(global)S1,j1,S1-L1w^(global)S1,j1,S1j1+p1L1w^(global)(S),S1j1-L1w^(global)(S),S1=p1n1w^(global)S1,j1,zj1(1)-w^(global)(S),zj1(1)+p1n1w^(global)(S),z1,j1-w^(global)S1,j1,z1,j1, (45)

where the second inequality is because w^(global)S1,j1 minimizes L1,S1j1+i1niLi,Si. By an identical argument as in the proof of Lemma 26, we have

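$$\Big|\ell\big(\hat w^{(\mathrm{global})}(S^{(1,j_1)}), z_{j_1}^{(1)}\big) - \ell\big(\hat w^{(\mathrm{global})}(S), z_{j_1}^{(1)}\big)\Big| \le \sqrt{2\beta\|\ell\|_\infty}\,\big\|\hat w^{(\mathrm{global})}(S) - \hat w^{(\mathrm{global})}(S^{(1,j_1)})\big\| + \frac{\beta}{2}\big\|\hat w^{(\mathrm{global})}(S) - \hat w^{(\mathrm{global})}(S^{(1,j_1)})\big\|^2. \tag{46}$$

The same bound also holds for $\big|\ell\big(\hat w^{(\mathrm{global})}(S), z_{1,j_1}\big) - \ell\big(\hat w^{(\mathrm{global})}(S^{(1,j_1)}), z_{1,j_1}\big)\big|$. Plugging these two bounds into (45) and rearranging terms, we get

$$\Big(\frac{\mu}{2} - \frac{\beta p_1}{n_1}\Big)\big\|\hat w^{(\mathrm{global})}(S) - \hat w^{(\mathrm{global})}(S^{(1,j_1)})\big\| \le \frac{2\sqrt{2\beta\|\ell\|_\infty}\,p_1}{n_1}.$$

Since $n_1 \ge 4\beta p_1/\mu$, we in fact have

$$\frac{\mu}{4}\big\|\hat w^{(\mathrm{global})}(S) - \hat w^{(\mathrm{global})}(S^{(1,j_1)})\big\| \le \frac{2\sqrt{2\beta\|\ell\|_\infty}\,p_1}{n_1}.$$

Plugging the above display back into (46), we arrive at

$$\Big|\ell\big(\hat w^{(\mathrm{global})}(S^{(1,j_1)}), z_{j_1}^{(1)}\big) - \ell\big(\hat w^{(\mathrm{global})}(S), z_{j_1}^{(1)}\big)\Big| \le \frac{16\beta\|\ell\|_\infty p_1}{\mu n_1}\Big(1 + \frac{4\beta p_1}{\mu n_1}\Big) \le \frac{32\beta\|\ell\|_\infty p_1}{\mu n_1},$$

where the last inequality is again by $n_1 \ge 4\beta p_1/\mu$. The desired result follows by plugging the above inequality back into (44). ■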

Proof [Proof of (10)] Without loss of generality we consider the first client. Since $w_\star^{(1)}$ is the minimizer of $\mathbb{E}_{Z_1\sim\mathcal{D}_1}\ell(\cdot, Z_1)$, by $\beta$-smoothness we have

EZ1~𝒟1w^(global),Z1-w(1),Z1 βEZ1~𝒟1w^global-w12βEZ1~𝒟1w^global-wpglobal2+βR2, (47)

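where the last inequality is by Part (b) of Assumption B. By optimality of $\hat w^{(\mathrm{global})}$ and the strong convexity of the $L_i$'s, we have

$$\Big\langle\sum_{i\in[m]} p_i\nabla L_i\big(w_p^{(\mathrm{global})}, S_i\big),\ \hat w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\Big\rangle + \frac{\mu}{2}\big\|\hat w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|^2 \le 0.$$

If $\|\hat w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\| = 0$ then we are done. Otherwise, the above display gives

$$\begin{aligned}
\big\|\hat w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|
&\le \frac{2}{\mu}\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_p^{(\mathrm{global})}, S_i\big)\Big\| \\
&\le \frac{2}{\mu}\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\| + \frac{2}{\mu}\Big\|\sum_{i\in[m]} p_i\Big(\nabla L_i\big(w_p^{(\mathrm{global})}, S_i\big) - \nabla L_i\big(w_\star^{(i)}, S_i\big)\Big)\Big\| \\
&\le \frac{2}{\mu}\Big(\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\| + \beta\sum_{i\in[m]} p_i\big\|w_p^{(\mathrm{global})} - w_\star^{(i)}\big\|\Big) \\
&\le \frac{2}{\mu}\Big(\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\| + \beta R\Big).
\end{aligned}$$

Thus, we get

$$\big\|\hat w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|^2 \le \frac{8}{\mu^2}\Big(\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\|^2 + \beta^2R^2\Big).$$

Taking expectation with respect to the sample $S$ on both sides, we have

$$\mathbb{E}_S\big\|\hat w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|^2 \lesssim \frac{1}{\mu^2}\,\mathbb{E}_S\Big\|\sum_{i\in[m]} p_i\Big(\nabla L_i\big(w_\star^{(i)}, S_i\big) - \mathbb{E}_S\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big)\Big\|^2 + \frac{\beta^2R^2}{\mu^2} \le \frac{1}{\mu^2}\sum_{i\in[m]}\frac{p_i^2\sigma^2}{n_i} + \frac{\beta^2R^2}{\mu^2}.$$

Plugging the above inequality into (47) gives the desired result. ■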

C.2. Proof of Proposition 10

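Proof [Proof of (21)] By the definitions of the AER and OPT, we have

$$\begin{aligned}
\mathrm{AER}_p = \mathrm{OPT}
&+ \frac{\lambda}{2}\sum_{i\in[m]} p_i\Big(\big\|\tilde w^{(\mathrm{global})}(S) - \tilde w^{(i)}(S)\big\|^2 - \big\|\hat w^{(\mathrm{global})}(S) - \hat w^{(i)}(S)\big\|^2\Big) \\
&+ \sum_{i\in[m]} p_i\Big(\mathbb{E}_{Z_i\sim\mathcal{D}_i}\ell\big(\hat w^{(i)}(S), Z_i\big) - L_i\big(\hat w^{(i)}(S), S_i\big)\Big) \\
&+ \sum_{i\in[m]} p_i\Big(L_i\big(\tilde w^{(i)}(S), S_i\big) - \mathbb{E}_{Z_i\sim\mathcal{D}_i}\ell\big(w_\star^{(i)}, Z_i\big)\Big).
\end{aligned}$$

By the basic inequality (23), we can bound the AER by

$$\mathrm{AER}_p \le \mathrm{OPT} + \frac{\lambda}{2}\sum_{i\in[m]} p_i\big\|w_p^{(\mathrm{global})} - w_\star^{(i)}\big\|^2 + \sum_{i\in[m]} p_i\Big(\mathbb{E}_{Z_i\sim\mathcal{D}_i}\ell\big(\hat w^{(i)}(S), Z_i\big) - L_i\big(\hat w^{(i)}(S), S_i\big)\Big) + \sum_{i\in[m]} p_i\Big(L_i\big(w_\star^{(i)}, S_i\big) - \mathbb{E}_{Z_i\sim\mathcal{D}_i}\ell\big(w_\star^{(i)}, Z_i\big)\Big).$$

Now, invoking federated stability, we can further bound the AER by

$$\begin{aligned}
\mathrm{AER}_p \le \mathrm{OPT}
&+ \frac{\lambda}{2}\sum_{i\in[m]} p_i\big\|w_p^{(\mathrm{global})} - w_\star^{(i)}\big\|^2 + 2\sum_{i\in[m]} p_i\gamma_i \\
&+ \sum_{i\in[m]} p_i\,\frac{1}{n_i}\sum_{j\in[n_i]}\mathbb{E}_{z_{i,j}\sim\mathcal{D}_i}\Big[\mathbb{E}_{Z_i\sim\mathcal{D}_i}\ell\big(\hat w^{(i)}(S^{(i,j)}), Z_i\big) - \ell\big(\hat w^{(i)}(S^{(i,j)}), z_j^{(i)}\big)\Big] \\
&+ \sum_{i\in[m]} p_i\Big(L_i\big(w_\star^{(i)}, S_i\big) - \mathbb{E}_{Z_i\sim\mathcal{D}_i}\ell\big(w_\star^{(i)}, Z_i\big)\Big),
\end{aligned}$$

where $S^{(i,j)}$ is the dataset formed by replacing $z_j^{(i)}$ with a new sample $z_{i,j}$, and here we are choosing $z_{i,j}$ to be an independent sample from $\mathcal{D}_i$. Note that the last two terms of the above display have mean zero under the randomness of the algorithm $\mathcal{A}$, the dataset $S$, and $\{z_{i,j} : i\in[m], j\in[n_i]\}$. Thus, the desired result follows by taking expectations on both sides. ■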

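Proof [Proof of (22)] Without loss of generality we consider the first client. By the definitions of $\mathrm{IER}_1$ and OPT, we have

$$\begin{aligned}
p_1\mathrm{IER}_1 = \mathrm{OPT}
&+ \sum_{i\in[m]} p_i\Big(L_i\big(\tilde w^{(i)}(S), S_i\big) + \frac{\lambda}{2}\big\|\tilde w^{(\mathrm{global})}(S) - \tilde w^{(i)}(S)\big\|^2\Big) \\
&- \sum_{i\in[m]} p_i\Big(L_i\big(\hat w^{(i)}(S), S_i\big) + \frac{\lambda}{2}\big\|\hat w^{(\mathrm{global})}(S) - \hat w^{(i)}(S)\big\|^2\Big) \\
&+ p_1\,\mathbb{E}_{Z_1\sim\mathcal{D}_1}\Big[\ell\big(\hat w^{(1)}(S), Z_1\big) - \ell\big(w_\star^{(1)}, Z_1\big)\Big].
\end{aligned}$$

Invoking the basic inequality (24), with some algebra, we arrive at

$$p_1\mathrm{IER}_1 \le \mathrm{OPT} + \frac{p_1\lambda}{2}\big\|\hat w^{(\mathrm{global})}(S) - w_\star^{(1)}\big\|^2 + p_1\Big(\mathbb{E}_{Z_1\sim\mathcal{D}_1}\ell\big(\hat w^{(1)}(S), Z_1\big) - L_1\big(\hat w^{(1)}(S), S_1\big)\Big) + p_1\Big(L_1\big(w_\star^{(1)}, S_1\big) - \mathbb{E}_{Z_1\sim\mathcal{D}_1}\ell\big(w_\star^{(1)}, Z_1\big)\Big).$$

Now, invoking federated stability for the first client, we can bound its IER by

$$\begin{aligned}
p_1\mathrm{IER}_1 \le \mathrm{OPT}
&+ \frac{p_1\lambda}{2}\big\|\hat w^{(\mathrm{global})}(S) - w_\star^{(1)}\big\|^2 + 2p_1\gamma_1 \\
&+ \frac{p_1}{n_1}\sum_{j\in[n_1]}\mathbb{E}_{z_{1,j}\sim\mathcal{D}_1}\Big[\mathbb{E}_{Z_1\sim\mathcal{D}_1}\ell\big(\hat w^{(1)}(S^{(1,j)}), Z_1\big) - \ell\big(\hat w^{(1)}(S^{(1,j)}), z_j^{(1)}\big)\Big] \\
&+ p_1\Big(L_1\big(w_\star^{(1)}, S_1\big) - \mathbb{E}_{Z_1\sim\mathcal{D}_1}\ell\big(w_\star^{(1)}, Z_1\big)\Big),
\end{aligned}$$

where we recall that $S^{(1,j)}$ is the dataset formed by replacing $z_j^{(1)}$ with a new sample $z_{1,j}$, and here we are choosing $z_{1,j}$ to be an independent sample from $\mathcal{D}_1$. We finish the proof by taking the expectation with respect to $\mathcal{A}$, $S$, and $\{z_{1,j} : j\in[n_1]\}$ on both sides. ■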

C.3. Proof of Theorem 11

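In this proof, we let $\mathcal{A} = \big(\hat w^{(\mathrm{global})}, \{\hat w^{(i)}\}\big)$ be a generic algorithm that tries to minimize (20). For notational simplicity, we use $a_n \lesssim_\beta b_n$ (resp. $a_n \gtrsim_\beta b_n$) to denote that $a_n \le C_\beta b_n$ (resp. $a_n \ge C_\beta b_n$) for large $n$, where $C_\beta$ has explicit dependence on a parameter $\beta$.

Recall that $\big(\tilde w^{(\mathrm{global})}, \{\tilde w^{(i)}\}\big)$ is the global minimizer of (20), and recall the notations in (35)–(37). We start by bounding the federated stability of approximate minimizers of (20). We need the following definition.

Definition 20 (Approximate minimizers) We say an algorithm $\mathcal{A} = \big(\hat w^{(\mathrm{global})}, \{\hat w^{(i)}\}_1^m\big)$ produces an $\big(\varepsilon^{(\mathrm{global})}, \{\varepsilon^{(i)}\}_1^m\big)$-minimizer of the objective function (20) on the dataset $S$ if the following two conditions hold:

  1. there exists a positive constant $\varepsilon^{(\mathrm{global})}$ such that $\|\hat w^{(\mathrm{global})} - \tilde w^{(\mathrm{global})}\| \le \varepsilon^{(\mathrm{global})}$;

  2. for any $i\in[m]$, there exists a positive constant $\varepsilon^{(i)}$ such that $\big\|\hat w^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(\hat w^{(\mathrm{global})}\big)\big\| \le \varepsilon^{(i)}$.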

The stability bound is as follows.

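Proposition 21 (Federated stability of approximate minimizers) Let Assumption A(b) hold, and consider an algorithm $\mathcal{A} = \big(\hat w^{(\mathrm{global})}, \{\hat w^{(i)}\}_1^m\big)$ that produces an $\big(\varepsilon^{(\mathrm{global})}, \{\varepsilon^{(i)}\}_1^m\big)$-minimizer of the objective function (20) on the dataset $S$. Assume in addition that

$$n_i \ge \frac{4\beta}{\mu}, \qquad p_i\lambda \le \frac{\mu}{16} \qquad \forall\, i\in[m]. \tag{48}$$

Then $\mathcal{A}$ has federated stability

$$\gamma_i \le \frac{160\beta\|\ell\|_\infty}{n_i(\mu+\lambda)} + \mathrm{Err}_i,$$

where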

Erri22β4ε(global) β+λμ+λ+3λμ+ε(i)β+λμ+λ+16piλμ +8β216ε(global) 2β+λμ+λ+3λμ2+ε(i)2β+λμ+λ+16piλμ2

is the error term due to not exactly minimizing the soft weight sharing objective (20).

Proof See Appendix C.3.1. ■

Taking the optimization error into account, we have the following result.

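Proposition 22 (Federated stability of $\mathcal{A}_{\mathrm{FP}}$) Let Assumption A(a, b) and (48) hold. Run $\mathcal{A}_{\mathrm{FP}}$ with hyperparameters chosen as in Lemmas 14 and 15. Then, as long as

$$T \ge C_1\lambda^2(\lambda\vee1)^2\, m\|p\|^2 n_i^2, \qquad K_T \ge C_2\lambda^2(\lambda\vee1)^2 p_i^2 n_i^2 \qquad \forall\, i\in[m], \tag{49}$$

the algorithm $\mathcal{A}_{\mathrm{FP}}$ has expected federated stability

$$\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}[\gamma_i] \le \frac{C\beta\|\ell\|_\infty}{n_i(\mu+\lambda)},$$

where $C_1$, $C_2$ are two constants depending only on $\mu$, $\beta$, $\|\ell\|_\infty$, $D$, and $C$ is an absolute constant.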

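Proof By Proposition 21, it suffices to upper bound the error term $\mathrm{Err}_i$ by a constant multiple of $\frac{\beta\|\ell\|_\infty}{n_i(\mu+\lambda)}$. Invoking Lemma 15, we have

$$\big(\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\varepsilon^{(\mathrm{global})}\big)^2 \le \mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\big(\varepsilon^{(\mathrm{global})}\big)^2 \le \frac{12(\lambda+\mu)^2\, m\|p\|^2\big(\beta^2D^2 \wedge 2\lambda\|\ell\|_\infty \wedge \lambda^2D^2\big)}{\lambda^2\mu^2(T+1)}.$$

This gives

$$\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\big(\varepsilon^{(\mathrm{global})}\big)^2 \lesssim_{\mu,\beta,\|\ell\|_\infty,D} (\lambda\vee1)^2\, m\|p\|^2\,\frac{1\wedge\lambda\wedge\lambda^2}{\lambda^2(T+1)} \le \frac{m\|p\|^2}{T+1}, \tag{50}$$

where we recall that $a_n \lesssim_{\mu,\beta,\|\ell\|_\infty,D} b_n$ means $a_n \le C b_n$ for a constant $C$ that depends only on $\mu, \beta, \|\ell\|_\infty, D$, and the last inequality follows from $(\lambda\vee1)^2(1\wedge\lambda\wedge\lambda^2) \le \lambda^2$ regardless of whether $\lambda \ge 1$ or $\lambda \le 1$. Meanwhile, by Lemma 14, we have

$$\big(\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\varepsilon^{(i)}\big)^2 \le \mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\big(\varepsilon^{(i)}\big)^2 \le \frac{8\beta^2D^2}{\mu^2(K_T+1)} \lesssim_{\beta,D} \frac{1}{K_T+1}.$$

Recalling the definition of $\mathrm{Err}_i$, we have

$$\begin{aligned}
\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}[\mathrm{Err}_i]
&\lesssim_{\mu,\beta,\|\ell\|_\infty,D} \lambda\,\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\varepsilon^{(\mathrm{global})} + \lambda p_i\,\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\varepsilon^{(i)} + \lambda^2\,\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\big(\varepsilon^{(\mathrm{global})}\big)^2 + p_i^2\lambda^2\,\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}\big(\varepsilon^{(i)}\big)^2 \\
&\lesssim_{\mu,\beta,\|\ell\|_\infty,D} \frac{\lambda\sqrt{m}\,\|p\|}{\sqrt{T+1}} + \frac{\lambda p_i}{\sqrt{K_T+1}} + \frac{\lambda^2 m\|p\|^2}{T+1} + \frac{p_i^2\lambda^2}{K_T+1}.
\end{aligned}$$

Thus, it suffices to require

$$\sqrt{T} \gtrsim_{\mu,\beta,\|\ell\|_\infty,D} \lambda\sqrt{m}\,\|p\|\,n_i(\mu+\lambda), \quad T \gtrsim_{\mu,\beta,\|\ell\|_\infty,D} \lambda^2 m\|p\|^2 n_i(\mu+\lambda), \quad \sqrt{K_T} \gtrsim_{\mu,\beta,\|\ell\|_\infty,D} \lambda p_i n_i(\mu+\lambda), \quad K_T \gtrsim_{\mu,\beta,\|\ell\|_\infty,D} p_i^2\lambda^2 n_i(\mu+\lambda),$$

which is equivalent to

$$T \gtrsim_{\mu,\beta,\|\ell\|_\infty,D} \max\big\{\lambda^2 m\|p\|^2 n_i^2(\lambda\vee1)^2,\ \lambda^2 m\|p\|^2 n_i(\lambda\vee1)\big\} = \lambda^2 m\|p\|^2 n_i^2(\lambda\vee1)^2,$$
$$K_T \gtrsim_{\mu,\beta,\|\ell\|_\infty,D} \max\big\{\lambda^2 p_i^2 n_i^2(\lambda\vee1)^2,\ p_i^2\lambda^2 n_i(\lambda\vee1)\big\} = \lambda^2 p_i^2 n_i^2(\lambda\vee1)^2,$$

which is exactly (49). ■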

Combining the above proposition with Proposition 10, we get the following result.

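Proposition 23 ($\lambda$-dependent bound on the AER) Let Assumption A(a, b) and (48) hold. Run $\mathcal{A}_{\mathrm{FP}}$ with hyperparameters chosen as in Lemmas 14 and 15. Then, as long as

$$T \ge C_1\lambda(\lambda\vee1)\, m\|p\|^2\Big[\Big(\sum_{s\in[m]} p_s/n_s\Big)^{-1} \vee \lambda(\lambda\vee1)n_i^2\Big], \qquad K_T \ge C_2(\lambda+1)^2\Big[\Big(\sum_{s\in[m]} p_s/n_s\Big)^{-1} \vee \lambda^2p_i^2n_i^2\Big], \tag{51}$$

for any $i\in[m]$, the algorithm $\mathcal{A}_{\mathrm{FP}}$ satisfies

$$\mathbb{E}_{\mathcal{A}_{\mathrm{FP}},S}\big[\mathrm{AER}_p(\mathcal{A}_{\mathrm{FP}})\big] \le \frac{C\beta\|\ell\|_\infty}{\mu+\lambda}\sum_{i\in[m]}\frac{p_i}{n_i} + \frac{\lambda}{2}\sum_{i\in[m]} p_i\big\|w_p^{(\mathrm{global})} - w_\star^{(i)}\big\|^2,$$

where $C_1$, $C_2$ are two constants depending only on $\mu$, $\beta$, $\|\ell\|_\infty$, $D$, and $C$ is an absolute constant.

Proof In view of Propositions 10 and 22, it suffices to set $T$, $K_T$ such that (1) (49) is satisfied; and (2) $\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}[\mathrm{OPT}]$ is upper bounded by a constant multiple of $\frac{\beta\|\ell\|_\infty}{\mu+\lambda}\sum_{i\in[m]}\frac{p_i}{n_i}$. To achieve the second goal, note that by Proposition 16, the optimization error is bounded by

$$\mathbb{E}_{\mathcal{A}_{\mathrm{FP}}}[\mathrm{OPT}] \lesssim_{\mu,\beta,\|\ell\|_\infty,D} \frac{\lambda\vee1}{K_T+1} + \frac{\lambda\, m\|p\|^2}{T+1}. \tag{52}$$

Thus, it suffices to require $T \gtrsim_{\mu,\beta,\|\ell\|_\infty,D} \frac{\lambda(\lambda\vee1)\, m\|p\|^2}{\sum_{i\in[m]} p_i/n_i}$ and $K_T \gtrsim_{\mu,\beta,\|\ell\|_\infty,D} \frac{(\lambda\vee1)^2}{\sum_{i\in[m]} p_i/n_i}$. This requirement, combined with (49), is exactly (51). ■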

With the above proposition at hand, we are ready to give our proof of Theorem 11.

Proof [Proof of Theorem 11] We first define the following three events:

ARi[m]pini,Bi[m]pi2/nii[m]pi/niRi[m]pini,CRi[m]pi2/nii[m]pi/ni.

We then choose λ to be

λ=μ16R2i[m]pini1A+μ16CpRi[m]pini1B+μ16Cpi[m]pi2/nii[m]pini1C.

We now consider the three events separately.

  1. If A holds, then piλ=piμ16R2i[m]piμipiμ16μ16. Thus we can invoke Proposition 23 to get
    E𝒜FP,SARpCβμ+μ32impiniright-hand side of (28).
  2. If B holds, then piλ=piμ16CpRi[m]pinipmaxμN16Cpi[m]pi2/nii[m]piniμ16, where the last inequality is by the definition of Cp. Hence, by Proposition 23, we have
    E𝒜FP,SAERp16CCpβμ+μ32CpRi[m]piniright-hand side of (28).
  3. If C holds, then piλ=piμ16Cpi[m]pi2/nii[m]piniμ16, and thus Proposition 23 gives
    E𝒜FP,SAERp16CCpβμ+μ32Cpi[m]pi2niright-hand side of (28).

The desired result follows by combining the above three cases together. ■

C.3.1. Proof of Proposition 21: Stability of Approximate Minimizers

We first present two lemmas, from which Proposition 21 will follow.

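Lemma 24 (Federated stability of approximate minimizers, Part I) Let Assumption A(b) hold, and consider an algorithm $\mathcal{A} = \big(\hat w^{(\mathrm{global})}, \{\hat w^{(i)}\}_1^m\big)$ that satisfies the following conditions:

  1. there exist positive constants $\delta^{(\mathrm{global})}$, $\zeta^{(\mathrm{global})}$ such that
    $$\sum_{i\in[m]} p_i F_i\big(\hat w^{(\mathrm{global})}, S_i\big) \le \delta^{(\mathrm{global})} + \sum_{i\in[m]} p_i F_i\big(\tilde w^{(\mathrm{global})}, S_i\big), \tag{53}$$
    $$\Big\|\sum_{i\in[m]} p_i\nabla F_i\big(\hat w^{(\mathrm{global})}, S_i\big)\Big\| \le \zeta^{(\mathrm{global})}. \tag{54}$$
  2. for any $i\in[m]$, there exist positive constants $\{\delta^{(i)}, \zeta^{(i)}, \varepsilon^{(i)}\}_{i=1}^m$ such that
    $$L_i\big(\hat w^{(i)}, S_i\big) + \frac{\lambda}{2}\big\|\hat w^{(\mathrm{global})} - \hat w^{(i)}\big\|^2 \le \delta^{(i)} + F_i\big(\hat w^{(\mathrm{global})}, S_i\big), \tag{55}$$
    $$\big\|\nabla L_i\big(\hat w^{(i)}, S_i\big) + \lambda\big(\hat w^{(i)} - \hat w^{(\mathrm{global})}\big)\big\| \le \zeta^{(i)}, \tag{56}$$
    $$\big\|\hat w^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(\hat w^{(\mathrm{global})}\big)\big\| \le \varepsilon^{(i)}. \tag{57}$$

Assume in addition that (48) holds. Then $\mathcal{A}$ has federated stability

$$\gamma_i \le \frac{160\beta\|\ell\|_\infty}{n_i(\mu+\lambda)} + \sqrt{2\beta\|\ell\|_\infty}\,\mathcal{E}_{\lambda,i} + \beta\,\mathcal{E}_{\lambda,i}^2, \tag{58}$$

where

$$\mathcal{E}_{\lambda,i} := \frac{8\zeta^{(i)}}{\mu+\lambda} + \sqrt{\frac{8\delta^{(i)}}{\mu+\lambda}} + 8\mu^{-1}\Big(2\zeta^{(\mathrm{global})} + 4p_i\lambda\varepsilon^{(i)} + \sqrt{\frac{2\mu\lambda\,\delta^{(\mathrm{global})}}{\mu+\lambda}}\Big) \tag{59}$$

is the error term due to not exactly minimizing (20).

Lemma 25 (Federated stability of approximate minimizers, Part II) Let Assumption A(b) hold and consider an algorithm $\mathcal{A} = \big(\hat w^{(\mathrm{global})}, \{\hat w^{(i)}\}_1^m\big)$ that produces an $\big(\varepsilon^{(\mathrm{global})}, \{\varepsilon^{(i)}\}_1^m\big)$-minimizer in the sense of Definition 20. Then $\mathcal{A}$ also satisfies Equations (53)–(57) with

$$\delta^{(\mathrm{global})} = \frac{\lambda}{2}\big(\varepsilon^{(\mathrm{global})}\big)^2, \quad \zeta^{(\mathrm{global})} = \lambda\varepsilon^{(\mathrm{global})}, \quad \delta^{(i)} = \frac{\beta+\lambda}{2}\big(\varepsilon^{(i)}\big)^2, \quad \zeta^{(i)} = (\beta+\lambda)\varepsilon^{(i)}.$$

Proof These correspondences are consequences of the $\lambda$-smoothness of $F_i$ and the $(\beta+\lambda)$-smoothness of $L_i(\cdot, S_i) + \frac{\lambda}{2}\|\hat w^{(\mathrm{global})} - \cdot\|^2$. We omit the details. ■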

With the above two lemmas at hand, the proof of Proposition 21 is purely computational:

Proof [Proof of Proposition 21 given Lemmas 24 and 25] Invoking Lemma 25, the error term $\mathcal{E}_{\lambda,i}$ defined in Equation (59) can be bounded above by

λ,i 8(β+λ)μ+λε(global )+4(β+λ)μ+λε(i)+8μ2λε(global )+4piλε(i)+μλ2μ+λ=8ε(global )β+λμ+λ+2λμ+λμ(μ+λ)+2ε(i)β+λμ+λ+16piλμ8ε(global )β+λμ+λ+3λμ+2ε(i)β+λμ+λ+16piλμ.

This gives

λ,i2128ε(global )2β+λμ+λ+3λμ2+8ε(i)2β+λμ+λ+16piλμ2.

Plugging the above two displays to (58) gives the desired result. ■

We now present our proof of Lemma 24. We start by stating and proving several useful lemmas.

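Lemma 26 (From loss stability to parameter stability) Let Assumption A(b) hold. Then the algorithm $\mathcal{A} = \big(\hat w^{(\mathrm{global})}, \{\hat w^{(i)}\}\big)$ has federated stability
$$\gamma_i \le \sqrt{2\beta\|\ell\|_\infty}\,\big\|\hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\big\| + \frac{\beta}{2}\big\|\hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\big\|^2.$$

Proof This lemma has implicitly appeared in the proofs of many stability-based generalization bounds (see, e.g., Section 13.3.2 of Shalev-Shwartz and Ben-David (2014)), and we provide a proof for completeness. By $\beta$-smoothness, for an arbitrary $z\in\mathcal{Z}$ we have

$$\begin{aligned}
\ell\big(\hat w^{(i)}(S), z\big) - \ell\big(\hat w^{(i)}(S^{(i,j_i)}), z\big)
&\le \Big\langle\nabla\ell\big(\hat w^{(i)}(S^{(i,j_i)}), z\big),\ \hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\Big\rangle + \frac{\beta}{2}\big\|\hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\big\|^2 \\
&\le \big\|\nabla\ell\big(\hat w^{(i)}(S^{(i,j_i)}), z\big)\big\|\,\big\|\hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\big\| + \frac{\beta}{2}\big\|\hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\big\|^2 \\
&\le \sqrt{2\beta\Big(\ell\big(\hat w^{(i)}(S^{(i,j_i)}), z\big) - \min_{w^{(i)}\in\mathcal{W}}\ell\big(w^{(i)}, z\big)\Big)}\,\big\|\hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\big\| + \frac{\beta}{2}\big\|\hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\big\|^2 \\
&\le \sqrt{2\beta\|\ell\|_\infty}\,\big\|\hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\big\| + \frac{\beta}{2}\big\|\hat w^{(i)}(S) - \hat w^{(i)}(S^{(i,j_i)})\big\|^2,
\end{aligned}$$

where the last inequality follows from the boundedness of $\ell$. By a nearly identical argument, the above upper bound also holds for $-\ell\big(\hat w^{(i)}(S), z\big) + \ell\big(\hat w^{(i)}(S^{(i,j_i)}), z\big)$, and the desired result follows. ■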
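The third inequality in the proof of Lemma 26 relies on the self-bounding property of smooth non-negative functions, $\|\nabla\ell(w,z)\|^2 \le 2\beta\,(\ell(w,z) - \inf\ell)$. The sketch below checks this numerically for the logistic loss, for which $\inf\ell = 0$ and $\beta = \|x\|^2/4$. The choice of the logistic loss, the unconstrained infimum (the proof takes the minimum over $\mathcal{W}$), and the sampling ranges are assumptions made for illustration.

```python
import numpy as np

def logistic_loss(w, x, y):
    # log(1 + exp(-y * <x, w>)), with label y in {-1, +1}
    return np.logaddexp(0.0, -y * (x @ w))

def logistic_grad(w, x, y):
    margin = y * (x @ w)
    return -y * x / (1.0 + np.exp(margin))

rng = np.random.default_rng(0)
d = 5
worst_ratio = 0.0
for _ in range(1000):
    x = rng.standard_normal(d)
    y = rng.choice([-1.0, 1.0])
    w = 3.0 * rng.standard_normal(d)
    beta = x @ x / 4.0                               # smoothness constant of w -> loss(w, x, y)
    lhs = np.sum(logistic_grad(w, x, y) ** 2)        # squared gradient norm
    rhs = 2.0 * beta * logistic_loss(w, x, y)        # 2 * beta * (loss - inf loss), inf loss = 0
    worst_ratio = max(worst_ratio, lhs / rhs)

print(worst_ratio)   # stays below 1 if the self-bounding inequality holds on all samples
```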

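Lemma 27 (Local stability implies global stability) Assume Assumption A(b) holds and consider an algorithm $\mathcal{A} = \big(\hat w^{(\mathrm{global})}, \{\hat w^{(i)}\}_1^m\big)$ that satisfies Equations (53), (54) and (57). Then for any $i\in[m]$, $j_i\in[n_i]$, we have

$$\big\|\hat w^{(\mathrm{global})}(S^{(i,j_i)}) - \hat w^{(\mathrm{global})}(S)\big\| \le \frac{\lambda+\mu}{\lambda\mu}\Big(2\zeta^{(\mathrm{global})} + \sqrt{\frac{2\lambda\mu\,\delta^{(\mathrm{global})}}{\lambda+\mu}} + 4p_i\lambda\varepsilon^{(i)} + 2p_i\lambda\big\|\hat w^{(i)}(S^{(i,j_i)}) - \hat w^{(i)}(S)\big\|\Big). \tag{60}$$

Proof Without loss of generality we consider the first client. Let $\mu_F$ be the strong convexity constant of $\sum_i p_i F_i$, which, by Lemma 17, is equal to $\sum_i p_i\frac{\mu\lambda}{\mu+\lambda} = \lambda\mu/(\lambda+\mu)$. Now, by strong convexity, we have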

μF2w^(global )(S)-w^(global) S1,j12i[m]piFiw^(global )S1,j1,Si-Fiw^(global) (S),Si+i[m]piFiw^(global) (S),Si,w^(global) S1,j1-w^(global) (S)(54)i[m]piFiw^(global) S1,j1,Si-Fiw^(global) (S),Si+ζ(global )w^(global )S1,j1-w^(global )(S)=p1F1w^(global) S1,j1,S1j1+i1piFiw^(global) S1,j1,Si-p1F1w^(global) S,S1j1+i1piFiw^(global) S,Si+p1F1w^(global) S1,j1,S1-F1w^(global) S1,j1,S1j1+F1w^(global) (S),S1j1-F1w^(global) (S),S1+ζ(global) w^(global) S1,j1-w^(global) (S)(53)δ(global )+ζ(global) )w^(global) S1,j1-w^(global) (S)+p1F1w^(global) S1,j1,S1-F1w^(global) (S),S1 +F1w^(global) (S),S1j1-F1w^(global) )S1,j1,S1j1.

Since F1 is λ-smooth by Lemma 17, we can proceed by

μF2w^(global) (S)-w^(global) S1,j12δ(global) +ζ(global) w^(global) S1,j1-w^(global) (S)+p1λw^(global) S1,j1-w^(global) (S)2+p1F1w^(global) (S),S1-F1w^(global) S1,j1,S1j1,w^(global) S1,j1-w^(global) (S).

Since F1w(global),S1=λw(global)-ProxL1/λw(global),S1, with some algebra, the right-hand side above is in fact equal to

δ(global) +ζ(global) w^(global) S1,j1-w^(global) (S)+p1λw^(1)(S)-ProxL1/λw^(global )(S),S1,w^(global )S1,j1-w^(global )(S)+p1λProxL1/λw^(global )S1,j1,S1j1-w^(1)S1,j1,w^(global )S1,j1-w^(global )(S)+p1λw^(1)S1,j1-w^(1)(S),w^(global) S1,j1-w^(global) (S) (57) δ(global) +ζ(global) +2p1λw^(global) S1,j1-w^(global) (S)+p1λw^(i)S1,j1-w^(i)(S)w^(global )S1,j1-w^(global) (S).

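The above bound gives a quadratic inequality: if we let $s_G := \big\|\hat w^{(\mathrm{global})}(S^{(1,j_1)}) - \hat w^{(\mathrm{global})}(S)\big\|$ and $s_1 := \big\|\hat w^{(1)}(S^{(1,j_1)}) - \hat w^{(1)}(S)\big\|$, then the above bound can be written as

$$\frac{\mu_F}{2}s_G^2 - \big(\zeta^{(\mathrm{global})} + 2p_1\lambda\varepsilon^{(1)} + p_1\lambda s_1\big)s_G - \delta^{(\mathrm{global})} \le 0.$$

Solving this inequality gives

$$\begin{aligned}
s_G &\le \frac{1}{\mu_F}\Big[\zeta^{(\mathrm{global})} + 2p_1\lambda\varepsilon^{(1)} + p_1\lambda s_1 + \sqrt{\big(\zeta^{(\mathrm{global})} + 2p_1\lambda\varepsilon^{(1)} + p_1\lambda s_1\big)^2 + 2\mu_F\delta^{(\mathrm{global})}}\Big] \\
&\le \frac{1}{\mu_F}\Big(2\zeta^{(\mathrm{global})} + 4p_1\lambda\varepsilon^{(1)} + 2p_1\lambda s_1 + \sqrt{2\mu_F\delta^{(\mathrm{global})}}\Big),
\end{aligned}$$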

which is exactly (60). ■
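The last step of the proof of Lemma 27 solves a quadratic inequality of the form $\frac{a}{2}s^2 - bs - c \le 0$ and then relaxes the root using $\sqrt{b^2 + 2ac} \le b + \sqrt{2ac}$. The following sketch checks both steps on random coefficients; the generic names `a`, `b`, `c` stand in for $\mu_F$, $\zeta^{(\mathrm{global})} + 2p_1\lambda\varepsilon^{(1)} + p_1\lambda s_1$, and $\delta^{(\mathrm{global})}$, and the sampling ranges are arbitrary.

```python
import numpy as np

def largest_root(a, b, c):
    """Largest s with (a/2) s^2 - b s - c <= 0, for a > 0 and b, c >= 0."""
    return (b + np.sqrt(b * b + 2.0 * a * c)) / a

rng = np.random.default_rng(0)
max_residual = 0.0
max_gap = -np.inf
for _ in range(1000):
    a = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, 5.0)
    c = rng.uniform(0.0, 5.0)
    s_star = largest_root(a, b, c)
    residual = 0.5 * a * s_star**2 - b * s_star - c    # vanishes at the root
    relaxed = (2.0 * b + np.sqrt(2.0 * a * c)) / a      # simplified bound used in the proof
    max_residual = max(max_residual, abs(residual))
    max_gap = max(max_gap, s_star - relaxed)

print(max_residual, max_gap)   # residual near zero; gap never positive
```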

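Lemma 28 (Parameter stability) Under the same assumptions as Proposition 21, for any $i\in[m]$, $j_i\in[n_i]$, we have

$$\big\|\hat w^{(i)}(S^{(i,j_i)}) - \hat w^{(i)}(S)\big\| \le \frac{16\sqrt{2\beta\|\ell\|_\infty}}{n_i(\mu+\lambda)} + \mathcal{E}_{\lambda,i}.$$

Proof Without loss of generality we consider the first client. Since $L_1(\cdot, S_1) + \frac{\lambda}{2}\|\hat w^{(\mathrm{global})}(S) - \cdot\|^2$ is $(\mu+\lambda)$-strongly convex, we have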

12(μ+λ)w^(1)(S)-w^(1)S1,j12L1w^1S1,j1,S1+λ2w^global S-w^1S1,j12-L1w^1S,S1+λ2w^global S-w^1S2+L1w^1S,S1+λw^1S-w^global S,w^1S1,j1-w^1S(56)L1w^(1)S1,j1,S1j1+λ2w^(global )S1,j1-w^(1)S1,j12-L1w^(1)(S),S1j1+λ2w^(global )S1,j1-w^(1)(S)2-1n1w^(1)S1,j1,z1,j1+1n1w^(1)S1,j1,zj1(1)+1nw^(1)(S),z1,j1-1n1w^(S),zj1(1)-λ2w^(global )S1,j1-w^(1)S1,j12+λ2w^(global )(S)-w^(1)S1,j12+λ2w^(global )S1,j1-w^(1)(S)2-λ2w^(global )(S)-w^(1)(S)2+ζ(1)w^(1)S1,j1-w^(1)(S)(55)δ(1)+ζ(1)w^(1)S1,j1-w^(1)(S)+λw^(global )(S)-w^(global )S1,j1,w^(S)-w^(1)S1,j1+1n1w^(1)(S),z1,j1-w^(1)S1,j1,z1,j1+w^(1)S1,j1,zj1(1)-w^(1)(S),zj1(1)δ(1) +ζ(1)w^(1)S1,j1-w^(1)(S)+2n12βw^(1)(S)-w^(1)S1,j1+β2w^(1)(S)-w^(1)S1,j12+λ+μμ2ζ(global )+2λμδ (global) λ+μ+4piλε(i)+2piλw^(i)Si,j1-w^(i)(S)×w^(1)(S)-w^(1)S1,j1,

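where the last inequality is by Lemma 26 and Lemma 27. Denoting $s_1 := \big\|\hat w^{(1)}(S) - \hat w^{(1)}(S^{(1,j_1)})\big\|$, the above inequality can be written as

$$C_{\lambda,1}s_1^2 - \Big[\frac{2\sqrt{2\beta\|\ell\|_\infty}}{n_1} + \zeta^{(1)} + \frac{\lambda+\mu}{\mu}\Big(2\zeta^{(\mathrm{global})} + 4p_1\lambda\varepsilon^{(1)} + \sqrt{\frac{2\lambda\mu\,\delta^{(\mathrm{global})}}{\lambda+\mu}}\Big)\Big]s_1 - \delta^{(1)} \le 0, \tag{61}$$

where

$$C_{\lambda,1} := \frac{1}{2}(\mu+\lambda) - \frac{\beta}{n_1} - \frac{2p_1\lambda(\lambda+\mu)}{\mu}.$$

By (48), we have

$$C_{\lambda,1} \ge \frac{\mu+\lambda}{2} - \frac{\mu}{4} - \frac{2p_1\lambda(\lambda+\mu)}{\mu} \ge \frac{\lambda+\mu}{4} - \frac{2p_1\lambda(\lambda+\mu)}{\mu} = \frac{\lambda+\mu}{4}\Big(1 - \frac{8p_1\lambda}{\mu}\Big) \ge \frac{\lambda+\mu}{8}.$$

In particular, $C_{\lambda,1} > 0$, and thus we can solve the quadratic inequality (61) (similar to the proof of Lemma 27) to get

$$s_1 \le \frac{2\sqrt{2\beta\|\ell\|_\infty}}{C_{\lambda,1}n_1} + \frac{1}{C_{\lambda,1}}\Big[\zeta^{(1)} + \frac{\lambda+\mu}{\mu}\Big(2\zeta^{(\mathrm{global})} + 4p_1\lambda\varepsilon^{(1)} + \sqrt{\frac{2\lambda\mu\,\delta^{(\mathrm{global})}}{\lambda+\mu}}\Big)\Big] + \sqrt{\frac{\delta^{(1)}}{C_{\lambda,1}}}.$$

Plugging $C_{\lambda,1} \ge (\lambda+\mu)/8$ into the above inequality gives the desired result. ■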

We are finally ready to present a proof of Lemma 24:

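Proof [Proof of Lemma 24] Invoking Lemma 26 together with Lemma 28, we have

$$\begin{aligned}
\gamma_i &\le \sqrt{2\beta\|\ell\|_\infty}\Big(\frac{16\sqrt{2\beta\|\ell\|_\infty}}{n_i(\mu+\lambda)} + \mathcal{E}_{\lambda,i}\Big) + \frac{\beta}{2}\Big(\frac{16\sqrt{2\beta\|\ell\|_\infty}}{n_i(\mu+\lambda)} + \mathcal{E}_{\lambda,i}\Big)^2 \\
&\le \frac{32\beta\|\ell\|_\infty}{n_i(\mu+\lambda)}\Big(1 + \frac{16\beta}{n_i(\mu+\lambda)}\Big) + \sqrt{2\beta\|\ell\|_\infty}\,\mathcal{E}_{\lambda,i} + \beta\,\mathcal{E}_{\lambda,i}^2,
\end{aligned}$$

where in the last line we have used $(a+b)^2 \le 2a^2 + 2b^2$. We finish the proof by noting that $\frac{\beta}{n_i(\mu+\lambda)} \le \frac{\beta}{n_i\mu} \le \frac14$, where the last inequality is by (48), so that the first term is at most $\frac{160\beta\|\ell\|_\infty}{n_i(\mu+\lambda)}$. ■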

C.4. Proof of Theorem 12

Compared to the proof of Theorem 11, we need to additionally control the estimation error of the global model.

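Proposition 29 (Estimation error of the global model) Let Assumptions A(b) and B(b) hold. Then

$$\mathbb{E}_S\big\|\tilde w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|^2 \le \frac{48\beta^2\sigma^2}{\mu^2\lambda^2}\Big(\sum_{i\in[m]}\frac{p_i}{\sqrt{n_i}}\Big)^2 + \frac{48\beta^2R^2}{\mu^2} + \frac{12(\mu+\lambda)^2\sigma^2}{\mu^2\lambda^2}\sum_{i\in[m]}\frac{p_i^2}{n_i}.$$

Proof See Appendix C.4.1. ■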

With the above proposition, the following result is a counterpart of Proposition 23.

Proposition 30 ($\lambda$-dependent bound on the IER) Let Assumptions A(a, b), B(b) and Equation (48) hold. Run $\mathcal{A}_{\mathrm{FP}}$ with hyperparameters chosen as in Lemmas 14 and 15. Then, for any $i\in[m]$, as long as

T C1λ(λ1)mp2nipi-1λλ1ni,  KT C2(λ+1)2nipi-1λ2pi2ni, (62)

the algorithm 𝒜FP satisfies both

E𝒜FP,SIERi𝒜FPCλniβ+σ2β2niμ2i[m]pini2+σ2nii[m]pi2ni+Cλ1+β2μ2R2+σ2μ2i[m]pi2ni, (63)

and

E𝒜FP,SIERi𝒜FPCβμni+λD2, (64)

where $C_1$, $C_2$ are two constants depending only on $\mu$, $\beta$, $\|\ell\|_\infty$, $D$, and $C$ is an absolute constant.

Proof Without loss of generality we consider the first client. Our assumptions allow us to invoke Propositions 10 and 22 to get

E𝒜FP,SIER1 E𝒜FP,SOPTp1+3λ2ε(global)2+3λ2w˜(global)-wp(global)2+3λ2R2+320βn1(μ+λ)

We first show that the expected value of OPT/p1 and λε(global)2 are both bounded above by a constant multiple of βn1(μ+λ). Indeed, by the estimates we have established in Equations (50) and (52), it suffices to require

Tμ,β,μ,Dλλ1m||P||2n1,

and

KTμ,β,μ,D(λ1)2n1p1, Tμ,β,μ,Dλλ1mp2n1p1,

respectively. And the above two displays, combined with (49), is exactly (62). (64) then follows from the compactness of 𝒲. To prove (63), we invoke Proposition 29 to get

E𝒜FP,SIER1βn1μ+λ+λ1+β2μ2R2+β2σ2μ2λimpini2+(μ+λ)2σ2μ2λimpi2niβn1λ+λ1+β2μ2R2+β2σ2μ2λimpini2+σ2λ+λσ2μ2impi2ni,

and (63) follows by rearranging terms. ■

We now present our proof of Theorem 12.

Proof [Proof of Theorem 12] Without loss of generality we consider the first client. Since all the $n_i$'s are of the same order, it suffices to show

EIERi𝒜FPμ+μ-1β+σ2β2+β2+σ2μ2+μD2RN/m1N/m+mN (65)

We define the following two events:

ARmN, BAc=R<mN,

and we set

λ=cAmD2N1A+cBmR2N+11B,

where cA, cB are two constants to be specified later. We consider two cases:

  1. If A holds, then from (64) we have E𝒜FP,SIER1βμ+cA1N/m, provided λpmaxμ/16. Note that λpmaxcAD2NcA/D2. So we can choose λμD2, which gives
    E𝒜FP,SIER1βμ+μD21N/mright-hand side of (65).
  2. If B holds, and if λpmaxμ/16 holds, then from (63) we have
     E𝒜FP,SIER1 1λN/mβ+σ2β2μ2+σ2+λ1+β2μ2R2+σ2μ2Nβ+σ2β2+β2+σ2μ21λN/m+λR2+N-1=β+σ2β2+β2+σ2μ2R2N+1cBN/m+cBNmR2N+1β+σ2β2+β2+σ2μ2RcBN/m+1cBN/m+cBmRN+cBmN=β+σ2β2+β2+σ2μ2cB+cB-1RN/m+mN.

Note that pmaxλcBpmaxmcB/mcB. So to satisfy pmaxλμ/16, we can choose cBμ. This gives

E𝒜FP,SIER1 β+σ2β2+β2+σ2μ2μ+μ-1RNm+mNright-hand side of (65).

The desired result follows by combining the above two cases together. ■

C.4.1. Proof of Proposition 29: Estimation Error of the Global Model

We begin by proving a useful lemma.

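Lemma 31 (Estimating $w_\star^{(i)}$ given the knowledge of $w_p^{(\mathrm{global})}$) Let Assumption A(b) hold. Then for any $i\in[m]$, we have

$$\big\|w_\star^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_p^{(\mathrm{global})}\big)\big\| \le \frac{2}{\mu+\lambda}\big\|\nabla L_i\big(w_\star^{(i)}, S_i\big) + \lambda\big(w_\star^{(i)} - w_p^{(\mathrm{global})}\big)\big\|.$$

Proof This follows from an adaptation of the arguments in Theorem 7 of Foster et al. (2019). By strong convexity, we have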

Liw(i),Si+λw(i)-wp(global ),ProxLi/λwp(global )-w(i)+μ+λ2w(i)-ProxLi/λwp(global )2Liwi,Si+λ2wglobal -wi2-LiProxLiλwpglobal ,Si-λ2wpglobal -ProxLiλwpglobal 20.

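If $\big\|w_\star^{(i)} - \mathrm{Prox}_{L_i/\lambda}\big(w_p^{(\mathrm{global})}\big)\big\| = 0$ we are done. Otherwise, the Cauchy–Schwarz inequality applied to the above display gives the desired result. ■

Now, since $\sum_{i\in[m]} p_i F_i$ is $\mu_F = \mu\lambda/(\mu+\lambda)$-strongly convex, we have

$$\Big\langle\sum_{i\in[m]} p_i\nabla F_i\big(w_p^{(\mathrm{global})}\big),\ \tilde w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\Big\rangle + \frac{\mu_F}{2}\big\|\tilde w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|^2 \le \sum_{i\in[m]} p_i F_i\big(\tilde w^{(\mathrm{global})}\big) - \sum_{i\in[m]} p_i F_i\big(w_p^{(\mathrm{global})}\big) \le 0.$$

If $\|\tilde w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\| = 0$ we are done. Otherwise, by the Cauchy–Schwarz inequality, we get

$$\begin{aligned}
\big\|\tilde w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|
&\le \frac{2}{\mu_F}\Big\|\sum_{i\in[m]} p_i\nabla F_i\big(w_p^{(\mathrm{global})}\big)\Big\| = \frac{2}{\mu_F}\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(\mathrm{Prox}_{L_i/\lambda}\big(w_p^{(\mathrm{global})}\big), S_i\big)\Big\| \\
&\le \frac{2}{\mu_F}\Big\|\sum_{i\in[m]} p_i\Big(\nabla L_i\big(\mathrm{Prox}_{L_i/\lambda}\big(w_p^{(\mathrm{global})}\big), S_i\big) - \nabla L_i\big(w_\star^{(i)}, S_i\big)\Big)\Big\| + \frac{2}{\mu_F}\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\| \\
&\overset{(*)}{\le} \frac{2\beta}{\mu_F}\sum_{i\in[m]} p_i\big\|\mathrm{Prox}_{L_i/\lambda}\big(w_p^{(\mathrm{global})}\big) - w_\star^{(i)}\big\| + \frac{2}{\mu_F}\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\| \\
&\overset{(**)}{\le} \frac{4\beta}{\mu_F(\mu+\lambda)}\sum_{i\in[m]} p_i\big\|\nabla L_i\big(w_\star^{(i)}, S_i\big) + \lambda\big(w_\star^{(i)} - w_p^{(\mathrm{global})}\big)\big\| + \frac{2}{\mu_F}\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\| \\
&\le \frac{4\beta}{\mu\lambda}\sum_{i\in[m]} p_i\big\|\nabla L_i\big(w_\star^{(i)}, S_i\big)\big\| + \frac{4\beta R}{\mu} + \frac{2(\mu+\lambda)}{\mu\lambda}\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\|,
\end{aligned}$$

where (*) is by smoothness of $L_i$ and (**) is by Lemma 31. Thus, we have

$$\big\|\tilde w^{(\mathrm{global})} - w_p^{(\mathrm{global})}\big\|^2 \le \frac{48\beta^2}{\mu^2\lambda^2}\Big(\sum_{i\in[m]} p_i\big\|\nabla L_i\big(w_\star^{(i)}, S_i\big)\big\|\Big)^2 + \frac{48\beta^2R^2}{\mu^2} + \frac{12(\mu+\lambda)^2}{\mu^2\lambda^2}\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\|^2. \tag{66}$$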

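Note that

$$\Big(\sum_{i\in[m]} p_i\big\|\nabla L_i\big(w_\star^{(i)}, S_i\big)\big\|\Big)^2 = \sum_{i\in[m]} p_i^2\big\|\nabla L_i\big(w_\star^{(i)}, S_i\big)\big\|^2 + \sum_{i\ne s} p_ip_s\big\|\nabla L_i\big(w_\star^{(i)}, S_i\big)\big\|\,\big\|\nabla L_s\big(w_\star^{(s)}, S_s\big)\big\|.$$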

Taking expectation at both sides, we arrive at

ESi[m]piLiw(i),Si2impi2σ2ni+ispipsσiσsnins=σ2impini2.

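Meanwhile, we have

$$\mathbb{E}_S\Big\|\sum_{i\in[m]} p_i\nabla L_i\big(w_\star^{(i)}, S_i\big)\Big\|^2 \le \sum_{i\in[m]}\frac{p_i^2\sigma^2}{n_i}.$$

The desired result follows by plugging the previous two displays into (66).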

D. Details on Experiments

In each round (among 100 rounds) of simulation, we first generate $w \in \mathbb{R}^{100}$ with i.i.d. standard Gaussian entries, and we set each local model $w_\star^{(i)} = w + Rv_i$, where $v_i \in \mathbb{R}^{100}$ is a random unit vector that has negative correlation with $w$, and we vary $R$ from 0 to 20. The dataset for the $i$-th client is then generated by a logistic regression model. We apply FedAvg (Algorithm 1), PureLocalTraining, FedAvg followed by fine tuning, and FedProx (Algorithm 2) to this collection of datasets.

For FedAvg, we assume full participation (i.e., $\mathcal{C}_t = [m]$) and we set the number of communication rounds $T = 20$ and the global step size $\eta_t = 0.8$. In its local training stage, we run SGD for 5 epochs with step size 0.2. For PureLocalTraining, we run SGD with step size 0.2 for $20 \cdot 5 = 100$ epochs. For the fine tuning strategy, we first run FedAvg (with the same hyperparameters as in the previous case) and then, for each client, run SGD for 15 epochs with step size 0.2. For FedProx, we again assume full participation, and we set the number of communication rounds $T = 20$, global step size $\eta_t^{(\mathrm{global})} = 0.8$, local rounds $K_t = 5$, and local step size $\eta_{t,k}^{(i)} = 0.2$. In all the experiments, the batch size is set to 16.
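The data-generating process described above can be sketched as follows. Several details are not specified in the text and are assumptions here: the function name `make_clients`, standard Gaussian features, labels in {0, 1}, an equal sample size per client, and the way negative correlation is enforced (a uniformly random unit vector whose sign is flipped if its inner product with the shared vector is positive).

```python
import numpy as np

def make_clients(m, n_per_client, R, d=100, seed=0):
    """Generate m clients whose true logistic-regression parameters are w + R * v_i."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)                       # shared vector with i.i.d. N(0, 1) entries
    clients, local_models = [], []
    for _ in range(m):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)                       # random unit vector
        if v @ w > 0:                                # assumption: flip sign for negative correlation
            v = -v
        w_i = w + R * v
        X = rng.standard_normal((n_per_client, d))   # assumption: Gaussian features
        prob = 1.0 / (1.0 + np.exp(-X @ w_i))
        y = (rng.random(n_per_client) < prob).astype(float)
        clients.append((X, y))
        local_models.append(w_i)
    return w, np.array(local_models), clients

w, W_local, clients = make_clients(m=3, n_per_client=20, R=2.0, seed=1)
print(np.linalg.norm(W_local - w, axis=1))           # each local model is at distance R from w
```

Under this construction $R$ directly controls how far each local model sits from the shared vector, which is the heterogeneity knob varied from 0 to 20 in the experiments.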
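For orientation, a minimal FedAvg loop with the stated hyperparameters (full participation, $T = 20$ rounds, global step size 0.8, 5 local epochs with step size 0.2, batch size 16) might look like the sketch below. Algorithm 1 is not reproduced in this appendix, so the aggregation rule (a damped step toward the weighted average of local models), the weights proportional to sample sizes, the helper names, and the small synthetic dataset are all assumptions rather than the authors' implementation.

```python
import numpy as np

def sgd_epochs(w, X, y, epochs, step, batch, rng):
    """Minibatch SGD on the logistic loss with labels in {0, 1}."""
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            prob = 1.0 / (1.0 + np.exp(-X[idx] @ w))
            w = w - step * X[idx].T @ (prob - y[idx]) / len(idx)
    return w

def fedavg(clients, rounds=20, global_step=0.8, local_epochs=5,
           local_step=0.2, batch=16, seed=0):
    rng = np.random.default_rng(seed)
    d = clients[0][0].shape[1]
    sizes = np.array([X.shape[0] for X, _ in clients], dtype=float)
    p = sizes / sizes.sum()                          # assumption: weights proportional to n_i
    w_global = np.zeros(d)
    for _ in range(rounds):
        local = [sgd_epochs(w_global.copy(), X, y, local_epochs, local_step, batch, rng)
                 for X, y in clients]                # full participation
        avg = sum(p_i * w_i for p_i, w_i in zip(p, local))
        w_global = w_global + global_step * (avg - w_global)   # assumption: damped averaging
    return w_global

def avg_logistic_loss(w, clients):
    losses = [np.mean(np.logaddexp(0.0, X @ w) - y * (X @ w)) for X, y in clients]
    return float(np.mean(losses))

# Tiny homogeneous example (hypothetical sizes, much smaller than the paper's setup).
rng = np.random.default_rng(1)
w_true = rng.standard_normal(10)
clients = []
for _ in range(4):
    X = rng.standard_normal((64, 10))
    y = (rng.random(64) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
    clients.append((X, y))

w_hat = fedavg(clients)
final_loss = avg_logistic_loss(w_hat, clients)
print(final_loss)
```

PureLocalTraining corresponds to calling `sgd_epochs` on each client separately for 100 epochs, and the fine-tuning strategy to running 15 further local epochs starting from the FedAvg output.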

Footnotes

1.

In fact, Algorithm 2 is not exactly the same as the original FedProx algorithm introduced in Li et al. (2018). But since both algorithms share the idea of imposing regularization, we still call Algorithm 2 FedProx for conceptual simplicity.

2.

The theories developed by Hanzely et al. (2020) can be useful for such an analysis.

References

  1. Agarwal Alekh, Bartlett Peter L, Ravikumar Pradeep, and Wainwright Martin J. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.
  2. Arivazhagan Manoj Ghuhan, Aggarwal Vinay, Singh Aaditya Kumar, and Choudhary Sunav. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.
  3. Assouad Patrice. Deux remarques sur l′estimation. Comptes rendus des séances de l′Académie des sciences. Série 1, Mathématique, 296(23):1021–1024, 1983.
  4. Bai Yu, Chen Minshuo, Zhou Pan, Zhao Tuo, Lee Jason D, Kakade Sham, Wang Huan, and Xiong Caiming. How important is the train-validation split in meta-learning? arXiv preprint arXiv:2010.05843, 2020.
  5. Balcan Maria-Florina, Khodak Mikhail, and Talwalkar Ameet. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424–433. PMLR, 2019.
  6. Baxter Jonathan. A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198, 2000.
  7. Bayoumi Ahmed Khaled Ragab, Mishchenko Konstantin, and Richtarik Peter. Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519–4529, 2020.
  8. Ben-David Shai and Borbely Reba Schuller. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine learning, 73(3):273–287, 2008.
  9. Ben-David Shai, Blitzer John, Crammer Koby, and Pereira Fernando. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19:137–144, 2006.
  10. Ben-David Shai, Blitzer John, Crammer Koby, Kulesza Alex, Pereira Fernando, and Vaughan Jennifer Wortman. A theory of learning from different domains. Machine learning, 79(1–2):151–175, 2010.
  11. Bonawitz Keith, Eichner Hubert, Grieskamp Wolfgang, Huba Dzmitry, Ingerman Alex, Ivanov Vladimir, Kiddon Chloe, Konečnỳ Jakub, Mazzocchi Stefano, and McMahan H Brendan. Towards federated learning at scale: System design. Conference on Machine Learning and Systems, 2019.
  12. Bousquet Olivier and Elisseeff André. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
  13. Cai T Tony and Wei Hongji. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. arXiv preprint arXiv:1906.02903, 2019.
  14. Caruana Rich. Multitask learning. Machine learning, 28(1):41–75, 1997.
  15. Chen Shuxiao, Liu Sifan, and Ma Zongming. Global and individualized community detection in inhomogeneous multilayer networks. arXiv preprint arXiv:2012.00933, 2020.
  16. Denevi Giulia, Ciliberto Carlo, Stamos Dimitris, and Pontil Massimiliano. Learning to learn around a common mean. In Advances in Neural Information Processing Systems, pages 10169–10179, 2018.
  17. Denevi Giulia, Ciliberto Carlo, Grazzi Riccardo, and Pontil Massimiliano. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575, 2019.
  18. Deng Yuyang, Kamani Mohammad Mahdi, and Mahdavi Mehrdad. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
  19. Dinh Canh T, Tran Nguyen H, and Nguyen Tuan Dung. Personalized federated learning with moreau envelopes. arXiv preprint arXiv:2006.08848, 2020.
  20. Du Simon S, Hu Wei, Kakade Sham M, Lee Jason D, and Lei Qi. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
  21. Duchi John. Lecture notes for statistics 311/electrical engineering 377. http://web.stanford.edu/class/stats311/lecture-notes.pdf, 2019. Accessed: 2020-10-03.
  22. Evgeniou Theodoros and Pontil Massimiliano. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117, 2004.
  23. Fallah Alireza, Mokhtari Aryan, and Ozdaglar Asuman. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948, 2020.
  24. Finn Chelsea, Abbeel Pieter, and Levine Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017. [Google Scholar]
  25. Foster Dylan J., Sekhari Ayush, Shamir Ohad, Srebro Nathan, Sridharan Karthik, and Woodworth Blake. The complexity of making the gradient small in stochastic convex optimization, 2019.
  26. Haddadpour Farzin and Mahdavi Mehrdad. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019. [Google Scholar]
  27. Hanneke Steve and Kpotufe Samory. On the value of target data in transfer learning. In Advances in Neural Information Processing Systems, pages 9871–9881, 2019. [Google Scholar]
  28. Hanneke Steve and Kpotufe Samory. A no-free-lunch theorem for multitask learning. arXiv preprint arXiv:2006.15785, 2020. [Google Scholar]
  29. Hanzely Filip and Richtárik Peter. Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516, 2020. [Google Scholar]
  30. Hanzely Filip, Hanzely Slavomír, Horváth Samuel, and Richtárik Peter. Lower bounds and optimal algorithms for personalized federated learning. Advances in Neural Information Processing Systems, 33, 2020. [Google Scholar]
  31. Jiang Yihan, Konečný Jakub, Rush Keith, and Kannan Sreeram. Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488, 2019. [Google Scholar]
  32. Jose Sharu Theresa and Simeone Osvaldo. An information-theoretic analysis of the impact of task similarity on meta-learning. arXiv preprint arXiv:2101.08390, 2021. [Google Scholar]
  33. Kairouz Peter, McMahan H Brendan, Avent Brendan, Bellet Aurélien, Bennis Mehdi, Bhagoji Arjun Nitin, Bonawitz Keith, Charles Zachary, Cormode Graham, and Cummings Rachel. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019. [Google Scholar]
  34. Kalan Seyed Mohammadreza Mousavi, Fabian Zalan, Avestimehr A Salman, and Soltanolkotabi Mahdi. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. arXiv preprint arXiv:2006.10581, 2020. [Google Scholar]
  35. Khaled Ahmed, Mishchenko Konstantin, and Richtárik Peter. First analysis of local gd on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019. [Google Scholar]
  36. Khodak Mikhail, Balcan Maria-Florina F, and Talwalkar Ameet S. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pages 5917–5928, 2019. [Google Scholar]
  37. Konobeev Mikhail, Kuzborskij Ilja, and Szepesvári Csaba. On optimality of meta-learning in fixed-design regression with weighted biased regularization. arXiv preprint arXiv:2011.00344, 2020. [Google Scholar]
  38. Kulkarni Viraj, Kulkarni Milind, and Pant Aniruddha. Survey of personalization techniques for federated learning. arXiv preprint arXiv:2003.08673, 2020. [Google Scholar]
  39. Lemaréchal Claude and Sagastizábal Claudia. Practical aspects of the moreau-yosida regularization: Theoretical preliminaries. SIAM Journal on Optimization, 7(2):367–385, 1997. [Google Scholar]
  40. Li Daliang and Wang Junpu. Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019. [Google Scholar]
  41. Li Sai, Cai T Tony, and Li Hongzhe. Transfer learning for high-dimensional linear regression: Prediction, estimation, and minimax optimality. arXiv preprint arXiv:2006.10593, 2020a. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Li Tian, Sahu Anit Kumar, Zaheer Manzil, Sanjabi Maziar, Talwalkar Ameet, and Smith Virginia. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018. [Google Scholar]
  43. Li Xiang, Huang Kaixuan, Yang Wenhao, Wang Shusen, and Zhang Zhihua. On the convergence of fedavg on non-iid data, 2020b.
  44. Li Zhize and Richtárik Peter. A unified analysis of stochastic gradient methods for nonconvex federated optimization. arXiv preprint arXiv:2006.07013, 2020. [Google Scholar]
  45. Lucas James, Ren Mengye, Kameni Irene, Pitassi Toniann, and Zemel Richard. Theoretical bounds on estimation error for meta-learning. arXiv preprint arXiv:2010.07140, 2020. [Google Scholar]
  46. Malinovsky Grigory, Kovalev Dmitry, Gasanov Elnur, Condat Laurent, and Richtárik Peter. From local sgd to local fixed point methods for federated learning. arXiv preprint arXiv:2004.01442, 2020. [Google Scholar]
  47. Mangasarian OL and Solodov MV. Backpropagation convergence via deterministic nonmonotone perturbed minimization. In Proceedings of the 6th International Conference on Neural Information Processing Systems, pages 383–390, 1993. [Google Scholar]
  48. Mansour Yishay, Mohri Mehryar, Ro Jae, and Suresh Ananda Theertha. Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619, 2020. [Google Scholar]
  49. Maurer Andreas. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6(Jun):967–994, 2005. [Google Scholar]
  50. Maurer Andreas, Pontil Massimiliano, and Romera-Paredes Bernardino. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1): 2853–2884, 2016. [Google Scholar]
  51. McMahan Brendan, Moore Eider, Ramage Daniel, Hampson Seth, and Agüera y Arcas Blaise. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017. [Google Scholar]
  52. Nesterov Yurii. Lectures on convex optimization, volume 137. Springer, 2018. [Google Scholar]
  53. Pan Sinno Jialin and Yang Qiang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009. [Google Scholar]
  54. Poushter Jacob. Smartphone ownership and internet usage continues to climb in emerging economies. Pew research center, 22(1):1–44, 2016. [Google Scholar]
  55. Rakhlin Alexander, Shamir Ohad, and Sridharan Karthik. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011. [Google Scholar]
  56. Shalev-Shwartz Shai and Ben-David Shai. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [Google Scholar]
  57. Shalev-Shwartz Shai, Shamir Ohad, Srebro Nathan, and Sridharan Karthik. Stochastic convex optimization. In COLT, 2009. [Google Scholar]
  58. Shamir Ohad and Zhang Tong. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71–79, 2013. [Google Scholar]
  59. Shui Changjian, Chen Qi, Wen Jun, Zhou Fan, Gagné Christian, and Wang Boyu. Beyond ℋ-divergence: Domain adaptation theory with jensen-shannon divergence. arXiv preprint arXiv:2007.15567, 2020. [Google Scholar]
  60. Stich Sebastian U. Local sgd converges fast and communicates little, 2019.
  61. Tripuraneni Nilesh, Jin Chi, and Jordan Michael I. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020a. [Google Scholar]
  62. Tripuraneni Nilesh, Jordan Michael I, and Jin Chi. On the theory of transfer learning: The importance of task diversity. arXiv preprint arXiv:2006.11650, 2020b. [Google Scholar]
  63. Vershynin Roman. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010. [Google Scholar]
  64. Wang Weiran, Wang Jialei, Kolar Mladen, and Srebro Nathan. Distributed stochastic multi-task learning with graph regularization. arXiv preprint arXiv:1802.03830, 2018. [Google Scholar]
  65. Woodworth Blake, Patel Kumar Kshitij, and Srebro Nathan. Minibatch vs local sgd for heterogeneous distributed learning. arXiv preprint arXiv:2006.04735, 2020. [Google Scholar]
  66. Yu Bin. Assouad, fano, and le cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997. [Google Scholar]
  67. Yu Tao, Bagdasaryan Eugene, and Shmatikov Vitaly. Salvaging federated learning by local adaptation, 2020.
  68. Yuan Honglin and Ma Tengyu. Federated accelerated stochastic gradient descent. arXiv preprint arXiv:2006.08950, 2020. [Google Scholar]
  69. Zhang Hongyang R, Yang Fan, Wu Sen, Su Weijie J, and Ré Christopher. Sharp bias-variance tradeoffs of hard parameter sharing in high-dimensional linear regression. arXiv preprint arXiv:2010.11750, 2020. [Google Scholar]
  70. Zheng Qinqing, Chen Shuxiao, Long Qi, and Su Weijie J. Federated f-differential privacy. arXiv preprint arXiv:2102.11158, 2021. [PMC free article] [PubMed] [Google Scholar]
