Exact and Linear Convergence for Federated Learning under Arbitrary Client Participation is Attainable

Bicheng Ying; Zhe Li; Haibo Yang

Author manuscript; available in PMC: 2026 May 8.

Published in final edited form as: Adv Neural Inf Process Syst. 2025 Dec;38:40156–40201.

PMCID: PMC13152002 NIHMSID: NIHMS2161725 PMID: 42111900

Abstract

This work tackles the fundamental challenges in Federated Learning (FL) posed by arbitrary client participation and data heterogeneity, prevalent characteristics in practical FL settings. It is well-established that popular FedAvg-style algorithms struggle with exact convergence and can suffer from slow convergence rates since a decaying learning rate is required to mitigate these scenarios. To address these issues, we introduce the concept of stochastic matrix and the corresponding time-varying graphs as a novel modeling tool to accurately capture the dynamics of arbitrary client participation and the local update procedure. Leveraging this approach, we offer a fresh decentralized perspective on designing FL algorithms and present FOCUS, Federated Optimization with Exact Convergence via Push-pull Strategy, a provably convergent algorithm designed to effectively overcome the previously mentioned two challenges. More specifically, we provide a rigorous proof demonstrating that FOCUS achieves exact convergence with a linear rate regardless of the arbitrary client participation, establishing it as the first work to demonstrate this significant result.

1. Introduction

Federated Learning (FL) has emerged as a powerful paradigm for distributed learning, enabling multiple clients to collaboratively train models without sharing raw data. Yet, a central challenge in FL is the arbitrary and unpredictable nature of client participation. In real-world FL, clients may join or leave at will, participate intermittently, or drop out due to connectivity or resource constraints.

Recall the goal of the FL problem is to minimize the following sum-of-loss function:

$$F(x) := \frac{1}{N} \sum_{n=1}^{N} f_{n}(x), \qquad f_{n}(x) := \mathbb{E}_{ξ \sim \mathcal{D}_{n}}\, \hat{f}_{n}(x; ξ), \tag{1}$$

where $x \in R^{d}$ represents the $d$ -dimensional model parameter and $f_{n}$ stands for the local cost function. It is well established that when clients perform multiple local updates on non-i.i.d. data, their local models tend to diverge. This leads to client drift from the optimal solution of problem (1), a phenomenon that persists even under the often impractical uniform client sampling assumption [Karimireddy et al., 2020, Li et al., 2020]. Moreover, arbitrary client participation introduces another objective bias: instead of converging to the true global optimum, the global model converges to a stationary point of a distorted, participation-weighted objective [Wang et al., 2020, Wang and Ji, 2022]. To mitigate this persistent error, existing methods typically require decaying the learning rate asymptotically to zero, at least in theory. While this strategy can reduce the bias in the limit, it often leads to slower convergence. Hence, a key question naturally arises:

Question: Is it possible to achieve exact convergence under both arbitrary client participation and multiple local updates without decaying the learning rate?

We answer this question affirmatively in this paper. We begin by introducing a novel analytical framework that reformulates the core operations of FL (client participation, local updates, and model aggregation over time-varying graphs) as a sequence of stochastic matrix multiplications [Horn and Johnson, 2012]. Next, with this tool, we develop a new algorithm FOCUS, Federated Optimization with Exact Convergence via Push-pull Strategy, which is inspired by decentralized optimization algorithms [Nedic and Ozdaglar, 2009, Sayed et al., 2014, Lian et al., 2017, Lan et al., 2020]. More specifically, we leverage the push-pull technique [Xin and Khan, 2018, Pu et al., 2020] with time-varying graphs [Nedic et al., 2017, Ying et al., 2021, Nguyen et al., 2025] instead of the commonly used static or strongly connected communication graphs, since FOCUS is designed for the FL setting. Compared to variance reduction techniques [Johnson and Zhang, 2013, Defazio et al., 2014] or adaptively learning the participation probabilities, the push-pull approach handles the unknown client participation scenario much better, both empirically and theoretically.

Our main contributions are summarized as follows:

We provide a systematic approach to reformulate all core processes of FL (client participation, local updating, and model aggregation) through stochastic matrix multiplication.
We propose Federated Optimization with Exact Convergence via Push-pull Strategy (FOCUS), which is derived from an optimization principle rather than a heuristic design.
Even under arbitrary client participation, FOCUS exhibits linear convergence (exponential decay) in both the strongly convex and the non-convex (with PL condition) scenarios, without assuming bounded heterogeneity or decaying the learning rate.
We also introduce a stochastic gradient variant, SG-FOCUS, which demonstrates faster convergence and higher accuracy, both theoretically and empirically.

2. Related Work

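FedAvg [McMahan et al., 2017] is the most widely adopted algorithm in FL. It roughly consists of three steps: 1) the server activates a subset of clients, which then retrieve the server’s current model; 2) each activated client independently updates the model by training on its local dataset; 3) the server aggregates the updated models received from the clients by computing their average. This process can be represented mathematically as:

$$x_{0, i}^{(r)} \Leftarrow x_{r}, \quad \forall i \in S_{r} \quad \text{(pull model)} \tag{2a}$$

$$x_{t + 1, i}^{(r)} = x_{t, i}^{(r)} - η \nabla f_{i}(x_{t, i}^{(r)}), \quad t = 0, \dots, τ - 1 \quad \text{(local update)} \tag{2b}$$

$$x_{r + 1} \Leftarrow \frac{1}{|S_{r}|} \sum_{i \in S_{r}} x_{τ, i}^{(r)} \quad \text{(aggregate model)} \tag{2c}$$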

where the set $S_{r}$ represents the indices of the sampled clients at communication round $r$. The notation $x_{r} \in R^{d}$ stands for the server’s model parameters at the $r$-th round, while $x_{t, i}^{(r)}$ stands for client $i$’s model at the $t$-th local update step in the $r$-th round. We use ⇐ to indicate that communication has happened between clients and the server.

Because of data heterogeneity and multiple local update steps, Li et al. [2020] have shown that the fixed point of FedAvg is not the same as the minimizer of (1) in the convex scenario. More specifically, they quantified that

$$\|x^{o} - x^{⋆}\|^{2} = Ω((τ - 1) η)\, \|x^{⋆}\|^{2}, \tag{3}$$

where $x^{o}$ is the fixed point of the FedAvg algorithm and $x^{⋆}$ is the optimal point. This phenomenon, commonly referred to as client drift [Karimireddy et al., 2020], can be mitigated by introducing a control variate during the local update step, an approach inspired by variance reduction techniques [Johnson and Zhang, 2013]. Prominent examples of this strategy, including SCAFFOLD [Karimireddy et al., 2020] and ProxSkip [Mishchenko et al., 2022], can further circumvent the need for a bounded heterogeneity assumption. Yet, this approach incurs increased communication costs, doubling them due to the transmission of a control variate with the same dimensionality as the model parameters.
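To make the drift concrete, the following minimal sketch runs the three FedAvg steps described above on two scalar quadratic losses with full participation. The losses $f_{i}(x) = \frac{a_{i}}{2}(x - c_{i})^{2}$, the constants, and the helper names are illustrative assumptions rather than anything taken from the paper's experiments.

```python
import numpy as np

def fedavg_round(x_server, grads, sampled, eta, tau):
    """One FedAvg round: pull, tau local gradient steps, then average."""
    local_models = []
    for i in sampled:
        x = x_server.copy()               # step 1: client pulls the server model
        for _ in range(tau):              # step 2: local updates on f_i only
            x = x - eta * grads[i](x)
        local_models.append(x)
    return np.mean(local_models, axis=0)  # step 3: server averages client models

# Hypothetical heterogeneous losses f_i(x) = 0.5 * a_i * (x - c_i)^2.
a = np.array([1.0, 4.0])
c = np.array([0.0, 1.0])
grads = [lambda x, i=i: a[i] * (x - c[i]) for i in range(2)]
x_star = np.sum(a * c) / np.sum(a)        # minimizer of the average loss

def run_fedavg(tau, eta=0.1, rounds=200):
    x = np.zeros(1)
    for _ in range(rounds):
        x = fedavg_round(x, grads, sampled=[0, 1], eta=eta, tau=tau)
    return float(x[0])

gap_tau1 = abs(run_fedavg(tau=1) - x_star)  # one local step: no drift
gap_tau5 = abs(run_fedavg(tau=5) - x_star)  # several local steps: biased fixed point
print(gap_tau1, gap_tau5)
```

With a single local step the iterates reach the minimizer, whereas with five local steps and the same constant learning rate they settle at a different point, which is the qualitative behavior that (3) quantifies.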

Many analytical studies on FL assume that the sampled clients are drawn from a uniform distribution, an assumption shared by the literature cited in the preceding paragraph, but this is rarely realistic in practice [Kairouz et al., 2021, Xiang et al., 2024, Li et al., 2025]. Wang and Ji [2022] show that FedAvg might fail to converge to $x^{⋆}$ under non-uniform sampling distributions, even with a decreasing learning rate $η$. To address the challenges posed by non-uniformity, a common approach involves either explicitly knowing or adaptively learning the client participation probabilities during the iterative process and modifying the averaging weights accordingly [Wang and Ji, 2024, Wang et al., 2024, Xiang et al., 2024]. Yet neither variant achieves exact convergence, and the learning process may slow down convergence. An alternative approach is to use Variance Reduction (VR) techniques, as seen in methods like MIFA [Gu et al., 2021] and FedVARP [Jhunjhunwala et al., 2022]. Yet the heuristic integration of VR with FL often fails to jointly address the client drift issue. This results, once again, in inexact convergence when a constant learning rate is used.

It is known that FL and decentralized optimization are closely related [Lalitha et al., 2018, Koloskova et al., 2020, Kairouz et al., 2021], and this work builds on tools introduced by the decentralized optimization community. We leave a detailed review of the decentralized literature to Appendix A.

3. Graph, Stochastic Matrix, and Arbitrary Client Participation

FL algorithms are commonly expressed in a per-client style, as exemplified by the previously highlighted FedAvg formulation (2a)–(2c). While this representation offers ease of understanding and facilitates straightforward programming implementation, a stacked vector-matrix representation can unlock more powerful mathematical tools for the design and analysis of FL algorithms.

To illustrate the concept, let us consider two toy examples of vector-matrix multiplication:

$$W_{assign}\, x = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \end{bmatrix} = \begin{bmatrix} x_{0} \\ x_{0} \\ x_{2} \end{bmatrix}, \qquad W_{avg}\, x = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \end{bmatrix} = \begin{bmatrix} (x_{1} + x_{2}) / 2 \\ x_{1} \\ x_{2} \end{bmatrix}$$

While the calculations themselves are straightforward, their significance lies in the appropriate interpretation of the matrices and vectors within the FL context. We interpret $x_{i} \in R^{1 \times d}$ as the model parameter stored at worker $i$. Index 0 is assigned to the server, and the remaining indices are assigned to the clients. Then $W_{assign}$ can be viewed as the server assigning its value $x_{0}$ to client 1, while the value of client 2 is unchanged, exactly as if client 2 did not participate. $W_{avg}$ can be viewed as the server setting its own value to the average of the values of workers 1 and 2. These two toy matrices reflect the model-pull and model-aggregation steps, two key steps in the FedAvg algorithm.
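The two toy products can be checked numerically. In this small sketch each row of `x` is one worker's parameter vector (row 0 is the server); the numeric values are arbitrary placeholders.

```python
import numpy as np

# Rows are workers: index 0 is the server, 1 and 2 are clients (d = 2 here).
x = np.array([[10.0, 10.0],   # x_0 (server)
              [1.0, 2.0],     # x_1
              [3.0, 4.0]])    # x_2

W_assign = np.array([[1.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
W_avg = np.array([[0.0, 0.5, 0.5],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

pulled = W_assign @ x    # client 1 copies the server row; client 2 is untouched
averaged = W_avg @ x     # server row becomes the mean of the two client rows
print(pulled)
print(averaged)
```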

More formally, given a sampled client indices set $S_{r}$ , subscript $r$ for the $r$ -th round, we define the model-assign matrix $R (S_{r})$ and the model-average matrix $A (S_{r})$ as

$$R(S_{r})[i, j] = \begin{cases} 1 & \text{if } i \in S_{r} \text{ and } j = 0 \\ 1 & \text{if } i \notin S_{r} \text{ and } j = i \\ 0 & \text{otherwise} \end{cases}, \qquad A(S_{r})[i, j] = \begin{cases} 1 & \text{if } i = j \neq 0 \\ 1 / |S_{r}| & \text{if } i = 0 \text{ and } j \in S_{r} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Although the notation in (4) may not be immediately transparent, the structure becomes clear from the illustration provided in Figure 1. In the figure, we use the language of graphs to visualize a matrix $W$. We can treat $W$ as a weighted adjacency matrix: a non-zero entry $W[i, j]$ implies a link from node $j$ to node $i$. Hence, $W$ is also commonly referred to as the mixing matrix. Suppose $S_{r} = \{1, 3\}$; then the matrices $R(S_{r})$ and $A(S_{r})$ correspond to the leftmost and second leftmost matrices and graphs depicted in the figure, respectively.

Figure 1: Graph representation of the communication pattern among 5 nodes and the possible corresponding stochastic matrices. For clarity, self-loops are not drawn. If node 0 is treated as the server and nodes 1 to 4 as clients, the leftmost graph is a typical pull-model step in which clients 1 and 3 participate; the second graph from the left depicts the model-averaging step in FedAvg; the third graph has the same topology but uses a column-stochastic matrix, which is uncommon in the FL literature; the last one is a typical (symmetric) doubly stochastic matrix as used in decentralized optimization algorithms.

The weights are selected to ensure the resulting matrix is a stochastic matrix. Specifically, a matrix $W$ is called row stochastic if $W 1 = 1$, where $1$ is the all-one vector; it is called column stochastic if $1^{⊤} W = 1^{⊤}$; and it is doubly stochastic if it satisfies both the row and column stochastic properties [Horn and Johnson, 2012, Meyer, 2023]. It is straightforward to verify that the above two matrices are both row stochastic. Analogously, for the participation set, we can define a corresponding column-stochastic matrix $C(S_{r})$ and a doubly stochastic matrix $W(S_{r})$:

$$C(S_{r})[i, j] = \begin{cases} 1 & \text{if } j \in S_{r} \text{ and } i = 0 \\ 1 & \text{if } j \notin S_{r} \text{ and } i = j \\ 0 & \text{otherwise} \end{cases}, \qquad W(S_{r})[i, j] = \begin{cases} 1 / |S_{r}| & \text{if } i \in S_{r} \text{ and } j = 0 \\ 1 / |S_{r}| & \text{if } j \in S_{r} \text{ and } i = 0 \\ 1 - \sum_{l \neq j} W[l, j] & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

Suppose $S_{r} = \{1, 3\}$; then the matrices $W(S_{r})$ and $C(S_{r})$ correspond to the rightmost and second rightmost ones depicted in the figure. These four matrices play a critical role in the algorithm design and convergence proofs that follow. In contrast to decentralized algorithms, where assumptions are imposed directly on the mixing matrix, we do not make any assumption about these matrices in this paper, since we use them only to model the client participation process. For completeness, a brief review of stochastic matrices and their properties is provided in Appendix C.
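As an illustration, the four matrices can be built for a given participation set and their stochasticity checked numerically. The function names below are hypothetical, and the averaging matrix follows the pattern of the toy $W_{avg}$ above, in which the server row averages the sampled clients.

```python
import numpy as np

def assign_matrix(S, N):
    """Row-stochastic R(S): each sampled client copies the server (node 0)."""
    R = np.eye(N + 1)
    for i in S:
        R[i, i] = 0.0
        R[i, 0] = 1.0
    return R

def average_matrix(S, N):
    """Row-stochastic A(S): the server averages the sampled clients."""
    A = np.eye(N + 1)
    A[0, 0] = 0.0
    for j in S:
        A[0, j] = 1.0 / len(S)
    return A

def push_matrix(S, N):
    """Column-stochastic C(S), the transpose of R(S)."""
    return assign_matrix(S, N).T

def doubly_stochastic_matrix(S, N):
    """Symmetric W(S): server and sampled clients exchange weight 1/|S|."""
    W = np.zeros((N + 1, N + 1))
    for i in S:
        W[i, 0] = W[0, i] = 1.0 / len(S)
    # Diagonal entries absorb whatever mass the off-diagonal entries leave.
    W[np.diag_indices(N + 1)] = 1.0 - W.sum(axis=0)
    return W

S, N = [1, 3], 4
R, A = assign_matrix(S, N), average_matrix(S, N)
C, W = push_matrix(S, N), doubly_stochastic_matrix(S, N)
print(R.sum(axis=1), A.sum(axis=1))   # row sums of R and A
print(C.sum(axis=0))                  # column sums of C
print(W.sum(axis=0), W.sum(axis=1))   # both sums for W
```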

3.1. Arbitrary Client Participation Modeling

This subsection focuses on how the arbitrary client participation set $S_{r}$ is generated. Inspired by Wang and Ji [2022], we model arbitrary client participation by the following assumption.

Assumption 1 (Arbitrary Client Participation). In each communication round, the participation of the $i$-th worker is indicated by the event $I_{i}$, which occurs with an unknown probability $p_{i} \in (0,1]$. $I_{i} = 1$ indicates that the $i$-th worker is activated, while $I_{i} = 0$ indicates that it is not. The corresponding averaging weights are denoted by $q_{i}$, where $q_{i} = E [I_{i} / (\sum_{j = 1}^{N} I_{j})]$.

Assumption 1 is a general one covering multiple cases:

Case 1: Full Client Participation. This is simply as $p_{i} \equiv 1$ and $q_{i} \equiv \frac{1}{N}$ for all client indices $i$ .

Case 2: Active Arbitrary Participation. Each client $i$ independently decides whether to participate in the communication round. The indicator $I_{i}$ follows a Bernoulli distribution with parameter $p_{i} \in (0,1]$. (Note that $\sum_{i} p_{i}$ need not equal 1.) If the ${\{p_{i}\}}_{i = 1}^{N}$ are close to each other, then $q_{i} \approx p_{i} / (\sum_{j} p_{j})$.

Case 3: Passive Arbitrary Participation. The server randomly samples $m$ clients in each round. Clients are selected without replacement according to the categorical distribution with normalized weights $q_{1}, q_{2}, \dots, q_{N}$, where $\sum_{i} q_{i} = 1$ and $q_{i} > 0$. In this case $p_{i}$ does not have a simple closed form. If clients are instead sampled with replacement, then $p_{i} = 1 - {(1 - q_{i})}^{m}$.

Case 3a: Uniform Sampling. This is a special case of Case 3, where $p_{i} \equiv m / N$ and $q_{i} \equiv 1 / N$.
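The quantities $p_{i}$ and $q_{i}$ in Assumption 1 can be estimated by simulation. The sketch below does this for the independent Bernoulli model of Case 2; the probabilities are made-up values, and rounds with no participant are skipped when estimating $q_{i}$, which is a convention assumed here because the ratio $I_{i} / \sum_{j} I_{j}$ is undefined in that event.

```python
import numpy as np

def estimate_participation(p, rounds=20000, seed=0):
    """Monte Carlo estimates of p_i = P(I_i = 1) and q_i = E[I_i / sum_j I_j]."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    I = rng.random((rounds, p.size)) < p       # one row of indicators per round
    counts = I.sum(axis=1, keepdims=True)      # number of participants per round
    keep = counts[:, 0] > 0                    # assumed convention: skip empty rounds
    p_hat = I.mean(axis=0)
    q_hat = (I[keep] / counts[keep]).mean(axis=0)
    return p_hat, q_hat

p = np.array([0.2, 0.5, 0.9, 1.0])             # hypothetical participation rates
p_hat, q_hat = estimate_participation(p)
print("p_hat:", p_hat)
print("q_hat:", q_hat, "sum:", q_hat.sum())
print("p / sum(p):", p / p.sum())              # the Case 2 approximation of q
```

Because the per-round weights $I_{i} / \sum_{j} I_{j}$ sum to one in every non-empty round, the estimated $q_{i}$ always sum to one, whereas the $p_{i}$ do not.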

Passive arbitrary participation is often referred to as arbitrary client sampling. We also use “sampling” and “client participation” interchangeably throughout this paper. Now, considering that $S_{r}$ is generated according to Assumption 1, it can be readily verified that the corresponding assigning matrix and averaging matrix possess the following property:

$$\overline{R} = \mathbb{E}\, R(S_{r}) = \begin{bmatrix} 1 & 0 & \dots & 0 \\ q_{1} & 1 - q_{1} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ q_{N} & 0 & \dots & 1 - q_{N} \end{bmatrix}, \qquad \overline{A} = \mathbb{E}\, A(S_{r}) = \begin{bmatrix} 0 & q_{1} & \dots & q_{N} \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{bmatrix} \tag{5}$$

The expected column-stochastic matrix is $\overline{C} = E [C (S_{r})] = {\overline{R}}^{⊤}$ by definition. While doubly stochastic matrices are prevalent in the decentralized literature, they are not often applicable to FL-style algorithms, and therefore we do not discuss them further. With this approach, we effectively transform the problem of arbitrary client participation probabilities into an analysis of the matrix properties of $R (S_{r})$, $A (S_{r})$, and $C (S_{r})$, which we exploit in the subsequent section.

4. From Interpretation to Correction: A New Federated Optimization with Exact Convergence via Push-pull Strategy (FOCUS)

In this section, we demonstrate how leveraging the graph and stochastic matrix can facilitate the development of more powerful FL algorithms.

4.1. Interpret FedAvg as Decentralized Algorithm with Time-Varying Graphs

A direct application of the above mixing matrix is that we can concisely represent the FL algorithm in vector-matrix form, similar to decentralized algorithms [Li et al., 2020, Koloskova et al., 2020].

First, let $x_{k} = vstack [x_{k, 0}; x_{k, 1}; \dots; x_{k, N}] \in R^{(N + 1) \times d}$ denote the state at iteration $k$. This matrix is formed by vertically stacking the server’s model parameters $x_{k, 0} \in R^{d}$ and the model parameters $x_{k, i} \in R^{d}$ from the $N$ workers. Similarly, let $\nabla f (x_{k}) = vstack [0; \nabla f_{1} (x_{k, 1}); \dots; \nabla f_{N} (x_{k, N})] \in R^{(N + 1) \times d}$ represent the corresponding stacked matrix of local gradients at iteration $k$. Note that the first component of the stacked gradient is $0$ because the server holds no data. This implies that the server’s local loss function is identically zero, $f_{0} (x) \equiv 0$, and consequently $\nabla f_{0} (x_{k, 0}) = 0$. This also ensures that including the server’s term $f_{0}$ does not alter the original loss function defined in (1).

Next, observe that during the local update phase of FedAvg, nodes compute updates independently without communication. In terms of stochastic matrices, this corresponds to using the identity matrix $I$. To represent the algorithm with a single iteration index $k$, we map the $t$-th local update in the $r$-th communication round to the $k$-th iteration, where $k = r τ + t$. Using the tools previously introduced, we can now reformulate FedAvg (2a)–(2c) as the following one-index iterative form:

where the time-varying matrices $R_{k}, A_{k}$ , and $D_{k}$ are defined as

$$R_{k} = \begin{cases} R(S_{r}) & k = r τ + 1 \\ I & \text{otherwise} \end{cases}, \qquad A_{k} = \begin{cases} A(S_{r}) & k = (r + 1) τ \\ I & \text{otherwise} \end{cases}, \qquad D_{k}[i, j] = \begin{cases} 1 & \text{if } i = j \in S_{r} \\ 0 & \text{otherwise} \end{cases}. \tag{7}$$

The diagonal matrix $D_{k}$ deactivates the non-participating workers and the server during local updates. $S_{r}$ is the set of participating clients’ indices at round $r$, which is determined by the iteration $k$ through $r τ \leq k < (r + 1) τ$. An illustration of this process using graphs is shown in Figure 2.

Figure 2: Representing FedAvg using graphs. A dashed line means no communication.

Mixing Matrices in FedAvg. It is feasible to further condense (6a)–(6c) into a single-line form

$$x_{k + 1} = W_{k} (x_{k} - η \nabla f (x_{k})) \tag{8}$$

The specific selection of $W_{k}$ is detailed in the Appendix. But $W_{k}$ cannot be a doubly stochastic matrix unless it is a full client participation case. Consequently, the theorem presented in [Koloskova et al., 2020] is not directly applicable to FedAvg in this context.

Convergence Result of FedAvg with Arbitrary Participation. In Appendix D, we provide a new proof for FedAvg under the arbitrary participation scenario through this decentralized optimization formulation. As $k \to \infty$, the iterates of FedAvg settle in an irreducible neighborhood of the solution whose size depends on the number of local update steps $τ$, the data heterogeneity $σ_{g}^{2}$, and the extra bias $δ_{q}^{2}$ introduced by non-uniform participation probabilities. This motivates us to develop a new FL algorithm capable of eliminating all of the aforementioned errors and biases.

4.2. FOCUS Corrects Arbitrary Client Participation and Local-Update Bias

4.2.1. Push-Pull Strategy for FL Settings

To eliminate the biases introduced by arbitrary client participation, we move beyond heuristic designs and adopt a formal optimization framework. This involves reformulating the FL problem as a constrained optimization task, a structure commonly employed in decentralized algorithms:

$$\min_{\{x_{0}, x_{1}, \dots, x_{N}\}} F(x) = \frac{1}{N} \sum_{i = 0}^{N} f_{i}(x_{i}) \tag{9}$$

$$\text{s.t.} \quad R(S_{r})\, x = x, \quad \forall S_{r} \tag{10}$$

Note a minor but critical difference from the formulation (1) is that there are $N + 1$ model parameters $x_{i}$ applied in each local cost function $f_{i}$ instead of a single $x$ . To see the equivalence between this and (1), notice $R (S_{r}) x = x$ implies $x_{i} = x_{0}, \forall i \in S_{r}$ . Consequently, if the union of all sampled client sets $\{S_{r}\}$ covers the entire client population, then all individual client models and the server model are constrained to converge to the same state.

This formulation motivated us to explore a primal-dual approach to solve this constrained problem. Among the various primal-dual-based decentralized algorithms, the push-pull algorithm aligns particularly well with the FL setting. It is characterized by the following formulation:

$$x_{k + 1} = R (x_{k} - η_{k} y_{k}) \tag{11}$$

$$y_{k + 1} = C (y_{k} + \nabla f (x_{k + 1}) - \nabla f (x_{k})), \tag{12}$$

where $y_{0} = \nabla f (x_{0})$ , and $R$ and $C$ represent row-stochastic and column-stochastic matrices, respectively. The algorithm name “push-pull” arises from the intuitive interpretation of these matrices. The row-stochastic matrix $R$ can be interpreted as governing the “pull” operation, where each node aggregates information from its neighbors. Conversely, the column-stochastic matrix $C$ governs the “push” operation, where each node disseminates its local gradient information to its neighbors. Moreover, recalling the definition of row and column stochastic matrices $R 1 = 1$ and $1^{⊤} C = 1^{⊤}$ , push-pull algorithm has the following interesting properties:

$$x^{⋆} = R x^{⋆} \quad \text{(consensus property)}$$

$$1^{⊤} y_{k} = 1^{⊤} \nabla f (x_{k}), \quad \forall k \quad \text{(tracking property)}$$

where $x^{⋆}$ is the fixed point of the algorithm under some mild conditions on the static graph $R$ and $C$ . The first property, consensus, implies that all workers’ model parameters eventually converge to a common value. The second property, tracking, indicates that the sum of the variables $y$ (aggregated across workers) approximates the global gradient, ensuring the algorithm’s iterates move in a direction that minimizes the global loss function. It is worth pointing out that when consensus is reached such that all relevant local models in $x_{k}$ equal some $\overline{x}$ , the sum of the local gradients $1^{⊤} \nabla f (x_{k})$ becomes exactly $N \nabla F (\overline{x})$ . For more details, we refer the readers to Pu et al. [2020], Xin and Khan [2018].
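The tracking property can be observed numerically. The sketch below runs the static push-pull recursion (11)–(12) with a randomly generated row-stochastic $R$ and column-stochastic $C$ and records the largest violation of $1^{⊤} y_{k} = 1^{⊤} \nabla f (x_{k})$. The quadratic losses $f_{i}(x_{i}) = \frac{1}{2} \|x_{i} - c_{i}\|^{2}$, the dense random graphs, and the step size are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 4, 3, 0.05

# Random dense mixing matrices: rows of R sum to 1, columns of C sum to 1.
R = rng.random((n, n))
R /= R.sum(axis=1, keepdims=True)
C = rng.random((n, n))
C /= C.sum(axis=0, keepdims=True)

centers = rng.normal(size=(n, d))

def grad(x):
    """Stacked gradients of the assumed losses f_i(x_i) = 0.5 * ||x_i - c_i||^2."""
    return x - centers

x = rng.normal(size=(n, d))
y = grad(x)                                    # initialization y_0 = grad f(x_0)
gaps = []
for _ in range(50):
    x_next = R @ (x - eta * y)                 # pull step, cf. (11)
    y = C @ (y + grad(x_next) - grad(x))       # push step, cf. (12)
    x = x_next
    # Column sums of y should match column sums of the stacked gradient.
    gaps.append(np.abs(y.sum(axis=0) - grad(x).sum(axis=0)).max())

print("largest tracking violation:", max(gaps))
```

The property follows from $1^{⊤} C = 1^{⊤}$ alone, so it holds at every iteration regardless of whether the iterates have converged.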

4.2.

We are interested in solving the optimization problem (9)–(10), which has multiple constraints, and the original push-pull algorithm is not sufficient for this purpose. Analogous to the approach taken for FedAvg, we extend it with time-varying matrices $R_{k}$ and $C_{k}$ to model the client sampling and local update processes. These modifications lead to the following algorithmic formulation:

$$x_{k + 1} = R_{k} (x_{k} - η D_{k} y_{k}) \tag{13}$$

$$y_{k + 1} = C_{k} (y_{k} + \nabla f (x_{k + 1}) - \nabla f (x_{k})), \tag{14}$$

where $R_{k}$ is defined as in FedAvg and $C_{k} = R_{k}^{⊤}$, while $D_{k}$ differs slightly from (7) in the server’s entry: $D_{k} [0,0] = 1$ if $k = r τ + 1$ and 0 otherwise. The graph representation of this algorithm is shown in Figure 3.

Figure 3: Illustration of our new FOCUS algorithm. There are two key differences from FedAvg-style algorithms. First, it pulls the model variable $x$ but pushes the gradient variable $y$. Second, the push matrix is column stochastic instead of row stochastic.

4.2.2. Convert Vector-Matrix Form Back To FL-Style Algorithm

Substituting the definitions of the mixing matrices into (13) and (14), we obtain, after some non-trivial transformations, the concrete FL algorithm listed in Algorithm 1. The steps that establish this new FL algorithm effectively reverse the process outlined in the previous subsection. Here we provide a few key steps. First, it is straightforward to verify that $x_{k, i}$ and $y_{k, i}$ do not change if client $i$ is not participating in the corresponding round, so we ignore them in the following derivation. At the beginning of the $r$-th round, i.e., $k = r τ + 1$, (13) becomes

$$x_{k + 1, 0} = x_{k, 0} - η y_{k, 0} \quad \text{(server updates)} \tag{15}$$

$$x_{k + 1, i} \Leftarrow x_{k + 1, 0}, \quad \forall i \in S_{r} \quad \text{(client pulls model)} \tag{16}$$

While at the end of the $r$ -th round, i.e. $k = (r + 1) τ$ , (14) becomes

$$y_{k + 1, i}^{'} = y_{k, i} + \nabla f_{i} (x_{k + 1, i}) - \nabla f_{i} (x_{k, i}), \quad \forall i \in S_{r} \tag{17}$$

$$y_{k + 1, 0} \Leftarrow y_{k, 0} + \sum_{i \in S_{r}} y_{k + 1, i}^{'} \quad \text{(server collects info)} \tag{18}$$

$$y_{k + 1, i} \Leftarrow 0, \quad \forall i \in S_{r} \quad \text{(client resets } y_{k} \text{)} \tag{19}$$

Note that we introduce a temporary variable $y_{k + 1, i}^{'}$ because the matrix multiplication $C_{k}$ is applied on the updated value $y_{k}^{'}$ instead of $y_{k}$ directly. During local updates, the server does not update the value while the client executes the local update in the gradient tracking style:

$$x_{k + 1, i} = x_{k, i} - η y_{k, i} \tag{20}$$

$$y_{k + 1, i} = y_{k, i} + \nabla f_{i} (x_{k + 1, i}) - \nabla f_{i} (x_{k, i}) \tag{21}$$

Next, we revert to the standard two-level indexing used in FL by mapping the single iteration index $k = r τ + t$ to the $r$-th round and the $t$-th local update step, and by replacing $x_{k + 1, i}$ with $x_{t, i}^{(r)}$.

Finally, assembling all the above equations together and switching the order of $x$ and $y$ , we arrive at the FOCUS shown in Algorithm 1. Because of the switched order, at the beginning of each round, i.e. $k = r τ$ , the $y$ -update becomes

$$y_{1, i}^{(r)} = y_{0, i}^{(r)} + \nabla f_{i} (x_{0, i}^{(r)}) - \nabla f_{i} (x_{- 1, i}^{(r)}) \tag{22}$$

Note that in the original update rule (21), the gradient $\nabla f_{i} (x_{k, i})$ is computed in the preceding step and then reused, thus avoiding redundant computation at the current step. This principle of gradient reuse carries over directly to the two-level index notation. Recall that $x_{t, i}^{(r)}$ will not change if the worker $i$ does not participate. Hence, we can establish, by induction, that $\nabla f_{i} (x_{- 1, i}^{(r)})$ corresponds to the stored gradient from the end of the most recent round in which the worker participated.
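The following is a minimal sketch of the FL-style updates (15)–(21) as we read them, not a reproduction of Algorithm 1. The ordering of the steps, the zero initialization of the server's $y$ and of each client's stored gradient, the diagonal quadratic losses, and all constants are assumptions made for illustration. Each client keeps the last gradient it evaluated (`stored[i]`), which plays the role of $\nabla f_{i} (x_{- 1, i}^{(r)})$ in (22).

```python
import numpy as np

def focus(grads, p, dim, eta=0.01, tau=5, rounds=2000, seed=0):
    """Sketch of FOCUS with independent Bernoulli(p_i) client participation."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    x_server = np.zeros(dim)
    y_server = np.zeros(dim)
    stored = np.zeros((n, dim))        # last gradient evaluated by each client
    for _ in range(rounds):
        sampled = np.flatnonzero(rng.random(n) < p)
        x_server = x_server - eta * y_server   # (15) server update
        for i in sampled:
            x = x_server.copy()                # (16) client pulls the model
            y = np.zeros(dim)                  # client y was reset after its last push
            for _ in range(tau):
                g = grads[i](x)
                y = y + g - stored[i]          # tracking update with gradient reuse
                stored[i] = g
                x = x - eta * y                # local model update
            y_server = y_server + y            # (18) push y to the server
    return x_server

# Hypothetical heterogeneous losses f_i(x) = 0.5 * sum_j a_ij * (x_j - c_ij)^2.
rng = np.random.default_rng(1)
n, dim = 5, 3
a = rng.uniform(1.0, 2.0, size=(n, dim))
c = rng.normal(size=(n, dim))
grads = [lambda x, i=i: a[i] * (x - c[i]) for i in range(n)]
p = np.array([0.3, 0.5, 0.7, 0.9, 1.0])        # unknown to the algorithm itself

x_star = (a * c).sum(axis=0) / a.sum(axis=0)   # minimizer of the summed loss
x_hat = focus(grads, p, dim)
err = np.linalg.norm(x_hat - x_star)
print("distance to the minimizer:", err)
```

In this sketch the participation probabilities are used only to simulate which clients show up; the update rules never read them, and the learning rate stays constant throughout.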

5. Performance Analysis

Now we are ready to present the necessary assumptions and convergence property for FOCUS. Due to limited space, all proofs are deferred to Appendix E.

Assumption 2 ( $L$ -Smoothness). All local cost functions $f_{i}$ are $L$ -smooth, i.e., $f_{i} (x) \leq f_{i} (y) + ⟨x - y, \nabla f_{i} (y)⟩ + \frac{L}{2} ‖ x - y ‖^{2}$ .

Assumption 3 ( $μ$ -Strong Convexity). All local cost functions $f_{i}$ are $μ$ -strongly convex, that is, $f_{i} (x) \geq f_{i} (y) + ⟨x - y, \nabla f_{i} (y)⟩ + \frac{μ}{2} ‖ x - y ‖^{2}$ .

Assumption 4 (PL Condition). The global loss function $F$ satisfies the Polyak-Lojasiewicz condition $‖ \nabla F (x) ‖^{2} \geq 2 β (F (x) - F^{⋆}), \forall x$ , where $β > 0$ and $F^{⋆}$ is the optimal function value.
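For context, the PL condition is weaker than strong convexity. The short derivation below is a standard argument rather than part of the paper's proofs. If every $f_{i}$ satisfies Assumption 3, then so does the average $F$, and minimizing both sides of the strong convexity inequality over $x$ gives Assumption 4 with $β = μ$:

$$F(x) \geq F(y) + ⟨x - y, \nabla F(y)⟩ + \frac{μ}{2} \|x - y\|^{2} \quad \forall x, y$$

$$\min_{x} \Big\{ F(y) + ⟨x - y, \nabla F(y)⟩ + \frac{μ}{2} \|x - y\|^{2} \Big\} = F(y) - \frac{1}{2 μ} \|\nabla F(y)\|^{2} \quad \text{(attained at } x = y - \tfrac{1}{μ} \nabla F(y) \text{)}$$

$$F^{⋆} \geq F(y) - \frac{1}{2 μ} \|\nabla F(y)\|^{2} \quad \Longrightarrow \quad \|\nabla F(y)\|^{2} \geq 2 μ (F(y) - F^{⋆})$$

The converse does not hold in general, which is why the PL case covers some non-convex objectives.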

Theorem 1. Under arbitrary participation assumption 1 and $L$ –Smoothness assumption 2, it can be proved that FOCUS converges at the following rates with various extra assumptions on $f_{i}$ :

$μ$ –Strongly Convex: Under extra assumption 3, if $η \leq min \{\frac{3 μ}{27 N L^{2}}, \frac{1}{3 L (τ - 1)}, \frac{q_{min}^{3 / 2}}{8 L \sqrt{N}}\}$ ,
$E {‖{\overline{x}}_{R τ + 1} - x^{⋆}‖}^{2} \leq Ψ_{R} \leq (1 - η μ N / 2)^{R} Ψ_{0}$ (23)
$β$ -PL Condition: Under extra assumption 4, if $η \leq min \{\frac{3 q_{min}}{32 N}, \frac{q_{min}}{12 β N}, \frac{q_{min}}{16 L^{2}}, \frac{q_{min}^{3 / 2}}{8 L \sqrt{N}}\}$ ,
$E F ({\overline{x}}_{R τ + 1}) - F^{⋆} \leq Φ_{R} \leq (1 - η β N)^{R} Φ_{0}$ (24)
General Nonconvex: Under no extra assumption, if $η \leq min \{\frac{1}{2 L (τ - 1)}, \frac{q_{min}^{3 / 2}}{8 L \sqrt{N}}, \frac{q_{min}}{16 L \sqrt{2 N}}, \frac{1}{4 L N}\}$ ,
$\frac{1}{R} \sum_{r = 0}^{R - 1} E {‖\nabla f (x_{r τ + 1})‖}^{2} \leq \frac{8 (f (x_{1}) - f^{⋆})}{η N R},$ (25)

where the Lyapunov functions $Ψ_{r} : = E {‖{\overline{x}}_{r τ + 1} - x^{⋆}‖}^{2} + (1 - 8 η τ L N) E {‖1 {\overline{x}}_{(r - 1) τ + 1} - x_{r τ}‖}_{F}^{2}, Φ_{r} = E F ({\overline{x}}_{r τ + 1}) - F^{⋆} + (1 - 4 η L^{2}) E {‖1 {\overline{x}}_{r τ + 1} - x_{r τ}‖}^{2}$ and $q_{min} = {min}_{i} q_{i}$ . □

Remark. The first two error bounds decay exponentially, which implies an iteration complexity of $𝒪 (log (1 / ϵ))$. For the general non-convex case, we improve the typical $1 / \sqrt{R}$ rate to $1 / R$ thanks to the exact convergence property. Table 1 compares our proposed algorithm with other common FL algorithms. Recall that $𝒪 (1 / ϵ^{2}) > 𝒪 (1 / ϵ) ≫ 𝒪 (log (1 / ϵ))$ in terms of communication and computation complexity. Table 1 shows that FOCUS achieves the fastest convergence rate in all scenarios without any particular sampling or heterogeneous gradient assumption.
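To see where the $𝒪 (log (1 / ϵ))$ complexity comes from, a bound of the form $(1 - ρ)^{R} Ψ_{0} \leq ϵ$ can be solved for $R$ directly. The helper below does this; the function name and the example contraction factor $ρ = 0.01$ are illustrative assumptions, not constants from Theorem 1.

```python
import math

def rounds_for_accuracy(rho, eps, psi0=1.0):
    """Smallest integer R with (1 - rho)**R * psi0 <= eps, for 0 < rho < 1."""
    if eps >= psi0:
        return 0
    return math.ceil(math.log(eps / psi0) / math.log(1.0 - rho))

# Tightening the target by a factor of 1000 adds a fixed number of rounds.
needed_1e3 = rounds_for_accuracy(rho=0.01, eps=1e-3)
needed_1e6 = rounds_for_accuracy(rho=0.01, eps=1e-6)
needed_1e9 = rounds_for_accuracy(rho=0.01, eps=1e-9)
print(needed_1e3, needed_1e6, needed_1e9)
```

The number of rounds grows linearly in $log (1 / ϵ)$, in contrast to the polynomial growth in $1 / ϵ$ of a sublinear rate.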

Table 1: Comparison of multiple algorithms.

Algorithm	Exact Converg.¹	Strongly-Convex Complexity²	Non-Convex Complexity	Participation⁵	Hetero. Grad.⁵	Extra Comment
FedAvg [Li et al., 2020]	✘	$O (\frac{1}{ϵ})$	$O (\frac{1}{ϵ^{2}})$	Uniform	Bounded	Bounded gradient assumption
LocalSGD [Koloskova et al., 2020]	✘	$O (\frac{1}{\sqrt{ϵ}})$	$O (\frac{1}{ϵ^{3 / 2}})$	Uniform	Bounded	Doubly stochastic matrix
FedAU [Wang and Ji, 2024]	✘	–	$O (\frac{1}{ϵ^{2}})$	Arbitrary	Bounded	Bounded global gradient
FedAWE [Xiang et al., 2024]	✘	–	$O (\frac{1}{ϵ^{2}})$	Arbitrary	Bounded	Doubly stochastic matrix
SCAFFOLD [Karimireddy et al., 2020]	✘ ³	$O (log (\frac{1}{ϵ}))$	$O (\frac{1}{ϵ})$	Uniform	None	Comm. $2 d$ vector per round⁶
ProxSkip/ScaffNew [Mishchenko et al., 2022]	✘	$O (log (\frac{1}{ϵ}))$	–	Full	None	Comm. $2 d$ vector per round
MIFA [Gu et al., 2021]	✘	–	$O (\frac{1}{ϵ^{2}})$	Arbitrary	Bounded	Bounded delay assumption; server stores each client model
FOCUS (This paper)	✔	$O (log (\frac{1}{ϵ}))$	$O (log (\frac{1}{ϵ}))$ ⁴	Arbitrary	None	No need to learn participation probabilities

¹ Exact convergence refers to the algorithm’s ability to converge to the exact solution under arbitrary sampling, without requiring a decaying learning rate.

² Complexity refers to the number of iterations required for the algorithm to achieve an error within $ϵ$ of the optimal solution. We have removed the impact of the stochastic gradient variance in all rates.

³ There is no convergence proof of SCAFFOLD under the arbitrary client participation scenario. Empirically, we observed that it may be possible.

⁴ This rate is established under the PL condition.

⁵ The participation and heterogeneous gradient columns are assumptions. Arbitrary participation refers to Assumption 1, and bounded heterogeneous gradient refers to the assumption that $‖\nabla f_{i} (x) - \nabla F (x)‖ \leq σ_{G}$.

⁶ It is possible to reduce the uplink communication to $d$, while the downlink communication remains $2 d$ [Huang et al., 2024].

Numerical Validation. To validate our claims, we conducted a numerical experiment on synthetic data, which is the common approach for verifying the exact convergence property. The results, presented in Figure 4, were obtained by applying the algorithms to a simple ridge regression problem with the parameters $d = 100, N = 16, K = 100, λ = 0.01$, and $τ = 5$. The loss function is $F (x) = \frac{1}{N} \sum_{i = 1}^{N} \sum_{k = 1}^{K} {‖a_{i, k}^{⊤} x - b_{i, k}‖}^{2} + λ ‖ x ‖^{2}$. All algorithms employed the same learning rate, $η = 2 \times 10^{-4}$. Three distinct sampling scenarios were examined: full client participation, uniform participation, and arbitrary participation. Notably, FOCUS exhibits linear convergence and outperforms the other algorithms in all scenarios, particularly under arbitrary participation.
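A problem of this shape can be set up as follows. This is a sketch under stated assumptions: the section does not say how $a_{i, k}$ and $b_{i, k}$ were generated, so standard Gaussian data is used as a placeholder, and the regularizer is read as sitting outside the average over clients, as the formula is written. The closed-form minimizer gives the reference point against which exact convergence can be measured.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K, lam = 100, 16, 100, 0.01

# Placeholder synthetic data (assumed): A[i] stacks the rows a_{i,k}^T of client i.
A = rng.normal(size=(N, K, d))
b = rng.normal(size=(N, K))

def loss(x):
    """F(x) = (1/N) sum_i sum_k (a_{i,k}^T x - b_{i,k})^2 + lam * ||x||^2."""
    residual = A @ x - b                       # shape (N, K)
    return np.sum(residual ** 2) / N + lam * np.dot(x, x)

def grad(x):
    """Gradient of F at x."""
    residual = A @ x - b
    return 2.0 * np.einsum("nkd,nk->d", A, residual) / N + 2.0 * lam * x

# Setting the gradient to zero yields a linear system for the minimizer.
H = np.einsum("nkd,nke->de", A, A) / N + lam * np.eye(d)
rhs = np.einsum("nkd,nk->d", A, b) / N
x_star = np.linalg.solve(H, rhs)

grad_norm = np.linalg.norm(grad(x_star))
print("gradient norm at x_star:", grad_norm)
```

An algorithm converges exactly in the sense of Table 1 if its distance to `x_star` keeps shrinking under a constant learning rate instead of plateauing.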

Figure 4: Convergence performance comparison of various FL algorithms. Under full client participation, FedAvg, FedAU, and MIFA exhibit identical performance, as do SCAFFOLD and ProxSkip, due to their theoretical equivalence in this setting. FedAvg and FedAU fail to converge to the optimal solution across all scenarios because their inherent error and bias cannot be eliminated using a fixed learning rate. ProxSkip diverges under uniform and arbitrary participation, as it is not designed for these conditions. We do not yet understand why MIFA diverges in this experiment, although it works in ML applications. While SCAFFOLD converges in all cases, our proposed algorithm, `FOCUS`, demonstrates faster convergence, especially under arbitrary participation.

5.1. Why Can `FOCUS` Converge Exactly for Arbitrary Participation Probabilities?

At first glance, the ability of FOCUS to achieve exact convergence under arbitrary client sampling probabilities may appear counterintuitive. Unlike other approaches, FOCUS neither requires knowledge of the specific participation probabilities nor needs to learn them adaptively. The sole prerequisite for convergence is that each client maintains a non-zero probability of participation. Moreover, the push-pull algorithm was never designed to solve the arbitrary sampling problem.

From an algorithmic perspective, FOCUS closely resembles a delayed/asynchronous gradient descent algorithm, even though it is derived from a push-pull algorithm to fit the FL scenario. To see this, by leveraging the tracking property of the variable $y_{k}$ and the special construction of the matrix $C_{k}$ , we can establish that the server’s $y_{r + 1} = \sum_{i = 1}^{N} \nabla f_{i} (x_{k + 1, i})$ . Due to arbitrary client participation, at iteration $k$ , the variable $x_{k + 1, i}$ may hold an old version of the server’s model if client $i$ does not participate. Thus, we arrive at an insightful conclusion: FOCUS effectively transforms arbitrary participation probabilities into an arbitrary delay in gradient updates. Hence, any client participation scheme, as long as each client participates with a non-zero probability, will still guarantee exact convergence.
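The delayed-gradient interpretation can be illustrated with a toy simulation. The sketch below is not the FOCUS algorithm: it is a simplified model, assuming one local step per round, least-squares local losses, and clients that remember their last reported gradient and send only the change. The server then keeps a single running sum `y` that always equals the sum of the clients' most recent, possibly stale, gradients. All names, the data, the participation probabilities `p`, and the conservative step size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, d = 8, 10, 5                       # clients, samples per client, dimension
A = rng.standard_normal((N, m, d))
b = rng.standard_normal((N, m))

def grad(i, x):
    """Gradient of the local loss f_i(x) = 0.5 * ||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i])

# Exact minimizer of sum_i f_i, used only to measure the error.
H = np.einsum("imd,ime->de", A, A)
x_star = np.linalg.solve(H, np.einsum("imd,im->d", A, b))
eta = 0.02 / np.linalg.eigvalsh(H).max()  # fixed, deliberately small step size

# Unknown-to-the-server, non-uniform, strictly positive participation probabilities.
p = rng.uniform(0.3, 1.0, size=N)

x = np.zeros(d)
last_grad = np.stack([grad(i, x) for i in range(N)])  # each client's stored gradient
y = last_grad.sum(axis=0)                             # server-side tracking variable

for _ in range(5000):
    active = rng.random(N) < p
    for i in np.flatnonzero(active):
        g_new = grad(i, x)
        y += g_new - last_grad[i]   # tracking update: only the difference is sent
        last_grad[i] = g_new
    x = x - eta * y                 # descent step using partly stale gradients

final_error = np.linalg.norm(x - x_star)
```

In this toy model the participation probabilities never enter the update; they only determine how stale each stored gradient is. At $x^{⋆}$ every stored gradient eventually refreshes to $\nabla f_{i} (x^{⋆})$ and their sum vanishes, so a fixed learning rate leaves no residual bias.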

5.2. Extension to Stochastic Gradients and ML Applications

In practical machine learning scenarios, computing full gradients is often computationally prohibitive. Therefore, stochastic gradient methods are commonly employed. Our proposed algorithm can be readily extended to incorporate stochastic gradients, resulting in the variant SG-FOCUS. However, due to space constraints, we focus on the deterministic setting in the main body of this paper. A comprehensive description of SG-FOCUS, along with its convergence analysis, is provided in Appendix F. The appendix also benchmarks SG-FOCUS’s performance on the CIFAR-10 classification task, highlighting its faster convergence and improved accuracy over other FL algorithms. This performance trend echoes that of its deterministic counterpart.

Theorem 2 (Informal Convergence Theorem of SG-FOCUS). Under the arbitrary participation Assumption 1, the $L$-smoothness Assumption 2, and the unbiasedness and bounded-variance assumption on the stochastic gradient (see Assumption 6 in Appendix F), SG-FOCUS converges at the following rates under various extra assumptions on $f_{i}$ :

$μ$-Strongly Convex: Under the extra Assumption 3, for a sufficiently small learning rate $η$ , we have
$Γ_{R} \leq {(1 - \frac{η μ N}{2})}^{R} Γ_{0} + \frac{4 (q_{min} + N^{2})}{μ N q_{min}} η σ^{2},$ (26)

where the Lyapunov function is $Γ_{r} : = E {‖{\overline{x}}_{(r + 1) τ + 1} - x^{⋆}‖}^{2} + (1 - 8 η L^{2} N / μ) E ‖ 1 {\overline{x}}_{(r - 1) τ + 1} - x_{(r - 1) τ} ‖_{F}^{2}$ and $σ^{2}$ is the upper bound on the variance of the stochastic gradient noise.
$β$-PL Condition: Under the extra Assumption 4, for a sufficiently small learning rate $η$ , we have
$Ω_{R} \leq (1 - η β N)^{R} Ω_{0} + (\frac{L}{2} + 32 (τ - 1)^{2} L^{2} + \frac{8}{q_{min}}) \frac{η}{β} σ^{2},$ (27)

where $Ω_{r} : = F ({\overline{x}}_{r τ + 1}) - F^{⋆} + (1 - 4 η L^{2}) E {‖1 {\overline{x}}_{(r - 1) τ + 1} - x_{(r - 1) τ}‖}^{2} .$
General Nonconvex: With no extra assumption, for a sufficiently small learning rate $η$ , we have
$\frac{1}{R} \sum_{r = 0}^{R - 1} E {‖\nabla F ({\overline{x}}_{r τ + 1})‖}^{2} \leq \frac{8 (F ({\overline{x}}_{1}) - F^{⋆})}{η N R} + (2 L N + \frac{8 N^{2}}{(N - 8 L^{2}) q_{min}}) η^{2} σ^{2} .$ (28)

See the formal statement and the proof in Appendix F.
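
A short worked consequence of the strongly convex bound (26) may help in reading it. The shorthand $C$ and the even split of the target accuracy $ϵ$ are introduced here for illustration only and are not part of the formal statement. Write $C : = \frac{4 (q_{min} + N^{2})}{μ N q_{min}}$ and assume $η$ is small enough that $a : = η μ N / 2 \in (0, 1)$ . Using $1 - a \leq e^{- a}$ , the bound (26) gives

$Γ_{R} \leq e^{- η μ N R / 2} Γ_{0} + C η σ^{2} .$

Hence $Γ_{R} \leq ϵ$ is guaranteed whenever both terms are at most $ϵ / 2$ , that is,

$C η σ^{2} \leq \frac{ϵ}{2} \quad \text{and} \quad R \geq \frac{2}{η μ N} \log (\frac{2 Γ_{0}}{ϵ}) .$

When $σ = 0$ the first condition holds for any admissible $η$ , so a fixed learning rate suffices and $R = O (log (\frac{1}{ϵ}))$ rounds are enough, which matches the linear rate reported for the deterministic setting. When $σ > 0$ , the term $C η σ^{2}$ acts as a noise floor that can only be lowered by reducing $η$ .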

Remarks on Linear Speedup. A common expectation in FL theory is a linear speedup in the convergence rate, where the rate improves proportionally with the number of clients, $N$ . Inspecting the general non-convex convergence rate in Theorem 1, we see that the error residual term is $𝒪 (1 / (η N R))$ . Yet the stability of FOCUS requires a learning rate $η$ bounded by $𝒪 (1 / N)$ , so the $N$ dependence cancels out and the residual is at best $𝒪 (1 / R)$ . We want to highlight that this result is expected. Linear speedup typically holds when $N$ is used to average out stochastic noise (as in Stochastic Gradient Descent variants). Since FOCUS is an exact algorithm, it does not introduce this stochasticity, and it is therefore natural that the linear speedup benefit from increasing the number of clients is not reflected in the convergence bound. In contrast, the convergence rate of SG-FOCUS in the general non-convex setting does reflect the benefits of client aggregation. Specifically, by setting the learning rate $η$ to $𝒪 (1 / N)$ , the stochastic variance term diminishes proportionally to $𝒪 (1 / N)$ , which confirms the presence of the linear speedup.

6. Conclusion

This work addresses the critical challenges of arbitrary client participation and client drift in Federated Learning, two factors that prevent traditional algorithms from achieving exact convergence. By introducing a novel framework based on stochastic matrices and time-varying graphs, we model these dynamics and reformulate the FL problem as a constrained optimization task. This principled approach, moving beyond simple heuristics, led to the development of FOCUS, an algorithm derived from the push-pull optimization strategy. Our theoretical analysis and numerical experiments demonstrate that FOCUS can achieve exact linear convergence under any client participation scheme, without needing to know or learn participation probabilities. The extension to a stochastic gradient setting, SG-FOCUS, further validates its practical effectiveness.

Limitations

The arbitrary client participation model used in the proofs of this paper does not cover Markovian participation, i.e., the case where a client's participation probability depends on its participation status in the previous round. We believe that FOCUS still converges exactly in this scenario, since the stochastic matrix modeling and the push-pull strategy hold for any realization. However, extending the proof is non-trivial due to the correlation between the stochastic matrices. We leave this and more general arbitrary participation scenarios for future research.

Future Works.

The framework of stochastic matrices and time-varying graphs provides a novel tool for modeling arbitrary client participation and local update dynamics in FL. By leveraging this approach, we establish a formal connection between FL and the rich field of decentralized optimization. While this paper focused on FedAvg and the Push-Pull algorithm as initial examples, a promising avenue for future research is to adapt other sophisticated decentralized algorithms.

Supplementary Material

NIHMS2161725-supplement-1.pdf (661.4 KB, PDF)

Acknowledgement

The authors gratefully acknowledge Edward Duc Hien Nguyen and Xin Jiang for the discussion that inspired the connection between time-varying graphs and client sampling. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R16GM159671. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Footnotes

In the decentralized optimization literature, it is common to represent parameters $x$ and gradients $\nabla f (x)$ as row vectors (dimension $1 \times d$ ). This allows the graph mixing operation, defined by a matrix $W \in R^{(N + 1) \times (N + 1)}$ , to be concisely expressed as $W x_{k}$ .

The code is available at https://github.com/BichengYing/FedASL. The algorithm was originally named Federated Learning for Arbitrary Sampling and Local Update (FedASL). The acronym ASL also stands for the Adaptation System Laboratory at UCLA and EPFL, where Dr. Ying completed his Ph.D.

39th Conference on Neural Information Processing Systems (NeurIPS 2025).

References

Assran M, Loizou N, Ballas N, and Rabbat M. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pages 344–353. PMLR, 2019.
Beltrán ETM, Pérez MQ, Sánchez PMS, Bernal SL, Bovet G, Pérez MG, Pérez GM, and Celdrán AH. Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges. IEEE Communications Surveys & Tutorials, 2023.
Cattivelli FS, Lopes CG, and Sayed AH. Diffusion recursive least-squares for distributed estimation over adaptive networks. IEEE Transactions on Signal Processing, 56(5):1865–1877, 2008.
Chen J and Sayed AH. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289–4305, 2012.
Defazio A, Bach F, and Lacoste-Julien S. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014.
Fang M, Zhang Z, Hairi, Khanduri P, Liu J, Lu S, Liu Y, and Gong N. Byzantine-robust decentralized federated learning. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 2874–2888, 2024.
Gu X, Huang K, Zhang J, and Huang L. Fast federated learning in the presence of arbitrary device unavailability. Advances in Neural Information Processing Systems, 34:12052–12064, 2021.
Horn RA and Johnson CR. Matrix analysis. Cambridge university press, 2012.
Huang X, Li P, and Li X. Stochastic controlled averaging for federated learning with communication compression. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=jj5ZjZsWJe.
Jhunjhunwala D, Sharma P, Nagarkatti A, and Joshi G. Fedvarp: Tackling the variance due to partial client participation in federated learning. In Uncertainty in Artificial Intelligence, pages 906–916. PMLR, 2022.
Johnson R and Zhang T. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013.
Kairouz P, McMahan HB, Avent B, Bellet A, Bennis M, Bhagoji AN, Bonawitz K, Charles Z, Cormode G, Cummings R, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
Karimireddy SP, Kale S, Mohri M, Reddi S, Stich S, and Suresh AT. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pages 5132–5143. PMLR, 2020.
Koloskova A, Loizou N, Boreiri S, Jaggi M, and Stich S. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393. PMLR, 2020.
Krizhevsky A, Hinton G, et al. Learning multiple layers of features from tiny images. 2009.
Lalitha A, Shekhar S, Javidi T, and Koushanfar F. Fully decentralized federated learning. In Third workshop on bayesian deep learning (NeurIPS), volume 2, 2018.
Lan G, Lee S, and Zhou Y. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, 180(1):237–284, 2020.
Li X, Huang K, Yang W, Wang S, and Zhang Z. On the convergence of fedavg on non-iid data. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJxNAnVtDS
Li Z, Shi W, and Yan M. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019.
Li Z, Nabavirazavi S, Ying B, Iyengar S, and Yang H. Fast: A lightweight mechanism unleashing arbitrary client participation in federated learning. In Proceedings of International Joint Conference on Artificial Intelligence, pages 5644–5652, 2025. doi: 10.24963/ijcai.2025/628.
Lian X, Zhang C, Zhang H, Hsieh C-J, Zhang W, and Liu J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. Advances in neural information processing systems, 30, 2017.
McMahan B, Moore E, Ramage D, Hampson S, and y Arcas BA. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
Meyer CD. Matrix analysis and applied linear algebra. SIAM, 2023.
Mishchenko K, Malinovsky G, Stich S, and Richtárik P. Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! In International Conference on Machine Learning, pages 15750–15769. PMLR, 2022.
Nedić A and Olshevsky A. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2014.
Nedic A and Ozdaglar A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
Nedic A, Olshevsky A, and Shi W. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
Nguyen EDH, Jiang X, Ying B, and Uribe CA. On graphs with finite-time consensus and their use in gradient tracking. SIAM Journal on Optimization, 35(2):872–898, 2025.
Pu S, Shi W, Xu J, and Nedić A. Push–pull gradient methods for distributed optimization in networks. IEEE Transactions on Automatic Control, 66(1):1–16, 2020.
Ryu EK and Yin W. Large-scale convex optimization: algorithms & analyses via monotone operators. Cambridge University Press, 2022.
Saadatniaki F, Xin R, and Khan UA. Decentralized optimization over time-varying directed graphs with row and column-stochastic matrices. IEEE Transactions on Automatic Control, 65(11):4769–4780, 2020.
Sayed AH et al. Adaptation, learning, and optimization over networks. Foundations and Trends® in Machine Learning, 7(4–5):311–801, 2014.
Shi W, Ling Q, Wu G, and Yin W. Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Shi Y, Shen L, Wei K, Sun Y, Yuan B, Wang X, and Tao D. Improving the model consistency of decentralized federated learning. In International Conference on Machine Learning, pages 31269–31291. PMLR, 2023.
Wang J, Liu Q, Liang H, Joshi G, and Poor HV. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 33:7611–7623, 2020.
Wang L, Guo Y, Lin T, and Tang X. Delta: Diverse client sampling for fasting federated learning. Advances in Neural Information Processing Systems, 36, 2024.
Wang S and Ji M. A unified analysis of federated learning with arbitrary client participation. Advances in Neural Information Processing Systems, 35:19124–19137, 2022.
Wang S and Ji M. A lightweight method for tackling unknown participation statistics in federated averaging. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ZKEuFKfCKA
Xiang M, Ioannidis S, Yeh E, Joe-Wong C, and Su L. Efficient federated learning against heterogeneous and non-stationary client unavailability. Advances in Neural Information Processing Systems, 37:104281–104328, 2024.
Xin R and Khan UA. A linear algorithm for optimization over directed graphs with geometric convergence. IEEE Control Systems Letters, 2(3):315–320, 2018.
Ying B, Yuan K, Chen Y, Hu H, Pan P, and Yin W. Exponential graph is provably efficient for decentralized deep training. Advances in Neural Information Processing Systems, 34:13975–13987, 2021.
Yuan K, Ling Q, and Yin W. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
Yuan K, Ying B, Zhao X, and Sayed AH. Exact diffusion for distributed optimization and learning—part i: Algorithm development. IEEE Transactions on Signal Processing, 67(3):708–723, 2018.


