Abstract
Tensor factorization has proven to be an efficient unsupervised learning approach for health data analysis, especially for computational phenotyping, where high-dimensional Electronic Health Records (EHRs) containing patients' histories of medical procedures, medications, diagnoses, lab tests, etc., are converted into meaningful and interpretable medical concepts. Federated tensor factorization distributes the tensor computation to multiple workers under the coordination of a central server, which enables jointly learning the phenotypes across multiple hospitals while preserving the privacy of patient information. However, existing federated tensor factorization algorithms encounter the single-point-failure issue with the involvement of the central server, which is not only easily exposed to external attacks but also limits the number of clients sharing information with the server under restricted uplink bandwidth. In this paper, we propose CiderTF, a communication-efficient decentralized generalized tensor factorization, which reduces the uplink communication cost by leveraging a four-level communication reduction strategy designed for generalized tensor factorization, a formulation flexible enough to model different tensor distributions with multiple kinds of loss functions. Experiments on two real-world EHR datasets demonstrate that CiderTF achieves comparable convergence with a communication reduction of up to 99.99%.
Index Terms—Tensor Factorization, Decentralized Optimization, Federated Learning, Communication Efficiency, EHRs
I. Introduction
The widespread adoption of EHR systems has facilitated the rapid accumulation of patients' clinical data from numerous medical institutions. Yet, successfully mining the massive, high-dimensional EHR data is a challenging task due to sparse, missing, and noisy measurements [1], [2]. Computational phenotyping is the process of mapping high-dimensional EHR data into meaningful medical concepts that characterize a patient's clinical behavior and the corresponding treatments. Tensor factorization has proven to be an efficient unsupervised learning approach that automatically extracts phenotypes without manual labeling [3]–[5].
Recently, federated tensor factorization [6]–[8] has been developed as a special distributed tensor factorization paradigm that not only parallelizes the tensor computation, but also preserves data privacy: the horizontally partitioned tensors remain at the participating medical institutions to avoid direct data sharing, and the shared phenotypes are learned through joint tensor factorization without communicating individual-level data. Moreover, with the participation of different data sources, federated tensor factorization also helps mitigate the bias of analyzing data from a single source and achieves better generalizability.
Under the federated learning setting, the central server is the most important computation resource, as it is in charge of picking the clients that communicate at each iteration, aggregating the clients' intermediate results, and updating the global model. However, a single server has several shortcomings: 1) limited connectivity and bandwidth, which restricts the server from collecting data from as many clients as possible; 2) vulnerability to malfunctions, which can cause inaccurate model updates or even learning failures; and 3) exposure to external attacks and malicious adversaries, which can lead to sensitive information leakage. Therefore, traditional federated tensor factorization usually suffers from the bottleneck of the central server's limited communication bandwidth and is exposed to a high risk of single-point-failure. To avoid relying on the server as the only source of computation, decentralization has been proposed as a solution to this single-point-failure issue [9], [10]. Decentralized federated learning operates without a central server: each client relies on its own computation resources and communicates only with its neighbors in a peer-to-peer manner. Besides the necessity of a decentralized communication topology, it is also worth noting that in many real-world applications the network capacity between clients is usually much smaller than that within a datacenter [11]. It is therefore necessary for the clients to communicate their model updates efficiently and at limited cost.
In this paper, we study the decentralized optimization of tensor factorization under the horizontal data partition setting and propose CiderTF, a Communication-effIcient DEcentralized geneRalized Tensor Factorization algorithm for collaborative analysis over a communication network. To enable more flexibility in choosing loss functions for various scenarios, we extend classic federated tensor factorization to a more generalized tensor factorization. To the best of our knowledge, this paper is the first to propose a decentralized generalized tensor factorization, let alone one that addresses the decentralized setting with communication efficiency. Our contributions are briefly summarized as follows.
First, we develop a decentralized tensor factorization framework that applies four levels of communication reduction strategies to the decentralized optimization of tensor factorization, reducing the communication cost over the network. Second, we incorporate Nesterov's momentum into the local updates of CiderTF and propose CiderTF_m, which achieves better generalization and faster convergence. Third, we conduct comprehensive experiments on both real-world and synthetic datasets to corroborate the theoretical communication reduction and the convergence of CiderTF. Experimental results demonstrate that CiderTF achieves comparable convergence performance with a communication reduction of 99.99%.
II. Preliminaries and Background
In this section, we summarize the frequently used definitions and notations. For a D-th order tensor 𝓧 ∈ ℝ^{I_1 × ⋯ × I_D}, the tensor entry indexed by i = (i_1, …, i_D) is denoted by the MATLAB representation 𝓧(i_1, …, i_D). Let ℐ denote the index set of all tensor entries. The mode-d unfolding (also called matricization) is denoted by 𝓧_{<d>} ∈ ℝ^{I_d × ∏_{j≠d} I_j}. Detailed background knowledge can be found in [12].
Definition II.1. (MTTKRP). The MTTKRP operation stands for the matricized tensor times Khatri-Rao product. Given a tensor 𝓨, its mode-d matricization is Y_{<d>}, and [A^{(1)}, …, A^{(D)}] is the set of CP factor matrices. The matrix H_d is defined as

$$H_d = A^{(D)} \odot \cdots \odot A^{(d+1)} \odot A^{(d-1)} \odot \cdots \odot A^{(1)},$$

where ⊙ is the Khatri-Rao product. The MTTKRP operation can thus be defined as the matrix product between Y_{<d>} and H_d, i.e., Y_{<d>} · H_d.
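To make the MTTKRP operation concrete, below is a minimal NumPy sketch for a dense D-th order tensor. The helper names (`unfold`, `khatri_rao`, `mttkrp`) are ours, and the unfolding convention is chosen to be self-consistent rather than to match any particular library.

```python
import numpy as np

def unfold(X, d):
    """Mode-d matricization: rows indexed by mode d, C-order on the rest."""
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

def khatri_rao(matrices):
    """Column-wise Khatri-Rao product of a list of (I_j, R) matrices."""
    R = matrices[0].shape[1]
    H = matrices[0]
    for A in matrices[1:]:
        # pair every row of H with every row of A, column by column
        H = (H[:, None, :] * A[None, :, :]).reshape(-1, R)
    return H

def mttkrp(X, factors, d):
    """MTTKRP for mode d: unfold(X, d) times the Khatri-Rao product H_d of
    all factor matrices except mode d (ordered to match `unfold`)."""
    rest = [factors[j] for j in range(X.ndim) if j != d]
    return unfold(X, d) @ khatri_rao(rest)

# usage: a random 3rd-order tensor with rank-4 factor matrices
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 6, 7))
factors = [rng.standard_normal((s, 4)) for s in X.shape]
G = mttkrp(X, factors, 1)   # shape (6, 4)
```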
Definition II.2. (GCP). Generalized CP (GCP) [13] extends classic CP by using an elementwise loss function, which supports loss functions beyond least squares. The objective function of GCP is
$$\min_{A^{(1)},\dots,A^{(D)}} F(\mathcal{M};\mathcal{X}) = \sum_{i\in\mathcal{I}} \ell(x_i, m_i), \quad m_i = \sum_{r=1}^{R}\prod_{d=1}^{D} A^{(d)}(i_d, r). \tag{1}$$
GCP not only preserves the low-rank constraints of CP decomposition, but also enjoys the flexibility of choosing different loss functions according to different data distributions by leveraging the elementwise objective function. For example, for data indexed by i ∈ ℐ with a Gaussian distribution, we use the least square loss, which in turn yields the classic CP decomposition:
$$\ell(x_i, m_i) = (x_i - m_i)^2. \tag{2}$$
On the other hand, for binary data indexed by i ∈ ℐ, we can use Bernoulli-logit loss to fit it:
$$\ell(x_i, m_i) = \log\left(1 + e^{m_i}\right) - x_i m_i. \tag{3}$$
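As a concrete illustration of the elementwise losses in eqs. (2) and (3), the following sketch implements both together with their derivatives ∂ℓ/∂m_i, which are exactly what the gradient computation in Section III consumes. The function names are ours.

```python
import numpy as np

def least_squares(x, m):
    """Elementwise loss for Gaussian data (eq. 2)."""
    return (x - m) ** 2

def least_squares_grad(x, m):
    """dl/dm for eq. (2)."""
    return 2.0 * (m - x)

def bernoulli_logit(x, m):
    """Elementwise loss for binary data (eq. 3); logaddexp keeps it stable."""
    return np.logaddexp(0.0, m) - x * m

def bernoulli_logit_grad(x, m):
    """dl/dm for eq. (3): sigmoid(m) - x."""
    return 1.0 / (1.0 + np.exp(-m)) - x
```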
III. Proposed Method
A. Problem Formulation
In the decentralized tensor factorization setting, the communication topology is represented by an undirected graph 𝓖 = (𝓥, 𝓔), where 𝓥 ≔ {1, 2, …, K} denotes the set of clients participating in the communication network. Each node k in the graph represents a client. The neighbors of client k are denoted by 𝓝_k ≔ {j : (k, j) ∈ 𝓔}. The network is associated with a connectivity matrix W ∈ ℝ^{K×K}, whose (k, j)-th entry w_{kj} ∈ [0, 1], ∀(k, j) ∈ 𝓔, denotes the weight of edge (k, j) and measures how much client k is impacted by client j.
Each client in the decentralized communication graph holds a local tensor 𝓧_k, which can be seen as a horizontal partition of a global tensor 𝓧. The aim of decentralized federated learning is to jointly factorize the local tensors 𝓧_k to obtain the globally shared feature factor matrices A^{(2)}, …, A^{(D)} together with each client's individual first-mode factor matrix. The objective function for the decentralized generalized tensor factorization is
$$\min \sum_{k=1}^{K} F(\mathcal{M}_k;\mathcal{X}_k) = \sum_{k=1}^{K}\sum_{i\in\mathcal{I}_k} \ell(x_i, m_i), \quad \text{s.t. } A_k^{(d)} = A_j^{(d)},\ \forall (k,j)\in\mathcal{E},\ d = 2,\dots,D, \tag{4}$$

where $\mathcal{M}_k = [\![A_k^{(1)}, A_k^{(2)}, \dots, A_k^{(D)}]\!]$,
which can be further extended to other multiblock optimization problems that are not limited to tensor factorization [14].
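To make the consensus constraint in eq. (4) concrete, here is a minimal sketch of the gossip-style mixing step that decentralized methods use to enforce it, together with a doubly stochastic connectivity matrix W for a ring topology. The uniform 1/3 weights are an illustrative assumption, not the paper's tuned choice.

```python
import numpy as np

def ring_mixing_matrix(K):
    """Doubly stochastic W for a ring of K >= 3 clients: each client
    averages itself with its two neighbors (uniform 1/3 weights)."""
    W = np.zeros((K, K))
    for k in range(K):
        W[k, k] = W[k, (k - 1) % K] = W[k, (k + 1) % K] = 1.0 / 3.0
    return W

def consensus_step(factors, W):
    """Client k replaces its copy of a shared-mode factor with the
    W-weighted average of its neighbors' (estimated) copies."""
    K = len(factors)
    return [sum(W[k, j] * factors[j] for j in range(K)) for k in range(K)]
```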
B. CiderTF
1). Overview:
We propose CiderTF, a decentralized tensor factorization framework that achieves communication efficiency through four levels of communication reduction. At the element-level, we utilize the sign compressor [15], [16] for gradient compression, reducing the number of bytes transmitted between clients by converting the partial gradient from a floating-point representation to a low-precision representation.
Definition III.1. (Sign Compressor). For an input x ∈ ℝ^d, its compression via Sign(·) is Sign(x) = (∥x∥₁/d) · sign(x), where sign(·) takes the sign of each element of x.
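A minimal sketch of this compressor, assuming the transmitted message is the scalar ∥x∥₁/d plus one sign per entry (here d is the number of entries, not a mode index):

```python
import numpy as np

def sign_compress(x):
    """Sign(x) = (||x||_1 / d) * sign(x); only the scalar scale and the
    int8 signs need to be transmitted."""
    x = np.asarray(x, dtype=np.float64)
    scale = np.abs(x).sum() / x.size
    return scale, np.sign(x).astype(np.int8)

def sign_decompress(scale, signs):
    """Receiver side: rebuild the low-precision estimate."""
    return scale * signs.astype(np.float64)
```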
At the block-level, we apply randomized block coordinate descent [17]–[19] for the factor updates, which only requires sampling one mode of the tensor per round and communicating only that mode's factor updates with the neighbors. At the round-level, we adopt a periodic communication strategy [20]–[22] to reduce the communication frequency by allowing each client to perform τ > 1 local update rounds before communicating with its neighbors. In addition, at the communication event-level, we apply an event-triggered communication strategy [23], [24] to boost the communication reduction at the round level.
[Algorithm 1: CiderTF]
The detailed algorithm is shown in Algorithm 1 with the key steps annotated. In CiderTF, each client k ∈ [K] maintains the local factor matrices of every mode d = 1, …, D. The goal is to achieve consensus on the feature-mode factor matrices. Therefore, besides the local factor matrices, each client also needs to maintain estimates of the factor matrices of itself and of its neighbors (j ∈ 𝓝_k ∪ {k}). The sequence of randomly sampled blocks over rounds t = 1, …, T is denoted by d_ξ[0], …, d_ξ[T]. At every communication round (every τ iterations), each client checks the triggering condition for the sampled block d_ξ[t] (line 10), with the triggering threshold set to λ[t]. When the difference between the updated factor and the local estimate is larger than the threshold, the client sends the compressed update to its neighbors and receives theirs; otherwise, the client communicates a matrix of zeros instead (lines 10–14). After receiving the compressed updates from all its neighbors, each client first updates its local estimates of the factor matrices (line 16) and then updates the local factors through the decentralized consensus step (line 18). At non-communication rounds, each client simply keeps updating the local factor matrices (lines 6–7). The blocks not selected in a round remain the same as in the previous round (lines 20–22). A simplified sketch of the event-triggered exchange follows.
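This sketch compresses the factor difference with the sign compressor of Definition III.1 and suppresses the transmission when the change is below the threshold λ[t]. The triggering test on the Frobenius norm of the difference is our simplification of the condition in Algorithm 1, and the helper names are ours.

```python
import numpy as np

def triggered_message(A_new, A_hat, lam):
    """Event-triggered, compressed broadcast of one factor update.

    A_new: factor after tau local update rounds; A_hat: the last estimate
    of this factor known to the neighbors; lam: threshold lambda[t].
    Returns (scale, signs); scale == 0.0 encodes 'nothing changed enough'.
    """
    diff = A_new - A_hat
    if np.linalg.norm(diff) > lam:          # triggering condition
        scale = np.abs(diff).sum() / diff.size
        return scale, np.sign(diff).astype(np.int8)
    return 0.0, np.zeros(diff.shape, dtype=np.int8)

def apply_message(A_hat, scale, signs):
    """Receiver refreshes its estimate of the sender's factor."""
    return A_hat + scale * signs.astype(np.float64)
```

After every neighbor's estimate has been refreshed this way, the client mixes the estimates using its connectivity weights w_{kj}, as in the consensus step sketched in Section III-A.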
2). Optimization:
At each iteration, each client k first needs to compute the GCP gradient, i.e., the partial derivative with respect to the factor matrix A_k^{(d)}, using the MTTKRP operator:
$$\frac{\partial F}{\partial A_k^{(d)}} = \mathcal{Y}_{<d>}\, H_d, \tag{5}$$

where 𝓨 is the elementwise partial derivative tensor with entries y_i = ∂ℓ(x_i, m_i)/∂m_i, and H_d denotes the Khatri-Rao product of all factor matrices except that of mode d, as shown in Definition II.1.
Fiber Sampling.
Computing the full gradient requires 𝒪(R ∏_d I_d) time and is the bottleneck of gradient-based optimization for tensor factorization, especially for EHR tensors where each dimension can be very large. The fiber sampling technique [18], [25] randomly samples a set 𝒮_d of mode-d fibers. This provides efficient formation of the sampled matricization 𝓨_{<d>}(:, 𝒮_d) and efficient computation of the corresponding rows of H_d, which only requires the Hadamard product (⊛) of the selected rows of the factor matrices at round t, H_d(s, :) = A^{(D)}(i_D, :) ⊛ ⋯ ⊛ A^{(d+1)}(i_{d+1}, :) ⊛ A^{(d-1)}(i_{d-1}, :) ⊛ ⋯ ⊛ A^{(1)}(i_1, :), where the row indices are obtained from the index mapping between fiber s and the tensor indices. Therefore, we can use the local partial stochastic gradient G_k^{(d)}[t] as an unbiased estimation of the gradient, which is efficiently computed with the fiber sampling technique as

$$G_k^{(d)}[t] = \frac{\prod_{j\neq d} I_j}{|\mathcal{S}_d|}\, \mathcal{Y}_{<d>}(:,\mathcal{S}_d)\, H_d(\mathcal{S}_d,:). \tag{6}$$
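The following sketch puts eqs. (5) and (6) together for a dense D-way tensor: it samples |𝒮_d| mode-d fibers, builds only the corresponding rows of H_d by Hadamard products, evaluates the elementwise derivative on the sampled entries, and rescales for unbiasedness. It assumes the conventions of the earlier sketches and a `grad_fn` such as `bernoulli_logit_grad`.

```python
import numpy as np

def sampled_gcp_gradient(X, factors, d, n_fibers, grad_fn, rng):
    """Fiber-sampled stochastic GCP gradient for mode d (a sketch of eq. 6).

    grad_fn(x, m) is the elementwise derivative dl/dm of the chosen loss.
    """
    D, R = X.ndim, factors[0].shape[1]
    rest = [j for j in range(D) if j != d]
    # sample |S_d| mode-d fibers: one random index per non-d mode
    idx = [rng.integers(0, X.shape[j], size=n_fibers) for j in rest]
    # the corresponding rows of H_d are Hadamard products of factor rows
    H_rows = np.ones((n_fibers, R))
    for j, ind in zip(rest, idx):
        H_rows *= factors[j][ind, :]
    # gather the sampled fibers of X, shape (I_d, n_fibers)
    Xd = np.moveaxis(X, d, 0)
    X_fibers = Xd[(slice(None),) + tuple(idx)]
    # low-rank model values on the sampled entries, same shape
    M_fibers = factors[d] @ H_rows.T
    # elementwise derivative tensor Y restricted to the samples
    Y_fibers = grad_fn(X_fibers, M_fibers)
    # rescale so the estimate is unbiased for the full gradient
    total_fibers = np.prod([X.shape[j] for j in rest])
    return (total_fibers / n_fibers) * (Y_fibers @ H_rows)
```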
Block randomization.
We utilize block randomization [18] to further improve the computation efficiency by randomly selecting one mode to update at each round. Specifically for CiderTF, we always keep the patient mode (the first mode) securely local to avoid directly sharing patient-related information; thus, when d_ξ[t] = 1, we skip the communication of that round and only update the local patient-mode factors. This not only improves the computation efficiency but also reduces the communication cost at the block level.
C. CiderTF_m: CiderTF with Nesterov’s momentum
We further propose CiderTF_m, which incorporates Nesterov's momentum into the local SGD update step to speed up convergence and reduce the total communicated bits. After computing the partial stochastic gradient (line 4), we update the momentum velocity component as
$$u_k^{(d)}[t+1] = \beta\, u_k^{(d)}[t] + G_k^{(d)}[t], \tag{7}$$
where β is the momentum parameter. The intermediate factor matrix is then updated as
$$A_k^{(d)}[t+1] = A_k^{(d)}[t] - \gamma[t]\left(\beta\, u_k^{(d)}[t+1] + G_k^{(d)}[t]\right). \tag{8}$$
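A minimal sketch of this local step, as reconstructed in eqs. (7)–(8); the variable names are ours and β = 0.9 is only an illustrative default.

```python
import numpy as np

def nesterov_local_update(A, u, G, lr, beta=0.9):
    """One CiderTF_m local step (a sketch of eqs. 7-8).

    A: local factor matrix A_k^(d)[t]; u: momentum velocity u_k^(d)[t];
    G: partial stochastic gradient G_k^(d)[t]; lr: learning rate gamma[t].
    """
    u_next = beta * u + G                      # eq. (7): velocity update
    A_next = A - lr * (beta * u_next + G)      # eq. (8): look-ahead step
    return A_next, u_next
```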
D. Complexity Analysis
We analyze the complexity from the perspectives of computation, communication, and memory cost. For computation, the per-iteration complexity of CiderTF for each client is dominated by the fiber-sampled MTTKRP, i.e., 𝒪(|𝒮_d| R I_d) for the sampled mode d. For communication, the four levels of reduction compound: based on experimental results, the total communication reduction is up to 99.99% compared with full-precision decentralized SGD. For memory, each client stores its local factor matrices plus the estimates of its neighbors' shared-mode factors. Please refer to [12] for the detailed complexity analysis.
IV. Experiment
A. Experimental Settings
1). Datasets:
We conduct experiments on two real-world, large-volume, publicly available, and de-identified datasets, MIMIC-III [26] and CMS [27], and on a synthetic dataset with similar sparsity (see [12] for more detail). We follow the rules in [6] and select the top 500 most frequently observed diagnoses, procedures, and medications to form the tensors, with patient-mode sizes of 34,272, 125,961, and 4,000 for MIMIC-III, CMS, and the synthetic data, respectively.
2). Baselines:
We consider the following centralized tensor factorization baselines: i) GCP [28] as the generalized tensor factorization baseline; ii) BrasCPD [18] as the computation-efficient tensor factorization baseline; iii) Centralized CiderTF, i.e., CiderTF with K = 1 and error feedback.
We also implement decentralized versions of SGD under the non-convex setting as the decentralized baselines, since there is no existing decentralized tensor factorization framework: i) D-PSGD [10], [29], a pure decentralized SGD; ii) SPARQ-SGD [24], a communication-efficient decentralized stochastic gradient descent baseline; iii) D-PSGDbras, which can be considered D-PSGD with block randomization.
3). Parameter Settings:
Experiments are performed with two objective functions: the Bernoulli-logit loss to fit binary data (eq. 3) and the least square loss to fit data with a Gaussian distribution (eq. 2). We use a fixed learning rate γ[t], determined by searching a grid of powers of 2. We follow the rules in [24] to set the triggering threshold λ[t]. The detailed parameter settings and additional experimental results (more datasets, ablation study, etc.) can be found in [12].
B. Result Analysis
We form the decentralized communication topology as a ring, with a default of eight workers and the data horizontally partitioned and distributed evenly across all eight clients.
1). Comparison to the Baselines:
From fig. 2, we have four major observations. I) CiderTF converges to losses comparable to the centralized baselines. These results empirically validate the convergence of CiderTF. II) CiderTF has lower communication cost without sacrificing convergence: it takes 99.99% less communication than D-PSGD, 75% less than SPARQ-SGD, and 99.92% less than D-PSGDbras to achieve the same loss. III) CiderTF is computationally efficient compared with GCP and D-PSGD (fig. 2) due to fiber sampling and block randomization; it is also slightly more efficient than BrasCPD thanks to the decentralized data distribution, which helps parallelize the local tensor factorization. IV) Nesterov's momentum offers CiderTF_m faster convergence, leading to lower overall communication cost: CiderTF_m requires fewer epochs to converge (fig. 2), which in turn reduces the total communicated bytes with little sacrifice of accuracy.
Fig. 2. Bernoulli-logit loss (columns 1–2) and least square loss (columns 3–4) with respect to time and communication for CMS (top) and MIMIC-III (bottom).
2). Impact of Topology:
We test CiderTF on a ring topology and a star topology with the same number of workers (fig. 1). From fig. 3, we observe that the different topologies do not affect convergence, which means that CiderTF generalizes to different kinds of communication topologies. Fig. 3 also illustrates that the two topologies enjoy similar computation time due to the same number of workers, while the star topology has lower communication cost because its total degree is less than that of the ring topology.
Fig. 1. Ring topology (left) and star topology (right).
Fig. 3. Bernoulli-logit loss for the ring topology (solid lines) and the star topology (dashed lines) with respect to time and communication for MIMIC-III data.
3). Scalability:
Moreover, we test the scalability of CiderTF. Increasing the number of clients from K = 8 to K = 16 and K = 32, we observe linear scalability in computation time (fig. 4, left) without sacrificing accuracy. However, as the number of clients increases, the communication cost increases accordingly (fig. 4, right). Therefore, there exists a computation-communication trade-off when increasing the number of clients involved in the decentralized tensor factorization framework.
Fig. 4. Bernoulli-logit loss with respect to time and communication for MIMIC-III data with 8, 16, and 32 workers and local update rounds τ = 4, 8.
C. Case Study on MIMIC-III
We conduct a case study on MIMIC-III to evaluate the extracted phenotypes from both quantitative and qualitative perspectives. From the quantitative aspect, we use the Factor Match Score (FMS) [30] to measure the similarity between the factor matrices of CiderTF and those of BrasCPD. FMS ranges from 0 to 1, with a best possible value of 1. Fig. 5 indicates that CiderTF achieves FMS comparable to the baselines with much less computation time and communication cost.
Fig. 5. Factor Match Scores (FMS) with respect to time and communication.
From the qualitative perspective, we evaluate the quality of the phenotypes by their patient subgroup identification ability. Following the precedent set in [5], we first identify the top three phenotypes according to the phenotype importance factor λ_r. We then group the patients by assigning each one according to the largest value among the top three phenotypes in its patient representation vector, and use tSNE to map the patient representations into a two-dimensional space. Table II shows that CiderTF (τ = 8) achieves patient subgroup identification ability comparable to the centralized baseline BrasCPD, while with the same communication cost, CiderTF achieves better-clustered subgroups than the decentralized baseline SPARQ-SGD. In addition, the top three phenotypes extracted by CiderTF (table III) are clinically meaningful and interpretable, as annotated by a pulmonary and critical care physician.
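A sketch of this subgroup-assignment protocol, assuming factor matrices with the patient mode first: the phenotype weights λ_r are products of column norms as in Table II, and each patient is labeled by its largest coordinate among the top three phenotypes; the two-dimensional embedding would then be produced by any tSNE implementation applied to A^{(1)}. The function and variable names are ours.

```python
import numpy as np

def top_phenotype_assignment(factors, n_top=3):
    """Assign each patient to one of the top-n_top phenotypes."""
    A1 = factors[0]                          # patient-mode factor, (N, R)
    # phenotype importance: product of column norms across all modes
    lam = np.ones(A1.shape[1])
    for A in factors:
        lam *= np.linalg.norm(A, axis=0)
    top = np.argsort(lam)[::-1][:n_top]      # indices of top phenotypes
    # each patient gets the top phenotype with its largest coordinate
    return top[np.argmax(A1[:, top], axis=1)]
```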
TABLE III.
Phenotypes extracted by CiderTF (τ = 8) on MIMIC-III data. Dx, Px, and Med indicate diagnoses, procedures, and medications.
| P1: Acute myocardial infarction | |
|---|---|
| Dx | Other and unspecified angina pectoris; Coronary atherosclerosis of autologous vein bypass graft; Old myocardial infarction |
| Px | (Aorto)coronary bypass of two coronary arteries; (Aorto)coronary bypass of three coronary arteries; Implant of pulsation balloon |
| Med | Diltiazem Hydrochloride Extended-Release; Metoprolol succinate; Rosuvastatin Calcium; Valsartan/hydrochlorothiazide; Losartan Potassium |
| P2: Respiratory failure | |
| Dx | Acute respiratory failure; Hypoxemia; Contusion of lung without mention of open wound into thorax; Disruption of internal operation (surgical) wound |
| Px | Non-invasive mechanical ventilation; Continuous invasive mechanical ventilation for less than 96 consecutive hours |
| Med | Dextrose; Albuminar-25; Plasmanate |
| P3: Intracranial hemorrhage or cerebral infarction | |
| Dx | Pure hypercholesterolemia; Subdural hemorrhage; Cerebral artery occlusion |
| Px | Injection or infusion of thrombolytic agent; Control of hemorrhage |
| Med | Ticagrelor; Atorvastatin Calcium |
V. Conclusion
In this paper, we propose CiderTF, the first decentralized generalized tensor factorization framework. It employs aggressive communication reduction techniques and maintains low computational and memory complexity without sacrificing accuracy. Experiments show that CiderTF preserves the quality of the extracted phenotypes and converges to points similar to the decentralized SGD baselines with theoretical guarantees. Future work includes extending the decentralized paradigm with asynchronous communication and variance reduction techniques.
TABLE I.
Symbols and notations used in this paper
| Symbol | Definition |
|---|---|
| x, X, 𝓧 | Vector, Matrix, Tensor |
| 𝓧<d> | Mode-d matricization of 𝓧 |
| || · ||1 | 𝓁1-norm |
| || · ||F | Frobenius norm |
| ⊛ | Hadamard (element-wise) multiplication |
| ⊙ | Khatri-Rao product |
| ○ | Outer product |
| <·,·> | Inner product |
TABLE II.
tSNE visualization of the patient subgroup identification with the extracted phenotypes. Each point represents a patient, colored according to the highest-valued coordinate in the patient representation vector among the top three phenotypes, which are selected based on the factor weights λ_r = ∥A^{(1)}(:, r)∥_F ∥A^{(2)}(:, r)∥_F ⋯ ∥A^{(D)}(:, r)∥_F.
Acknowledgment
This work was supported by the National Science Foundation under awards IIS-1838200, CNS-2124104, and CNS-1952192, the National Institutes of Health (NIH) under awards R01LM013323, K01LM012924, and R01GM118609, and CTSA award UL1TR002378.
References
- [1] Miotto R, Li L, Kidd BA, and Dudley JT, "Deep patient: an unsupervised representation to predict the future of patients from the electronic health records," Scientific Reports, vol. 6, no. 1, pp. 1–10, 2016.
- [2] Weiskopf NG and Weng C, "Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research," JAMIA, vol. 20, no. 1, pp. 144–151, 2013.
- [3] Ho JC, Ghosh J, and Sun J, "Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization," in Proceedings of the 20th ACM SIGKDD, 2014, pp. 115–124.
- [4] Wang Y, Chen R, Ghosh J, Denny JC, Kho A, Chen Y, Malin BA, and Sun J, "Rubik: Knowledge guided tensor factorization and completion for health data analytics," in Proceedings of the 21st ACM SIGKDD, 2015.
- [5] Perros I, Papalexakis EE, Wang F, Vuduc R, Searles E, Thompson M, and Sun J, "SPARTan: Scalable PARAFAC2 for large & sparse data," in Proceedings of the 23rd ACM SIGKDD, 2017, pp. 375–384.
- [6] Kim Y, Sun J, Yu H, and Jiang X, "Federated tensor factorization for computational phenotyping," in Proceedings of the 23rd ACM SIGKDD, 2017.
- [7] Ma J, Zhang Q, Lou J, Ho JC, Xiong L, and Jiang X, "Privacy-preserving tensor factorization for collaborative health data analysis," in Proceedings of the 28th ACM CIKM, 2019, pp. 1291–1300.
- [8] Ma J, Zhang Q, Lou J, Xiong L, and Ho JC, "Communication efficient federated generalized tensor factorization for collaborative health data analytics," in Proceedings of the Web Conference 2021, 2021, pp. 171–182.
- [9] Li J, Shao Y, Ding M, Ma C, Wei K, Han Z, and Poor HV, "Blockchain assisted decentralized federated learning (BLADE-FL) with lazy clients," arXiv preprint arXiv:2012.02044, 2020.
- [10] Lian X, Zhang C, Zhang H, Hsieh C-J, Zhang W, and Liu J, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," arXiv preprint arXiv:1705.09056, 2017.
- [11] Vulimiri A, Curino C, Godfrey PB, Jungblut T, Padhye J, and Varghese G, "Global analytics in the face of bandwidth and regulatory constraints," in 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), 2015, pp. 323–336.
- [12] Ma J, Zhang Q, Lou J, Xiong L, Bhavani S, and Ho JC, "Communication efficient tensor factorization for decentralized healthcare networks," arXiv preprint arXiv:2109.01718, 2021.
- [13] Hong D, Kolda TG, and Duersch JA, "Generalized canonical polyadic tensor decomposition," arXiv preprint arXiv:1808.07452, 2018.
- [14] Zeng J, Lau TT-K, Lin S, and Yao Y, "Global convergence of block coordinate descent in deep learning," in ICML, 2019, pp. 7313–7323.
- [15] Stich SU and Karimireddy SP, "The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication," arXiv preprint arXiv:1909.05350, 2019.
- [16] Stich SU, Cordonnier J-B, and Jaggi M, "Sparsified SGD with memory," in NeurIPS, 2018, pp. 4447–4458.
- [17] Beck A and Tetruashvili L, "On the convergence of block coordinate descent type methods," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2037–2060, 2013.
- [18] Fu X, Ibrahim S, Wai H-T, Gao C, and Huang K, "Block-randomized stochastic proximal gradient for low-rank tensor factorization," IEEE Transactions on Signal Processing, vol. 68, pp. 2170–2185, 2020.
- [19] Nesterov Y, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journal on Optimization, vol. 22, no. 2, pp. 341–362, 2012.
- [20] Stich SU, "Local SGD converges fast and communicates little," in ICLR, 2018.
- [21] Lin T, Stich SU, Patel KK, and Jaggi M, "Don't use large mini-batches, use local SGD," arXiv preprint arXiv:1808.07217, 2018.
- [22] Basu D, Data D, Karakus C, and Diggavi S, "Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations," in NeurIPS, 2019.
- [23] Du W, Yi X, George J, Johansson KH, and Yang T, "Distributed optimization with dynamic event-triggered mechanisms," in 2018 IEEE Conference on Decision and Control (CDC), 2018, pp. 969–974.
- [24] Singh N, Data D, George J, and Diggavi S, "SPARQ-SGD: Event-triggered and compressed communication in decentralized optimization," in 2020 59th IEEE Conference on Decision and Control (CDC), 2020, pp. 3449–3456.
- [25] Battaglino C, Ballard G, and Kolda TG, "A practical randomized CP tensor decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 39, no. 2, pp. 876–901, 2018.
- [26] Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG, "MIMIC-III, a freely accessible critical care database," Scientific Data, vol. 3, p. 160035, 2016.
- [27] https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.
- [28] Kolda TG and Hong D, "Stochastic gradients for large-scale tensor decomposition," arXiv preprint arXiv:1906.01687, 2019.
- [29] Koloskova A, Loizou N, Boreiri S, Jaggi M, and Stich S, "A unified theory of decentralized SGD with changing topology and local updates," in ICML, 2020, pp. 5381–5393.
- [30] Acar E, Dunlavy DM, Kolda TG, and Mørup M, "Scalable tensor factorizations for incomplete data," Chemometrics and Intelligent Laboratory Systems, vol. 106, no. 1, pp. 41–56, 2011.