Abstract
Federated learning over data from multiple participating parties is attracting increasing attention and has many healthcare applications. We have previously developed VERTIGO, a distributed logistic regression model for vertically partitioned data. The model takes advantage of the linear separation property of kernel matrices in a dual-space model to harmonize information in a privacy-preserving manner. However, this method does not handle variance estimation and provides only point estimates; it cannot report test statistics and associated P-values. In this work, we extend VERTIGO by introducing a novel ring-structure protocol that passes intermediary statistics among clients and reconstructs the covariance matrix in the dual space. This extension, VERTIGO-CI, is a complete protocol for constructing a logistic regression model from vertically partitioned datasets as if it were trained on the combined data in a centralized setting. We evaluated our results on synthetic and real data, showing equivalent accuracy and tolerable performance overhead compared to the centralized version. This novel extension can be applied to other types of generalized linear models that have dual objectives.
Introduction
With the adoption of electronic health records (EHRs) in the US and advances in health information technology (HIT), a vast amount of health data is being generated rapidly. These data come from different sources (e.g., hospitals, cohort studies, disease registries, health insurance providers, and DNA/RNA sequencers). The conventional solution is to first gather datasets from multiple sources at a central site and then conduct analyses to answer a clinical or research question. However, such a centralized approach is not always viable because of potential harm to patient privacy, regulations and policies, mistrust among participants, and other factors. If analyses could be conducted on data that remain at different sites, these concerns would be greatly mitigated.
A dataset can be partitioned in two ways: horizontally or vertically. Datasets are horizontally partitioned if all participating sites have the same set of features from different individuals. For example, a risk score model for coronary heart disease collects demographics, cholesterol, blood pressure, diabetes, and smoking status from different institutions to develop or validate the model. Horizontally partitioned datasets1 occur in multi-site clinical trials, clinical data research networks (CDRNs), registries, and risk prediction models with non-overlapping development and validation sites2. On the other hand, a dataset is vertically partitioned when different subsets of features from the same individuals are stored at different sites. For example, the Strong Heart Study, the largest epidemiological study of cardiovascular disease in American Indians, stores the genotype data in one institution and the phenotype data in another, allowing access only to approved researchers3. Booming direct-to-consumer (DTC) genetic testing4 companies keep individuals' genetic data on their own storage servers, while the clinical information of those individuals is stored in patient registries or EHR systems. However, association tests of these genetic data can be performed only when they are linked with phenotypes, which typically reside in EHRs and are thus physically separated from the genotype data. Similarly, healthcare claims data are held by health insurance companies, while detailed patient data are located in hospital EHR systems. Current data access protocols involve a lengthy process of request, review, approval, and monitoring, limiting the opportunities for clinical research. Even when datasets can be centralized, transferring genomic, imaging, and patient-generated health data from mobile phones and devices to a central site is not trivial because these datasets can be very large. For this reason, commercial cloud computing platforms host commonly used public genomics datasets such as the 1000 Genomes5 and The Cancer Genome Atlas (TCGA)6 datasets so that users can bypass redundant transfers of such large datasets, which are costly and congest the network.
Many algorithms have been developed for federated analytics on both horizontally and vertically partitioned datasets1,7,8. For vertically partitioned datasets, secure matrix product algorithms are widely adopted9-11. None of these methods used dual optimization to perform interval estimation for vertically partitioned datasets. Dual optimization has been used for support vector machine classifiers12, but the logistic regression model is the preferred method in genetics. VERTIcal Grid lOgistic regression (VERTIGO) is a distributed algorithm that builds a logistic regression model on vertically partitioned datasets using dual optimization13. However, VERTIGO provides only point estimates: it reports neither confidence intervals nor the statistical significance of the estimates in the form of P-values. This study extends our previous work, VERTIGO, by adding standard errors to derive interval estimates and express each parameter's statistical significance. This paper introduces a novel way of generating and transmitting confidence intervals along with the coefficients. We describe our proposed algorithm, provide the mathematical proof, and demonstrate the algorithm's performance on both simulated and real datasets.
Methods
Synthetic data generation
Synthetic data with 2000 samples and 20 features were created under the following distributional assumptions:
1. Generate two independent matrices, X1 and X2, each of dimension 2000 samples × 20 features, using a Uniform[0, 1] distribution.
2. Derive a linear combination of the two matrices, X = 1 + 2X1 + 3X2.
3. Generate a random ground-truth parameter vector β of size 20 × 1 using a Uniform[0, 1] distribution.
4. Apply the sigmoid function to calculate the probabilities for the binary outcome, p = 1 / (1 + e^(−Xβ)).
5. Generate the binary outcome y with probability p from step 4 using a Bernoulli distribution.
The generated features (columns of X) were then assigned to mutually exclusive partitions, where the number of partitions, k, was varied from 2 to 4 and each partition represented a client site; a minimal numpy sketch of this procedure follows.
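A minimal numpy sketch of this generation and partitioning procedure (the random seed and variable names are ours, not from the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 2000, 20, 3                      # samples, features, number of client sites

# Steps 1-2: two independent Uniform[0, 1] matrices and their linear combination
X1 = rng.uniform(0.0, 1.0, size=(n, p))
X2 = rng.uniform(0.0, 1.0, size=(n, p))
X = 1 + 2 * X1 + 3 * X2

# Step 3: ground-truth coefficient vector drawn from Uniform[0, 1]
beta = rng.uniform(0.0, 1.0, size=p)

# Steps 4-5: sigmoid probabilities, then Bernoulli draws for the binary outcome
prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, prob)

# Vertical partitioning: mutually exclusive feature blocks, one per client site
feature_blocks = np.array_split(np.arange(p), k)
clients = [X[:, idx] for idx in feature_blocks]
```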
Real data BURN1000
A synthetic dataset about a burn study was obtained from the R package aplore3. It is included in a companion data archive for the textbook by Hosmer and Lemeshow14. The burn data had eight variables and 1000 samples. The outcome was death, a binary variable (alive or dead). The seven features were age, gender, race, burn facility, total burn surface area, whether inhalation injury was involved, and whether flame was involved in the burn injury.
PennCath
A real dataset was obtained from the Foulkes lab (http://www.stat-gen.org/): the PennCATH cohort data, which arise from a genome-wide association (GWA) study of coronary artery disease (CAD) and cardiovascular risk factors based at the University of Pennsylvania Medical Center15. First, quality control was performed on the genotype data to check sex discrepancies, minor allele frequency, Hardy-Weinberg equilibrium, and relatedness, which reduced the sample size from 3850 to 1280. The whole dataset was then split into two clients, phenotype and genotype. The binary outcome was the disease condition (yes or no). The phenotype data include age, sex, and additional covariates for each individual, while the genotype data contain 10 principal components derived from the SNPs. These 10 components, along with the phenotype data and the SNP genotype data, were input into the VERTIGO-CI algorithm. To evaluate the computation time, we designed three batches of trials using 10, 100, and 1000 SNPs.
Model
The logistic model is defined as
P(y = 1 | X) = π(Xβ) = 1 / (1 + e^(−Xβ)),

where y is the binary outcome, X is the sample-by-feature design matrix, and β is the vector of model parameters. The goal is to find the estimate of β given the observed data X and y. The best estimate of β is the maximizer of the log-likelihood function
l(β) = Σ_i [ y_i log π(x_i^T β) + (1 − y_i) log(1 − π(x_i^T β)) ] − R(β),

where π is the sigmoid function and R(β) is the regularization penalty term used to avoid overfitting. Since the above equation cannot be used for a vertically partitioned dataset in its current form, the VERTIGO algorithm adopts a reparameterization using the dual form of the original optimization problem13.
This dual form of the maximum-likelihood problem yields the same results by optimizing dual parameters with respect to samples rather than features, keeping the information intact16. The next step is to update the dual parameters α using Newton's method17 by iterating
α^(t+1) = α^(t) − H^(−1) g(α^(t)),

where g and H are, respectively, the first and second derivatives (gradient and Hessian) of the dual objective function D(α).
Note that H, the Hessian matrix, is replaced here by a fixed approximation for computational convenience; this change does not harm convergence because it only alters the step size18. C is a positive constant that makes the approximate Hessian matrix full rank so that its inverse exists. Once the dual parameters α converge, the desired primal parameter vector β can be obtained through its relationship to α given in the original VERTIGO formulation13.
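To make the fixed-Hessian idea concrete, the sketch below applies it to an ordinary (centralized, primal) L2-penalized logistic regression, where the exact Hessian is dominated by the constant matrix −(XᵀX/4 + λI), which can be inverted once and reused at every step. This is our illustration of the general technique, not the VERTIGO-CI dual-space update; the penalty strength lam and all names are ours.

```python
import numpy as np

def fit_logistic_fixed_hessian(X, y, lam=1.0, tol=1e-8, max_iter=500):
    """Newton-style updates that reuse a single fixed Hessian bound.

    The exact Hessian of the penalized log-likelihood is
    -(X^T diag(pi*(1-pi)) X + lam*I); since pi*(1-pi) <= 1/4, the constant
    matrix X^T X / 4 + lam*I dominates it and can be inverted once.
    """
    n, p = X.shape
    H_inv = np.linalg.inv(X.T @ X / 4.0 + lam * np.eye(p))  # inverted once, reused
    beta = np.zeros(p)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # current fitted probabilities
        grad = X.T @ (y - pi) - lam * beta       # gradient of the penalized log-likelihood
        step = H_inv @ grad                      # fixed-Hessian Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:           # stop when the updates stall
            break
    return beta
```

Because the fixed matrix dominates the true curvature, the substitution only shortens each Newton step, which is why it affects the step size but not convergence.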
This study's novel contribution is producing the standard errors of the point estimates, which can be used to report statistical significance via P-values or confidence intervals. The standard error of the coefficient can be represented as
SE(β_j) = √( [ (X^T V X)^(−1) ]_(jj) ),  where V = diag( π_i (1 − π_i) ).   (1)
Under the vertically partitioned assumption on X, we have X = (X1, X2, …, Xk), where Xi is the block of features held by client i.
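For reference, in a centralized setting Equation (1) reduces to a few lines of linear algebra. The sketch below is our illustration rather than the distributed protocol; fitted_prob is assumed to be the vector of converged probabilities π, and no penalty term is added to the information matrix.

```python
import numpy as np

def standard_errors(X, fitted_prob):
    """Standard errors of logistic coefficients via the information matrix X^T V X."""
    v = fitted_prob * (1.0 - fitted_prob)     # diagonal entries of V
    XtVX = X.T @ (v[:, None] * X)             # X^T V X without materializing diag(V)
    cov = np.linalg.inv(XtVX)                 # estimated covariance of the coefficients
    return np.sqrt(np.diag(cov))              # Equation (1): square roots of the diagonal
```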
Since V is not separable on its own, intermediate terms are exchanged among the clients so that the final matrix V can be computed. Additionally, V should not be known to the central server, because information about X could be reverse-engineered from it using previously seen data. Therefore, at this step, the matrix V must be kept secret from the server.
The first client to connect to the closed network acts as the lead client and collects the first intermediate matrix from the other clients. The lead client then generates V and sends it back to all clients. Finally, each client sends the second intermediate matrix, V^(1/2) X_i, back to the server. Because the matrix V is hidden from the server, the individual-level data are protected. Since X^T V X is separable as follows
X^T V X = [ X_i^T V X_j ]_(i,j = 1,…,k) = [ (V^(1/2) X_i)^T (V^(1/2) X_j) ]_(i,j = 1,…,k),   (2)
where k is the number of clients, directly interpretable statistics such as the Z score, Z_j = β_j / SE(β_j), can be calculated, and confidence intervals and P-values can be derived. The pseudo-code is presented in Algorithm 1. Since each partial block matrix X_i^T V X_j has a different size (p_i × p_j), the problem turns into 'puzzle solving': the partial block matrices must be placed in the right positions. See the matrices-puzzle-solving pseudo-code in Algorithm 2. As an example, the execution for k = 3 is shown in Figure 1. 'Row_Block i' is defined as binding k matrices column-wise, where k is the number of clients.
Figure 1:
Example of the VERTIGO-CI matrix-puzzle combination for 3 clients. Here the dimensions of X1, X2, X3, and V are n × p1, n × p2, n × p3, and n × n, respectively, where n is the number of patients and pi is the number of variables in client i; p = p1 + p2 + p3 is the total number of variables. 'Row_Block i' is defined as binding k matrices column-wise, where k is the number of clients.
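Under these assumptions, the server-side assembly of Equation (2) and the resulting Wald statistics can be sketched as follows. This is our illustration, not the authors' released code: client_parts is assumed to hold the matrices V^(1/2) X_i received from the clients in a fixed order, and beta_hat is the converged coefficient vector.

```python
import numpy as np
from scipy import stats

def assemble_covariance(client_parts):
    """Build (X^T V X)^(-1) from the per-client matrices V^(1/2) X_i.

    Block (i, j) of X^T V X equals (V^(1/2) X_i)^T (V^(1/2) X_j); np.block
    places each p_i x p_j block in its position, mirroring the matrix puzzle
    of Algorithm 2 and Figure 1.
    """
    blocks = [[part_i.T @ part_j for part_j in client_parts]
              for part_i in client_parts]
    return np.linalg.inv(np.block(blocks))

def wald_statistics(beta_hat, cov):
    """Z scores, two-sided P-values, and 95% confidence intervals."""
    se = np.sqrt(np.diag(cov))
    z = beta_hat / se
    p_values = 2.0 * stats.norm.sf(np.abs(z))
    ci = np.column_stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se])
    return z, p_values, ci
```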
Algorithm 1 VERTIGO-CI

Implementation
We implemented VERTIGO-CI in Python 3.7, using the numpy, pandas, and scipy modules for the mathematical computations. We used the asyncio module for network programming to allow asynchronous operations. All testing was performed on Amazon Web Services (AWS) EC2 r5a.2xlarge instances (64 GB memory, 8 CPUs) running Ubuntu 18.04 in data centers on five continents (Asia: Seoul, Australia: Sydney, Europe: Dublin, North America: Oregon/Virginia, and South America: Sao Paulo).
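As a toy illustration of this asyncio pattern (not the released VERTIGO-CI code: the port, the length-prefixed .npy framing, and all names are our own choices), the sketch below runs a server coroutine that receives one serialized numpy matrix from a client and acknowledges it.

```python
import asyncio
import io
import numpy as np

async def handle_client(reader, writer):
    """Server side: read one length-prefixed .npy payload and acknowledge it."""
    header = await reader.readexactly(8)                   # 8-byte big-endian length prefix
    payload = await reader.readexactly(int.from_bytes(header, "big"))
    matrix = np.load(io.BytesIO(payload))                  # deserialize the client's matrix
    writer.write(f"received matrix of shape {matrix.shape}\n".encode())
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def send_matrix(host, port, matrix):
    """Client side: length-prefix and ship one numpy array, then print the reply."""
    reader, writer = await asyncio.open_connection(host, port)
    buf = io.BytesIO()
    np.save(buf, matrix)
    payload = buf.getvalue()
    writer.write(len(payload).to_bytes(8, "big") + payload)
    await writer.drain()
    print((await reader.readline()).decode().strip())
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle_client, "127.0.0.1", 8888)
    async with server:
        await send_matrix("127.0.0.1", 8888, np.eye(3))    # one client request, then exit

asyncio.run(main())
```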
Algorithm 2 Matrices-puzzle-solving

Results
The proposed method's correctness is reported in Table 1 using the maximum absolute distance from the ground truth across the 20 estimates for the 20 features of the synthetic dataset. All 20 coefficients specified in the simulation model achieved near-perfect agreement. The runtime of the proposed method increased exponentially with increases in the sample size and in the number of clients (Figure 2). The runtime increased only slightly when the number of features was increased. The effect of the physical distance between clients and the server was evaluated using different cloud data centers. Six different Amazon Web Services (AWS) data centers were selected for co-locating all four clients, while the server was kept in Virginia, US. Scattering all four clients across four different locations (blue line) took the longest execution time. As a baseline (pink), co-locating all four clients and the central server in one data center (Virginia) achieved the shortest computation time. Remote data centers, from Dublin to Sydney, were tested to observe the effect of the client data center's physical distance from the server data center in Virginia. Interestingly, the trans-US route (Oregon - Virginia) took longer than the trans-Atlantic (Dublin - Virginia) or trans-America (Sao Paulo - Virginia) routes. The reason may lie in the multiple network hops along the Oregon-Virginia connection, whereas the other routes are connected directly by submarine cables.

In BURN1000 (the first real dataset), VERTIGO-CI achieved near-perfect agreement between its estimates and the ground truth (Table 2). Its average runtime varied between roughly 12 and 16 seconds as the number of clients ranged from 2 to 4 (Table 3).

In PENNCATH (the second real dataset), the proposed method showed good agreement between the federated and centralized coefficient estimates (Table 4). However, the differences in the estimated standard errors were about one order of magnitude larger than those in the coefficients. The runtime increased linearly with the number of SNPs; the mean running time for each trial is shown in Table 5.
Table 1: The difference in parameter estimates on synthetic data. The difference was measured in the L∞ norm, i.e., the maximum absolute distance from the ground truth across the 20 estimates. The dataset had 2000 samples and 20 features.
| Number of Clients | Difference in Coefficient | Difference in Std Error |
| 2 | 1.34 × 10^-6 | 5.31 × 10^-8 |
| 3 | 1.34 × 10^-6 | 5.22 × 10^-8 |
| 4 | 1.34 × 10^-6 | 5.34 × 10^-8 |
Figure 2:
Computation time on the synthetic data. The time includes intermediate file transfer in both directions, client-to-client and client-to-server. A: sample size and number of clients varied with the number of features fixed at 20. B: number of features and number of clients varied with the sample size fixed at 2000. C: runtime across different AWS data centers. The blue line represents the runtime when all four clients were scattered across four data centers away from Virginia, where the server was located. Each of the other colors represents a pair of data centers: one where all four clients were co-located and the other (Virginia) hosting the server.
Table 2: Accuracy of VERTIGO-CI on the BURN1000 data. Abbreviations: total burn surface area (TBSA, 0-100%), flame involved in burn injury (flame), burn involved in inhalation injury (inh inj), standard error (SE).
| Variable | Coeff | Difference in Coeff | SE | Difference in SE |
| Intercept | -3.819841 | 4.978316e-07 | 0.296338 | -5.805270e-08 |
| facility | -0.176201 | -3.277626e-07 | 0.139130 | -3.553973e-08 |
| age | 2.075578 | 3.407076e-08 | 0.217424 | -9.323797e-09 |
| tbsa | 1.741145 | -5.004354e-08 | 0.179537 | -1.389442e-08 |
| gender_male | -0.069838 | -2.018457e-08 | 0.142060 | -9.766352e-09 |
| race_white | -0.347684 | -3.992930e-08 | 0.153023 | -7.573084e-09 |
| inhalation_injury | 0.439069 | 7.644087e-08 | 0.118723 | -1.191069e-08 |
| flame_involved | 0.291130 | -1.952836e-07 | 0.178000 | -2.107444e-08 |
Table 3: Runtime on the BURN1000 data with a varying number of clients.
| Number of Clients | Mean running time (s) |
| 2 | 12.4515 |
| 3 | 14.1357 |
| 4 | 15.9227 |
Table 4: Accuracy of VERTIGO-CI on the PENNCATH data. Abbreviations: high-density lipoprotein (hdl), low-density lipoprotein (ldl), principal component (pc), triglycerides (tg), standard error (SE).
| Variable | Coeff | Difference in Coeff | SE | Difference in SE |
| sex | -1.200262 | 2.007786e-07 | 0.005065 | 1.391678e-01 |
| age | -0.032013 | 1.330223e-06 | 0.144231 | 1.393521e-01 |
| tg | 0.011913 | 1.658586e-08 | 0.001869 | 7.267426e-04 |
| hdl | 0.015559 | 2.796433e-07 | 0.004879 | 1.855911e-04 |
| ldl | 0.006471 | 8.162119e-07 | 0.001142 | 7.268230e-04 |
| pc1 | 1.057632 | 6.040747e-06 | 2.405702 | 3.457518e-05 |
| pc2 | -3.234316 | 1.851210e-05 | 2.378695 | 1.801008e-02 |
| pc3 | -2.172853 | 1.293278e-05 | 2.404689 | 2.596270e-02 |
| pc4 | -1.136879 | 7.012701e-06 | 2.392821 | 1.190317e-02 |
| pc5 | 1.449743 | 8.630703e-06 | 2.408969 | 1.611641e-02 |
| pc6 | 0.060668 | 8.945382e-08 | 2.401000 | 8.005713e-03 |
| pc7 | 2.508987 | 1.462683e-05 | 2.424523 | 2.348583e-02 |
| pc8 | -3.037303 | 1.840741e-05 | 2.449719 | 2.515850e-02 |
| pc9 | -2.629828 | 1.576750e-05 | 2.422779 | 2.698205e-02 |
| pc10 | -0.983910 | 6.774366e-06 | 2.396671 | 2.614376e-02 |
Table 5: Runtime on the PENNCATH data with a varying number of SNPs.
| Number of SNPs | Mean runtime (s) | Standard deviation of runtime |
| 10 | 260.2662 | 51.4977 |
| 100 | 2625.8944 | 13.6653 |
| 1000 | 26159.9742 | 0.6426 |
Discussion
We proposed a novel method that embeds a client-to-client communication step to enhance the interpretability of VERTIGO with hypothesis-testing statistics, namely the standard error, Z score, and P-value, as well as confidence intervals for each coefficient. Using both synthetic and real datasets, we demonstrated the correctness of VERTIGO-CI by showing that its estimates are essentially identical to those of centralized logistic regression, with acceptable runtime for a small to mid-sized number of features. Our proposed method's novel contribution is the standard error of the point estimates, which allows statistical decisions using P-values and confidence intervals. As the original VERTIGO work showed, using a fixed Hessian matrix in Newton's method can greatly reduce the computational complexity. However, inverting the fixed Hessian matrix is still non-trivial. Another potential problem is the size of the Gram matrix during communication: a 10,000 × 10,000 Gram matrix can take up to 60 GB. We have successfully deployed VERTIGO-CI across servers at different sites, but there is still room for improvement in runtime to handle a very large number of features, as in genomic data.
Contributors
JK designed the study. JK and WL designed and implemented the core regression component. WL and TB developed the network programming component. XJ and LOM critically reviewed and edited the paper. All authors contributed to the manuscript preparation.
References
- 1. Wu Y, Jiang X, Kim J, Ohno-Machado L. Grid Binary LOgistic REgression (GLORE): building shared models without sharing data. Journal of the American Medical Informatics Association. 2012;19(5):758–764. doi: 10.1136/amiajnl-2012-000862.
- 2. Pletcher MJ, Forrest CB, Carton TW. PCORnet's collaborative research groups. Patient Related Outcome Measures. 2018;9:91. doi: 10.2147/PROM.S141630.
- 3. Monsey L, Best LG, Zhu J, DeCroo S, Anderson MZ. The association of mannose binding lectin genotype and immune response to Chlamydia pneumoniae: The Strong Heart Study. PLoS One. 2019;14(1):e0210640. doi: 10.1371/journal.pone.0210640.
- 4. Charbonneau J, Nicol D, Chalmers D, Kato K, Yamamoto N, Walshe J, et al. Public reactions to direct-to-consumer genetic health tests: a comparison across the US, UK, Japan and Australia. European Journal of Human Genetics. 2019:1–10.
- 5. 1000 Genomes Project Consortium, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi: 10.1038/nature15393.
- 6. Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncology. 2015;19(1A):A68. doi: 10.5114/wo.2014.47136.
- 7. Chen F, Wang S, Jiang X, Ding S, Lu Y, Kim J, et al. Princess: Privacy-protecting rare disease international network collaboration via encryption through software guard extensions. Bioinformatics. 2017;33(6):871–878. doi: 10.1093/bioinformatics/btw758.
- 8. Lu CL, Wang S, Ji Z, Wu Y, Xiong L, Jiang X, et al. WebDISCO: a web service for distributed Cox model learning without patient-level data sharing. Journal of the American Medical Informatics Association. 2015;22(6):1212–1219. doi: 10.1093/jamia/ocv083.
- 9. Karr AF, Lin X, Sanil AP, Reiter JP. Secure regression on distributed databases. Journal of Computational and Graphical Statistics. 2005;14(2):263–279.
- 10. Slavkovic AB, Nardi Y, Tibbits MM. "Secure" logistic regression of horizontally and vertically partitioned distributed databases. In: Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007). IEEE; 2007. pp. 723–728.
- 11. Nardi Y, Fienberg SE, Hall RJ. Achieving both valid and secure logistic regression analysis on aggregated data from different private sources. Journal of Privacy and Confidentiality. 2012;4(1).
- 12. Yu H, Jiang X, Vaidya J. Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data. In: Proceedings of the 2006 ACM Symposium on Applied Computing. 2006. pp. 603–610.
- 13. Li Y, Jiang X, Wang S, Xiong H, Ohno-Machado L. Vertical grid logistic regression (VERTIGO). Journal of the American Medical Informatics Association. 2016;23(3):570–579. doi: 10.1093/jamia/ocv146.
- 14. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied Logistic Regression. Vol. 398. John Wiley & Sons; 2013.
- 15. Reilly MP, Li M, He J, Ferguson JF, Stylianou IM, Mehta NN, et al. Identification of ADAMTS7 as a novel locus for coronary atherosclerosis and association of ABO with myocardial infarction in the presence of coronary atherosclerosis: two genome-wide association studies. The Lancet. 2011;377(9763):383–392. doi: 10.1016/S0140-6736(10)61996-4.
- 16. Minka T. A Comparison of Numerical Optimizers for Logistic Regression. Technical report, Microsoft Research; 2003.
- 17. Seber GA, Lee AJ. Linear Regression Analysis. Vol. 329. John Wiley & Sons; 2012.
- 18. Snyman JA, Wilke DN. Practical Mathematical Optimization: Basic Optimization Theory and Gradient-Based Algorithms. Vol. 133. Springer; 2018.


