Published in final edited form as: IEEE Trans Inf Forensics Secur. 2024 May 17;19:5751–5766. doi: 10.1109/tifs.2024.3402319

Efficient Privacy-preserving Logistic Model With Malicious Security

Guanhong Miao 1, Samuel S Wu 1
PMCID: PMC11236440  NIHMSID: NIHMS1998323  PMID: 38993695

Abstract

Conducting secure computations to protect against malicious adversaries is an emerging field of research. Current models designed for malicious security typically necessitate the involvement of two or more servers in an honest-majority setting. Among privacy-preserving data mining techniques, significant attention has been focused on the classification problem. Logistic regression emerges as a well-established classification model, renowned for its impressive performance. We introduce a novel matrix encryption method to build a maliciously secure logistic model. Our scheme involves only a single semi-honest server and is resilient to malicious data providers that may deviate arbitrarily from the scheme. The d-transformation ensures that our scheme achieves indistinguishability (i.e., no adversary can determine, in polynomial time, which of the plaintexts corresponds to a given ciphertext in a chosen-plaintext attack). Malicious activities of data providers can be detected in the verification stage. A lossy compression method is implemented to minimize communication costs while preserving negligible degradation in accuracy. Experiments illustrate that our scheme is highly efficient in analyzing large-scale datasets and achieves accuracy similar to non-private models. The proposed scheme outperforms other maliciously secure frameworks in terms of computation and communication costs.

Index Terms—: Privacy-preserving, logistic model, malicious adversary, indistinguishability

I. Introduction

The Internet of Things (IoT) is gradually entering our lives, with wireless communication systems increasingly employed as the technology driver for smart monitoring and applications. An IoT system can be depicted as smart devices that interact on a collaborative basis toward a common goal. Smart cities are incorporating a wide range of advanced IoT infrastructures, resulting in a large amount of data gathered from different devices deployed in many domains, such as health care, energy transmission, and transportation [1]. Smart things provide efficient tools for ubiquitous data collection and tracking, but they also face privacy threats.

To address the challenges arising from IoT data processing and analysis, an increasing number of innovations have emerged recently. For instance, collaborative learning is a desirable and empowering paradigm for smart IoT systems. Collaborative learning enables multiple data providers to learn models utilizing all of their data jointly [2], [3]. Typical collaborative systems are distributed computing systems such as secure multi-party computation (SMC) frameworks [2], [4]. SMC enables parties to jointly compute on private inputs without revealing anything but the result.

Collaborative learning has benefited society, including medical research [5]. Data containing healthcare informatics are usually collected in medical centers such as hospitals. Generally, a study center does not share data with other institutes, considering the confidentiality of participants. To study disease mechanisms, especially for rare diseases for which each center has limited cases, it is important to perform data analysis combining data from multiple institutes. Collaborative learning provides great promise for connecting healthcare data sources. Since sharing individual-level data is not permitted by law or regulation in many domains, various privacy-preserving techniques have been developed to perform collaborative learning.

Many privacy-preserving techniques assume semi-honest models, in which the server and clients follow the protocol specification. Because clients could be any arbitrary entity, it is less likely that all the clients (i.e., data providers) would be semi-honest. Recently, maliciously secure models [6], [7] have been proposed to achieve privacy in the presence of malicious adversaries that could deviate arbitrarily from the protocol specification. Based on the assumption about the number of servers that can be malicious in the protocol, maliciously secure frameworks operate in either an honest-majority setting [6]–[14] or a malicious-majority setting [15]–[18]. These frameworks typically rely on multiple servers (e.g., the three-server model [6], [7], [9], [10] and the four-server model [8], [12]–[14]), with the most common assumption being an honest-majority setting (i.e., a majority of servers are semi-honest). In contrast, the malicious-majority setting anticipates a scenario where a majority of servers may behave maliciously. This setting enhances security in environments where a significant portion of servers may be untrustworthy, providing a more realistic and robust solution in adversarial conditions. Since the efficiency of SMC protocols is highly dependent on the number of honest servers [13], maliciously secure frameworks in the malicious-majority setting are less efficient than those in the honest-majority setting.

In this paper, we propose a privacy-preserving logistic model scheme, assuming a dishonest majority in a maliciously secure setting. We assume that data are horizontally distributed among data providers (i.e., each data provider is a client that collects information of the same features for different samples). Our contributions are summarized as follows:

  1. We propose a novel matrix encryption technique to build a maliciously secure logistic model. Unlike state-of-the-art frameworks that necessitate the involvement of two or more servers in an honest-majority setting, our scheme involves only a single semi-honest server and is resilient to malicious attacks conducted by data providers. Malicious behaviors conducted by data providers are detectable during the verification stage.

  2. The proposed matrix encryption method combines Gaussian matrix encryption with the d-transformation and commutative matrix encryption. The d-transformation ensures that random records within any energy range are indistinguishable. The commutative matrix encryption is applied to preserve data utility.

  3. We utilize a lossy compression method to reduce communication costs while ensuring negligible degradation in accuracy. Compared with other maliciously secure frameworks, our scheme is more efficient in analyzing large-scale datasets in terms of computation and communication costs.

II. Related work

Secure multi-party computation (SMC)

SMC frameworks with a small number of parties have proven particularly attractive recently. Among these frameworks, homomorphic encryption (HE) [22], [23] has been widely used to protect data privacy [10], [11], [18]–[20]. Recent advances on garbled circuits [24], [25] have led to a set of privacy-preserving protocols [15]–[17] for SMC tolerating an arbitrary number of malicious corruptions. Garbled circuit and HE techniques require large volumes of ciphertexts to be transferred or have high computation complexity. In terms of efficient constructions, various secure frameworks in an honest-majority setting [6]–[14] have drawn phenomenal attention. The details of these frameworks are summarized in Table I.

Table I:

Recent secure multi-party computation (SMC) frameworks

Framework No. of parties/servers Encryption method Threat model Collusion assumption
[15] 2 Garbled circuit Malicious Malicious-majority
[16] ≥2 Garbled circuit Malicious Malicious-majority
[17] ≥2 Garbled circuit Malicious Malicious-majority
[8] 4 Garbled circuit Malicious Honest-majority
[9] 3 Garbled circuit Malicious Honest-majority
[18] Homomorphic encryption Malicious Malicious-majority
[19] 1 Homomorphic encryption Semi-honest Passive adversary
[10] 3 Mixed Malicious Honest-majority
[11] 2 Mixed Malicious Honest-majority
[20] 2 Mixed Semi-honest Passive adversary
[12] 3, 4 Joint message passing Malicious Honest-majority
[13] 3, 4 SPDZ Malicious Honest-majority
[21] 2 Secret sharing Semi-honest Passive adversary
[14] 4 Secret sharing Malicious Honest-majority
[6] 3 Secret sharing Malicious Honest-majority
[7] 3 Secret sharing Malicious Honest-majority
Our 1 Matrix encryption Malicious Malicious-majority

Malicious model: the entity deviates arbitrarily from the protocol specification. Semi-honest model: the entity follows the prescribed protocol but attempts to gain unauthorized information by covertly observing the communication or computations of other entities involved. Mixed encryption method: the framework applies both garbled circuit and homomorphic encryption.

Differential privacy

Differential privacy (DP) [26] has been widely incorporated into distributed deep learning [27]–[29] by adding noise to input data, loss functions, gradients, weights, or output classes. Moreover, DP has been applied to enable secure exchanges of intermediate data and to obtain models resilient to adversarial inference in federated learning [30]–[32]. Implementing DP in practice remains challenging, however, since training robust and accurate models requires high privacy budgets, and the level of privacy achieved in practice remains unclear [33].

Matrix encryption

Matrix encryption has been extensively utilized in the development of compressed sensing (CS)-based cryptosystems [34]–[37]. This approach is well-suited for ensuring the security of practical applications, such as the Internet of Things and multimedia. The Gaussian one-time sensing CS-based cryptosystem, which employs a random Gaussian matrix and renews the matrix at each encryption, has been proven to be asymptotically secure for the plaintext with constant energy [34]–[36]. It is challenging to practically implement these CS-based cryptosystems because the indistinguishability of Gaussian matrix encryption is highly sensitive to variations in the energy of plaintexts.

To summarize, a majority of maliciously secure models necessitate the involvement of two or more servers in an honest-majority setting. Moreover, existing secure models have relatively low efficiency for large-scale data analysis. Previous studies of Gaussian matrix encryption ensure indistinguishability only among records with constant energy (i.e., Euclidean norm), which poses a practical challenge when data have arbitrary energy ranges. This paper introduces a maliciously secure logistic model that ensures indistinguishability among random records within any energy range. Our model assumes the malicious-majority setting and is highly efficient in analyzing datasets of substantial size.

III. Preliminaries

A. Logistic model

Consider a set of data $\mathcal{D}=\{(x_1,y_1),\ldots,(x_n,y_n)\}$, where $x_i\in\mathbb{R}^q$ and $y_i\in\{0,1\}$ denotes the binary outcome (such as case/control status) of $x_i$ $(i=1,\ldots,n)$. Without loss of generality, a constant 1 is typically added as the first element of each record $x_i$ to account for the intercept. The logistic model [38], [39] has the form

$$\log\frac{\Pr(y_i=1\mid x_i)}{\Pr(y_i=0\mid x_i)}=x_i^T\beta, \quad (1)$$

where $\beta=(\beta_1,\ldots,\beta_q)^T$ is a $q$-dimensional coefficient vector and $\Pr(\cdot)$ is the probability function. The logistic model is typically fitted through maximum likelihood, using the conditional likelihood. The log-likelihood is

$$\ell(\beta)=\sum_{i=1}^n\big[y_i\log p(x_i;\beta)+(1-y_i)\log(1-p(x_i;\beta))\big] \quad (2)$$

where $p(x_i;\beta)=\Pr(y_i=1\mid x_i;\beta)=\frac{\exp(x_i^T\beta)}{1+\exp(x_i^T\beta)}$.

For the ridge-regularized logistic model, we maximize the log-likelihood subject to a size constraint on the $L_2$-norm (i.e., Euclidean norm) of the coefficients. The ridge estimate is

$$\hat{\beta}^{\text{ridge}}=\arg\min_\beta\Big\{-\ell(\beta)+\frac{\lambda}{2}\|\beta\|_2^2\Big\} \quad (3)$$

where $\lambda\geq 0$ is the ridge parameter.

Let $Y=(y_1,\ldots,y_n)^T$ denote the outcome, $X=(x_1,\ldots,x_n)^T$ denote the $n\times q$ feature matrix, and let $W$ be the $n\times n$ diagonal matrix of weights (Equation 4):

$$W\equiv\mathrm{diag}\big(p(x_1;\beta)(1-p(x_1;\beta)),\ldots,p(x_n;\beta)(1-p(x_n;\beta))\big). \quad (4)$$

We use Newton's method to fit the logistic model. Given $\beta^{\text{old}}$, a single Newton update is

$$\beta^{\text{new}}=\beta^{\text{old}}-\Big(\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^T}\Big)^{-1}\frac{\partial\ell(\beta)}{\partial\beta} \quad (5)$$

where the derivatives are evaluated at $\beta^{\text{old}}$. The update can be expressed in matrix notation as

$$\beta^{\text{new}}=(X^TW^{\text{old}}X+\Lambda)^{-1}\big[X^TW^{\text{old}}X\beta^{\text{old}}+X^T(Y-p^{\text{old}})\big] \quad (6)$$

where $p^{\text{old}}=(p(x_1;\beta^{\text{old}}),\ldots,p(x_n;\beta^{\text{old}}))^T$ and $p(x_i;\beta^{\text{old}})(1-p(x_i;\beta^{\text{old}}))$ is the $i$-th diagonal element of the diagonal matrix $W^{\text{old}}$. $\Lambda$ is a matrix of zeros for the non-regularized logistic model. For the ridge-regularized logistic model, $\Lambda$ is a diagonal matrix with diagonal elements $\{0,\lambda,\ldots,\lambda\}$.
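The update in Equation 6 maps directly to a few lines of linear algebra. The sketch below is our own minimal NumPy illustration, not the authors' Matlab implementation; the function name, tolerance, and iteration cap are illustrative assumptions.

```python
# Minimal sketch of the Newton update in Equation 6 for the (optionally
# ridge-regularized) logistic model. Illustrative only; not the paper's code.
import numpy as np

def newton_logistic(X, Y, lam=0.0, tol=1e-6, max_iter=50):
    n, q = X.shape
    beta = np.zeros(q)
    # Lambda: zeros for the non-regularized model; diag(0, lam, ..., lam)
    # for ridge, so the intercept is not penalized.
    Lam = np.diag(np.r_[0.0, np.full(q - 1, lam)])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # p(x_i; beta)
        w = p * (1.0 - p)                             # diagonal of W
        H = X.T @ (w[:, None] * X) + Lam              # X^T W X + Lambda
        rhs = X.T @ (w * (X @ beta)) + X.T @ (Y - p)  # X^T W X beta + X^T (Y - p)
        beta_new = np.linalg.solve(H, rhs)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```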

B. Indistinguishability

Indistinguishability has been widely used as the security measure in recent cryptosystems. Using different notations (e.g., Definitions 3.9 and 3.10 in [40], Definition 2.1 in [41], Definition 2 in [42], Definition "PrvInd" in [43], Definition 2 in [44], Definition 1 in [45], Definition 1 in [35], Section III in [36]), all these definitions express the same security level: a cryptosystem is indistinguishable if no adversary can determine in polynomial time which of two plaintexts corresponds to a ciphertext with probability significantly better than that of a random guess. In other words, within an indistinguishable cryptosystem, an adversary cannot learn any partial information about the plaintext in polynomial time given a ciphertext. Comprehensive comparisons of indistinguishability with other security measures (e.g., differential privacy) are given in [41], [42]. In line with other cryptosystems utilizing matrix encryption methods [35], [36], [44], we provide the formal definition of indistinguishability, denoted as Definition 1, following Definition 1 in [35], Section III in [36], and Definition 2 in [44].

Definition 1. Let $p_d$ be the probability that an adversary can successfully discern which of two plaintexts corresponds to the ciphertext using any algorithm that operates within polynomial time. Then a cryptosystem is indistinguishable if there is a negligible function $\epsilon(q)$ such that for all plaintext lengths $q$,

$$p_d\leq\frac{1}{2}+\epsilon(q). \quad (7)$$

$\epsilon(q)$ is negligible if for every positive constant $c$ there exists an integer $q_c$ such that $\epsilon(q)<q^{-c}$ for all $q>q_c$.

Let $d_{TV}(p_1,p_2)$ be the total variation (TV) distance [46] between $p_1=\Pr(y\mid t_1)$ and $p_2=\Pr(y\mid t_2)$, where $p_i$ is the probability distribution of $y$ conditioned on $t_i$ $(i=1,2)$. Based on [47], the probability of successfully distinguishing the plaintexts is bounded by

$$p_d\leq\frac{1}{2}+\frac{d_{TV}(p_1,p_2)}{2}, \quad (8)$$

where $d_{TV}(p_1,p_2)\in[0,1]$. If $d_{TV}(p_1,p_2)=0$, the probability of success is at most that of a random guess, leading to indistinguishability [40].

Computing $d_{TV}(p_1,p_2)$ directly is difficult [48], so we employ the Hellinger distance [46] to bound the TV distance. Let $d_H(p_1,p_2)$ be the Hellinger distance [46]; it gives both lower and upper bounds on the TV distance [49], i.e.,

$$d_H^2(p_1,p_2)\leq d_{TV}(p_1,p_2)\leq d_H(p_1,p_2)\sqrt{2-d_H^2(p_1,p_2)} \quad (9)$$

where $d_H(p_1,p_2)\in[0,1]$. Moreover, if $p_1$ and $p_2$ are multivariate Gaussian distributions (i.e., the ciphertext $y$ conditioned on $t_h$ follows a Gaussian distribution with zero mean and covariance matrix $C_h$, $h\in\{1,2\}$), the Hellinger distance between $p_1$ and $p_2$ is given by [50] and [51] as

$$d_H(p_1,p_2)=\sqrt{1-\frac{|C_1|^{1/4}|C_2|^{1/4}}{|C_3|^{1/2}}} \quad (10)$$

where $C_3$ is the average of $C_1$ and $C_2$ (i.e., $C_3\equiv\frac{C_1+C_2}{2}$). Formal definitions and properties of the total variation and Hellinger distances are given in [46]–[48].
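For intuition, the quantities in Equations 8-10 can be evaluated numerically. The following sketch is ours (the covariance values are arbitrary); it computes the Hellinger distance between two zero-mean Gaussians via Equation 10 and the induced TV and success-probability bounds.

```python
# Numeric sketch of Equations 8-10: Hellinger distance between two zero-mean
# Gaussians with covariances C1 and C2, and the induced bounds on the total
# variation distance and on the distinguishing probability p_d.
import numpy as np

def hellinger_gaussian(C1, C2):
    C3 = (C1 + C2) / 2.0                                 # Equation 10
    (_, ld1), (_, ld2), (_, ld3) = (np.linalg.slogdet(C) for C in (C1, C2, C3))
    # |C1|^(1/4) |C2|^(1/4) / |C3|^(1/2) computed in log space for stability.
    return np.sqrt(1.0 - np.exp(0.25 * ld1 + 0.25 * ld2 - 0.5 * ld3))

def tv_bounds(dH):
    return dH**2, dH * np.sqrt(2.0 - dH**2)              # Equation 9

C1, C2 = 4.0 * np.eye(5), 9.0 * np.eye(5)                # arbitrary example
dH = hellinger_gaussian(C1, C2)
tv_low, tv_up = tv_bounds(dH)
pd_bound = 0.5 + tv_up / 2.0                             # Equation 8
```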

C. Adversarial attacks on matrix encryption methods

In previous privacy-preserving frameworks using matrix encryption techniques [37], [52]–[54], adversarial attack models are classified into four levels: ciphertext-only attack (COA), known-plaintext attack (KPA), chosen-plaintext attack (CPA), and chosen-ciphertext attack (CCA).

Ciphertext-only attack (level 1):

The adversary is assumed to have access to ciphertexts and no other information. Within the COA, the adversary attempts to retrieve sensitive information using the ciphertexts.

Known-plaintext attack (level 2):

The adversary has access to the ciphertexts and corresponding plaintexts. Within the KPA, the attacker attempts to recover sensitive information by analyzing ciphertexts and their corresponding plaintexts.

Chosen-plaintext attack (level 3):

Given any plaintext, the adversary can get its corresponding ciphertext within the CPA. The adversary attempts to recover the encryption key or algorithm by examining associations between plaintexts and ciphertexts.

Chosen-ciphertext attack (level 4):

Within the CCA, the adversary has the capability to obtain the decryption of any ciphertexts of its choice. The adversary attempts to determine the plaintext that was encrypted to give some other ciphertexts.

In our scheme, no ciphertext is ever decrypted, so it is impossible for adversaries to obtain the decryption of any ciphertext. Therefore, adversaries cannot perform a CCA, and we only consider the first three attacks.

D. Matrix encryption

Assume $x$ is a random row in the dataset $X$ containing $n$ rows and $q$ columns. To encrypt $x$, a random Gaussian matrix $B$ of dimensions $q\times q$ is generated, where each element follows a Gaussian distribution $N(\mu,\sigma^2)$ with parameters $\mu$ and $\sigma$. The encryption function for row $x$ and data $X$ can be summarized as $f_{B,r}(x)\equiv xB$ and $f_{B,r}(X)\equiv XB$. Each column in the dataset $X$ can be encrypted similarly. Specifically, let $x_c$ be a random column in $X$. An $n\times n$ random Gaussian matrix $A$ is generated to encrypt $x_c$. The encryption function for column $x_c$ and data $X$ can be summarized as $f_{A,l}(x_c)\equiv Ax_c$ and $f_{A,l}(X)\equiv AX$.
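Both encryption functions are plain matrix products. A short illustrative sketch follows; all sizes, distribution parameters, and the seed are our own choices.

```python
# Sketch of the row- and column-wise Gaussian matrix encryption functions
# f_{B,r}(X) = XB and f_{A,l}(X) = AX. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, q = 6, 4
X = rng.standard_normal((n, q))                   # plaintext

B = rng.normal(loc=0.0, scale=1.0, size=(q, q))   # q x q right-multiplier
A = rng.normal(loc=0.0, scale=1.0, size=(n, n))   # n x n left-multiplier

enc_rows = X @ B        # encrypts each row:    f_{B,r}(X) = XB
enc_cols = A @ X        # encrypts each column: f_{A,l}(X) = AX
```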

IV. System Overview

A. System model

We investigate collaborative learning in which data are collected and owned by different data providers, referred to as clients. The goal is to build an efficient logistic model using data from all the clients while ensuring privacy protection. We consider that data are horizontally distributed, i.e., clients have different sets of samples and the same set of features (Figure 1A).

Fig. 1: A: an example showing the horizontal partitioning scenario with three data providers (referred to as clients); B: workflow of the proposed privacy-preserving logistic model.

The proposed scheme involves multiple clients and one server who is responsible for the secure computation. The privacy-preserving scheme contains four stages: encryption, modeling, decryption and verification (Figure 1B). Clients perform data encryption as the initial step. After the encryption process, the data are sent to the server for the secure computation. The server then sends encrypted model results back to clients. Subsequently, clients decrypt the model results. Finally, the server and clients initiate the verification stage to identify any malicious activity conducted by the clients.

B. Threat model

Maliciously secure frameworks with multiple servers have been studied extensively [6], [7], [12], [13]. These frameworks assume at least one server is semi-honest and does not collude with malicious adversaries. Following these frameworks, the sole server in our scheme is assumed to be semi-honest, while clients are allowed to be malicious. Precisely, we assume that the server faithfully executes the delegated computations but may be curious about the intermediate data and try to learn or infer sensitive information. In contrast, clients may act maliciously (i.e., arbitrarily deviate from the predefined scheme to cheat others). Clients may collude with each other, while the server is not allowed to collude with malicious clients. The possible adversarial behaviors of malicious clients and the semi-honest server are summarized as follows.

  1. To perform the chosen-plaintext attack (CPA), clients insert fake plaintexts into the privacy-preserving scheme and collude with each other to share both plaintexts and their corresponding ciphertexts. A detailed description of the CPA is given in Section VI.

  2. Malicious clients do not follow the proposed encryption method to encrypt data (i.e., each client’s data should be encrypted sequentially by all the clients using commutative matrices). After getting sufficient data for CPA, malicious clients may choose to skip subsequent computations in order to reduce the computation cost.

  3. Malicious clients do not follow the decryption procedure (i.e., the encrypted model result derived by the server should be decrypted sequentially by all the clients using commutative matrices which have been used for data encryption). Once malicious clients have gathered enough information for CPA, they may choose to skip the computations and send fake data to other clients in the decryption procedure.

  4. The semi-honest server attempts to retrieve sensitive information from received ciphertexts.

C. Design goals

Our design goals contain four aspects. Privacy: The private data must remain confidential at all times. Learning verifiability: There should be a verification stage to check whether all the clients behave honestly. Correctness: The scheme derives the correct model result if all the clients and the server behave honestly. Efficiency: The scheme is computationally efficient and achieves high accuracy.

V. Proposed scheme

A. Data encryption, modeling and decryption

Suppose there are $K$ clients and client $i$ owns the data $X_i$ (i.e., plaintext) with $n_i$ samples and $q$ features $(i=1,2,\ldots,K)$. These $K$ clients collect the same $q$ features for different samples. The analytical model is built using the aggregated data, i.e., $X=\begin{pmatrix}X_1\\ \vdots\\ X_K\end{pmatrix}$. To ensure data confidentiality, we introduce a novel privacy-preserving logistic model that employs random matrices for encryption. Specifically, client $i$ encrypts $X_i$ using random encryption matrices $A_i$ and $B_i$. Aggregating these encrypted datasets, we get $X^{enc}=\begin{pmatrix}A_1X_1B_1\\ \vdots\\ A_KX_KB_K\end{pmatrix}$. Since $A_i$ and $B_i$ $(i=1,\ldots,K)$ are generated randomly by client $i$, this aggregated dataset does not preserve data utility. In order to maintain data utility, we require that 1) the $B_i$ $(i=1,\ldots,K)$ are designed to be commutative (i.e., $B_iB_j=B_jB_i$ for $i\neq j$, Appendix A), 2) $X_i$ is subsequently encrypted by every other client $j$ using $B_j$ $(j\neq i)$, and 3) the encryption matrix $A_i$ is decrypted by the server prior to the secure computation. The commutative nature of the random encryption matrices $B_i$ guarantees that the resulting encrypted dataset is independent of the order in which clients perform encryption. Clients send the encrypted data (i.e., ciphertexts) to the server after the encryption. The server then decrypts $A_i$ and obtains the aggregated data $\begin{pmatrix}X_1B\\ \vdots\\ X_KB\end{pmatrix}=XB$ where $B=\prod_{i=1}^K B_i=B_1B_2\cdots B_K$, as $B_i$ and $B_j$ are designed to be commutative (i.e., $B_iB_j=B_jB_i$ for $i\neq j$). Table II summarizes the symbols used in the proposed scheme.

Table II:

Notations

Notation Description
$K$ Number of clients (data providers)
$X_i, Y_i$ Data (plaintext) collected by client $i$ $(i=1,\ldots,K)$
$Z_i$ Transformed outcome $Z_i=Y_i^TX_i$
$n_i$ Number of samples in $X_i$ and $Y_i$
$q$ Number of features in $X_i$
$X, Y$ Aggregated data (plaintext)
$n$ Number of samples in $X$ and $Y$
$x_i$ The $i$-th record (row) in $X$
$x_{c,i}$ The $i$-th column in $X$
$W$ A diagonal matrix (Equation 4)
$A_i$ Random Gaussian matrix generated by client $i$
$A_i^{-1}$ Inverse of matrix $A_i$
$B_0$ Random Gaussian matrix shared among the $K$ clients
$b_{ij}$ Random coefficients generated by client $i$ $(j=1,\ldots,q)$
$B_i$ Commutative matrix generated by client $i$: $B_i=\sum_{j=1}^q b_{ij}B_0^j$
$B$ Encryption matrix $B=\prod_{i=1}^K B_i$
$B^{-1}$ Inverse of matrix $B$
$X^{enc}$ Encrypted $X$ (ciphertext)
$Z^{enc}$ Encrypted outcome (ciphertext)
$X^T$ Transpose of $X$
$d$ A constant to ensure indistinguishability (Algorithm 1)
$\beta$ Model estimate (a $q$-dimensional vector) of the non-secure model
$\beta^{enc}$ Model estimate (a $q$-dimensional vector) derived by the server
$\|x_i\|$ Euclidean norm of $x_i$
$d_H$ Hellinger distance
$d_{TV}$ Total variation (TV) distance
$d_{TV,low}$ Lower bound of the TV distance
$d_{TV,up}$ Upper bound of the TV distance
$p_d$ Success probability in the indistinguishability experiment
$\epsilon(q)$ A negligible function
$Y_s$ Pseudo outcome $Y_s=X^TX\mathbf{1}$
$Y_s^{enc}$ Encrypted pseudo outcome
$f_{A_i,l}(X)$ Encryption function $f_{A_i,l}(X)=A_iX$
$f_{B_0,r}(X)$ Encryption function $f_{B_0,r}(X)=XB_0$
$f_r(X)$ Encryption function $f_r(X)=X\sum_{j=1}^q b_jB_0^{j-1}$

$\mathbf{1}=(1,1,\ldots,1)^T$.

Pre-processing

Before data encryption, a pre-processing procedure is conducted by each client. Specifically, client $i$ generates a pseudo record with all values equal to 1 (i.e., $(1,1,\ldots,1)$) and adds it to $X_i$ as the first row. The added row is used for malicious behavior detection. To encrypt the outcome information (i.e., $Y_i$ collected by client $i$), client $i$ computes $Z_i=Y_i^TX_i$ and concatenates it to $X_i$ as the last row. Additionally, client $i$ calculates the Euclidean norm of each column in $X_i$. Let $c_j$ be the Euclidean norm of the $j$-th column $(j=1,\ldots,q)$ and $c_m=\max_j c_j$. Client $i$ generates a vector $c_v$ whose $j$-th element is $\sqrt{c_m^2-c_j^2}$; $c_v$ is added to $X_i$ as the last row. This procedure guarantees that the Euclidean norm of each column equals $c_m$ and is essential to ensure the indistinguishability of our encryption approach.

Without loss of generality, we include an intercept in the logistic model. Specifically, a vector of ones is added to $X_i$ $(i=1,\ldots,K)$ as the first column. To achieve indistinguishability, each client multiplies the elements in the first column by a constant $d$, where $d$ is selected by Algorithm 1 (the d-transformation). The d-transformation is performed before data encryption.
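To make the pre-processing concrete, here is a hedged sketch assuming one particular ordering of the steps (pseudo row, outcome row $Z_i$, norm-equalizing row $c_v$, then the d-transformation); the paper's exact interleaving follows Algorithms 1-2, and all names are ours.

```python
# Sketch of client i's pre-processing (Section V-A). Assumes the intercept
# column already sits in column 0 and Yi is a 1-D 0/1 label vector.
import numpy as np

def preprocess(Xi, Yi, d):
    q = Xi.shape[1]
    Zi = Yi @ Xi                                  # Z_i = Y_i^T X_i
    Xi = np.vstack([np.ones(q), Xi, Zi])          # pseudo row first, Z_i last
    c = np.linalg.norm(Xi, axis=0)                # per-column Euclidean norms
    cv = np.sqrt(c.max() ** 2 - c ** 2)           # makes every column norm c_m
    Xi = np.vstack([Xi, cv])
    Xi[:, 0] = Xi[:, 0] * d                       # d-transformation of intercept
    return Xi
```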

The proposed encryption procedures can be categorized into two layers: internal and external. The data are first encrypted by its owner internally and then subsequently encrypted by other clients, referred to as the external encryption.

Internal encryption

Client $i$ first generates a random Gaussian matrix $A_i$ to encrypt $X_i$. To improve the computational efficiency of encryption for clients with a large sample size, we construct $A_i$ as a block-diagonal matrix; the detailed description is given in Appendix B. Client $i$ shares $A_i$ with the server and encrypts the data as $A_iX_i$. Client $i$ further encrypts $A_iX_i$ using a specifically designed matrix $B_i$. To generate this $B_i$, a random $q\times q$ Gaussian matrix $B_0$ is generated and shared among the $K$ clients. Client $i$ then generates a random coefficient vector $(b_{i1},\ldots,b_{iq})$ and a client-specific matrix $B_i=\sum_{j=1}^q b_{ij}B_0^j$ $(i=1,2,\ldots,K)$. This ensures that $B_i$ (generated by client $i$) and $B_j$ (generated by client $j$, $i\neq j$) are commutative, i.e., $B_iB_j=B_jB_i$ (Appendix A). Client $i$ computes $X_i^{enc}=A_iX_iB_i$ and sends $X_i^{enc}$ to the other clients for external encryption.

External encryption

Upon receiving $X_i^{enc}=A_iX_iB_i$ from client $i$, client $i+1$ further encrypts it using $B_{i+1}$ (i.e., $X_i^{enc}=A_iX_iB_iB_{i+1}$) and sends the updated ciphertext to client $i+2$. Client $i+2$ then encrypts the ciphertext using $B_{i+2}$ and sends it to client $i+3$. After all $K-1$ other clients complete the external encryption, the ciphertext has the form $X_i^{enc}=A_iX_iB$ where $B=\prod_{i=1}^K B_i=B_1B_2\cdots B_K$. $X_i^{enc}$ is then sent to the server.
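A compact sketch of the two encryption layers for $K$ clients follows; make_Bi builds the commutative matrix polynomial of Appendix A, and all sizes, names, and the seed are illustrative assumptions (in the real scheme $A_i$ is block-diagonal, per Appendix B).

```python
# Sketch of the internal layer (A_i X_i B_i) and the external layer
# (multiplication by the remaining clients' B_j) of Section V-A.
import numpy as np

rng = np.random.default_rng(1)
q, K = 5, 3
B0 = rng.standard_normal((q, q))                  # shared among clients

def make_Bi(B0, b):
    Bi, P = np.zeros_like(B0), np.eye(B0.shape[0])
    for bj in b:                                   # B_i = sum_j b_j B0^j
        P = P @ B0
        Bi += bj * P
    return Bi

Bs = [make_Bi(B0, rng.standard_normal(q)) for _ in range(K)]
Xs = [rng.standard_normal((8, q)) for _ in range(K)]   # pre-processed data
As = [rng.standard_normal((8, 8)) for _ in range(K)]

ciphertexts = []
for i in range(K):
    C = As[i] @ Xs[i] @ Bs[i]                     # internal: A_i X_i B_i
    for j in range(K):                             # external: remaining B_j
        if j != i:
            C = C @ Bs[j]
    ciphertexts.append(C)   # = A_i X_i B with B = prod_j B_j (order-free)
```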

To build the ridge regression model, client $i$ computes $B_i^TB_i$ and sends it to client $i+1$. Client $i+1$ calculates $B_{i+1}^TB_i^TB_iB_{i+1}$ and sends it to client $i+2$. Each of the $K$ clients conducts this encryption sequentially. Once all clients complete the encryption, $B_K^T\cdots B_1^TB_1\cdots B_K=B^TB$ is sent to the server for the ridge model computation.

Algorithm 2 describes the detailed encryption procedures and Figure 2 gives an example of the encryption procedures with three clients. Table III summarizes the internal and external encryption procedures. The primary goal of the internal encryption layer is to protect against malicious adversaries. To preserve data utility, the data are further encrypted by the other clients using commutative matrices in the external encryption layer.

Fig. 2: An example showing the proposed encryption procedures, including data transformation details, with three clients. $X_i$ denotes the data collected by client $i$ $(i=1,\ldots,K=3)$. $Y_i^TX_i$ is integrated into $X_i$ prior to the data encryption. Data are encrypted by all the clients sequentially. The arrows connect the origin and endpoint of each data transmission.

Table III:

Encryption details for the plaintext Xi (owned by client i)

Encryption layer Client Rationale Affected by malicious adversaries? Encryption matrix
Internal i Withstand malicious adversaries No Ai,Bi
External $1,2,\ldots,i-1,i+1,\ldots,K$ Preserve data utility Yes $\prod_{j\neq i}B_j$

Modeling

Upon receiving $X^{enc}=\begin{pmatrix}A_1X_1B\\ \vdots\\ A_KX_KB\end{pmatrix}$ (and $B^TB$ for ridge regression), the server decrypts $A_i$ to get $\begin{pmatrix}X_1B\\ \vdots\\ X_KB\end{pmatrix}=XB$. For the subsequent analysis, the server redefines $X^{enc}\equiv XB$ and further eliminates the pseudo records in $X^{enc}$. The detailed procedures are described in Algorithm 3. The server retrieves the encrypted outcome from $XB$ and denotes it as $Z_i^{enc}$. According to the encryption procedure, $Z_i^{enc}=Y_i^TX_iB$. The server then derives the $q$-dimensional vector $Z^{enc}\equiv\sum_{i=1}^K Z_i^{enc}$. Given the encrypted data, the Newton update becomes

$$\beta_{\text{new}}^{enc}=\big(X_{enc}^TW_{enc}X_{enc}+\Lambda B^TB\big)^{-1}\times\big[X_{enc}^TW_{enc}X_{enc}\beta_{\text{old}}^{enc}+(Z^{enc})^T-X_{enc}^Tp_{enc}\big] \quad (11)$$

where $p_{enc}=(p(x_1;\beta_{\text{old}}^{enc}),\ldots,p(x_n;\beta_{\text{old}}^{enc}))^T$,

$$p(x_i;\beta_{\text{old}}^{enc})=p(x_i^{enc};\beta_{\text{old}}^{enc})=\frac{\exp((x_i^{enc})^T\beta_{\text{old}}^{enc})}{1+\exp((x_i^{enc})^T\beta_{\text{old}}^{enc})} \quad (12)$$

for $i=1,\ldots,n$, $W_{enc}$ is an $n\times n$ diagonal matrix with the $i$-th diagonal element being $p(x_i;\beta_{\text{old}}^{enc})(1-p(x_i;\beta_{\text{old}}^{enc}))$, and $\Lambda$ is a matrix of zeros for the non-regularized logistic model while $\Lambda$ is a diagonal matrix with diagonal elements $\{0,\lambda,\ldots,\lambda\}$ for ridge regression. The server iterates Equation 11 until convergence (e.g., the absolute difference between $\beta_{\text{new}}^{enc}$ and $\beta_{\text{old}}^{enc}$ falls below $10^{-6}$).

Theorem 1. The privacy-preserving logistic model converges as long as the non-secure logistic model converges.

Proof. Let $\beta_{(0)}^{enc}$ denote the initial point of Newton's method within the privacy-preserving logistic model. According to Theorem 9 (Appendix C), this is equivalent to setting $B\beta_{(0)}^{enc}$ as the initial point within the non-secure model. Given the initial point $B\beta_{(0)}^{enc}$, suppose the non-secure model converges after $s$ iterations with model estimate $\beta$. Based on Theorem 9 (Appendix C), our privacy-preserving model also converges after $s$ iterations with model estimate $\beta^{enc}=B^{-1}\beta$. □

Decryption

The server sends the converged model estimate $\beta^{enc}$ to the clients. As shown in Theorem 9 (Appendix C), $\beta^{enc}=\big(\prod_{i=1}^K B_i\big)^{-1}\beta$ where $\beta$ is the estimate of the non-secure model. To get the true model estimate $\beta$, client $i$ $(i=1,\ldots,K)$ uses its encryption matrix $B_i$ to decrypt $\beta^{enc}$. Figure 3 shows the detailed decryption procedure. $\beta=\big(\prod_{i=1}^K B_i\big)\beta^{enc}$ is the result once all clients complete the decryption.
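Continuing the earlier sketch, the server-side removal of $A_i$ and the clients' sequential decryption might look as follows (function names are ours; the order of the $B_i$ products is irrelevant because they commute).

```python
# Sketch: the server strips A_i to obtain the working data XB, and clients
# later multiply by their B_i to recover beta = (prod_i B_i) beta_enc.
import numpy as np

def server_strip_Ai(ciphertexts, As):
    # X_i B = A_i^{-1} (A_i X_i B); stacking the blocks gives XB.
    return np.vstack([np.linalg.solve(A, C) for A, C in zip(As, ciphertexts)])

def clients_decrypt(beta_enc, Bs):
    beta = beta_enc
    for Bi in Bs:                  # each client applies its own B_i in turn
        beta = Bi @ beta
    return beta                    # equals (prod_i B_i) beta_enc
```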

Fig. 3: The decryption procedure. Client $i$ decrypts $\beta^{enc}$ sequentially $(i=1,\ldots,K)$. The data above each arrow are those transferred between clients.

B. Multiclass classification

Our scheme can be modified to solve the multiclass classification problem. Suppose the outcome contains a total of $e_y$ classes. Client $i$ defines $e_y$ sub-outcomes $Y_i^{(1)},Y_i^{(2)},\ldots,Y_i^{(e_y)}$ using the indicator function $1_j(Y_i)$ as follows:

$$Y_i^{(j)}=1_j(Y_i)=\begin{cases}1 & Y_i=j,\\ 0 & Y_i\neq j,\end{cases}\quad\text{for } j=1,\ldots,e_y. \quad (13)$$

Following the scheme described above, client $i$ calculates $Z_i^{(j)}=(Y_i^{(j)})^TX_i$ $(j=1,\ldots,e_y)$. Before data encryption, client $i$ adds $Z_i^{(1)},\ldots,Z_i^{(e_y)}$ to the feature matrix $X_i$ as the last $e_y$ rows. Define the sub-outcome for the $j$-th class as $Z^{(j)}=\sum_{i=1}^K(Y_i^{(j)})^TX_i$ for $j=1,\ldots,e_y$. After data encryption, the server performs the secure logistic model computation (Algorithm 3) for each of the $e_y$ sub-outcomes using each pair of feature matrix and outcome $(X^{enc},Z^{(j),enc})$ for $j=1,\ldots,e_y$.
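The one-vs-rest reduction in Equation 13 takes only a few lines in practice. The sketch below is ours and assumes labels coded $1,\ldots,e_y$; it builds the $e_y$ rows $Z_i^{(j)}$ that are appended before encryption.

```python
# Sketch of Equation 13: build the e_y sub-outcome rows Z_i^(j) = (Y_i^(j))^T X_i.
import numpy as np

def sub_outcomes(Xi, Yi, ey):
    # Yi holds integer labels 1..ey; row j-1 of the result is Z_i^(j).
    return np.stack([(Yi == j).astype(float) @ Xi for j in range(1, ey + 1)])
```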

C. Lossy compression with SZ

To reduce the communication cost, we employ the SZ lossy compression technique [55] for data compression. SZ is an error-bounded lossy compression scheme [55]–[57]. In our scheme, the data are compressed before being transferred between clients and the server.

D. Verification: malicious behavior detection

To identify if any client has conducted malicious behavior, a designated pseudo outcome Ys is subjected to the same procedures as the original outcome Y. Assuming all clients adhere to our scheme for both Y and Ys, predetermined outputs are anticipated after the verification stage.

Specifically, client $i$ defines a constant $\tau_i\equiv Y_i^TX_i\mathbf{1}$ where $\mathbf{1}=(1,1,\ldots,1)^T$ and shares $\tau_i$ with the server. For the verification, client $i$ generates a pseudo outcome $Y_{s_i}\equiv X_i^TX_i\mathbf{1}$. $Y_s$ is the sum of the $Y_{s_i}$, i.e., $Y_s\equiv\sum_{i=1}^K Y_{s_i}$, and can be expressed as $Y_s=X^TX\mathbf{1}$. Client $i$ further encrypts $Y_{s_i}$ using $B_i$ (i.e., $Y_{s_i}^{enc}=B_i^TY_{s_i}$) and sends the ciphertext $Y_{s_i}^{enc}$ to the other clients. $Y_{s_i}^{enc}$ is subsequently encrypted by client $j$ using $B_j$ $(j=1,\ldots,i-1,i+1,\ldots,K)$. After all the clients have completed the encryption process, $Y_{s_i}^{enc}$ is shared with the server. Upon receiving $Y_{s_i}^{enc}$, the server verifies whether $Z_i^{enc}\big((X_i^{enc})^TX_i^{enc}\big)^{-1}Y_{s_i}^{enc}=\tau_i$ where $Z_i^{enc}=Y_i^TX_iB$. If any client exhibits malicious behavior during the encryption process, the equation will not hold (Theorem 2). Since $(X^TX)^{-1}X^TX\mathbf{1}=\mathbf{1}$, we design the following process to verify whether the clients follow the proposed decryption procedure. The server first calculates the sum of the encrypted pseudo outcomes (i.e., $Y_s^{enc}\equiv\sum_{i=1}^K Y_{s_i}^{enc}$) and the estimate $\beta_s^{enc}\equiv(X_{enc}^TX_{enc})^{-1}Y_s^{enc}$. $\beta_s^{enc}$ can be simplified as follows:

$$\beta_s^{enc}=(X_{enc}^TX_{enc})^{-1}Y_s^{enc}=(X_{enc}^TX_{enc})^{-1}\sum_{i=1}^K Y_{s_i}^{enc}=(X_{enc}^TX_{enc})^{-1}\sum_{i=1}^K B^TX_i^TX_i\mathbf{1}=\big((XB)^TXB\big)^{-1}B^TX^TX\mathbf{1}=B^{-1}\mathbf{1}. \quad (14)$$

The server shares $\beta_s^{enc}$ with the client who performs the verification (e.g., client 1). To confirm that no malicious behavior occurred within the decryption process, client 1 combines $\beta_s^{enc}$ with $\beta^{enc}$. Specifically, client 1 generates two random constants, $\alpha_1$ and $\alpha_2$, and defines a new estimate $\tilde{\beta}^{enc}\equiv\alpha_1\beta^{enc}+\alpha_2\beta_s^{enc}$. $\tilde{\beta}^{enc}$ is decrypted by all the clients following the procedure in Figure 3. Let $\tilde{\beta}$ be the decrypted estimate. Upon obtaining $\tilde{\beta}$, client 1 calculates $\beta_s\equiv\frac{1}{\alpha_2}\big(\tilde{\beta}-\alpha_1B\beta^{enc}\big)$ where $B\beta^{enc}$ is the decrypted model estimate. $\beta_s$ is expected to be a vector of ones if all the clients correctly decrypt both $\tilde{\beta}^{enc}$ and $\beta^{enc}$ following the proposed decryption procedure (Theorem 3). Algorithm 4 summarizes the verification process.
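Both verification steps reduce to small linear-algebra tests. A hedged sketch follows (function names and tolerances are ours; the paper's exact procedure is Algorithm 4).

```python
# Sketch of the two verification checks: the server-side encryption check
# Z_i^enc ((X_i^enc)^T X_i^enc)^{-1} Y_si^enc = tau_i (Equation 17), and the
# client-side decryption check that beta_s is a vector of ones (Equation 20).
import numpy as np

def server_check_encryption(Zi_enc, Xi_enc, Ysi_enc, tau_i, atol=1e-6):
    G = Xi_enc.T @ Xi_enc
    lhs = Zi_enc @ np.linalg.solve(G, Ysi_enc)
    return np.isclose(lhs, tau_i, atol=atol)

def client_check_decryption(beta_tilde, B_beta_enc, a1, a2, atol=1e-6):
    beta_s = (beta_tilde - a1 * B_beta_enc) / a2
    return np.allclose(beta_s, 1.0, atol=atol)   # expect a vector of ones
```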

In order to preserve data utility, all the clients need to follow three encryption criteria. We use a case study involving two clients to illustrate these criteria; the encryption procedures are shown in Figure 4. First, clients must employ uniform encryption matrices to encrypt datasets owned by other clients, which implies that $B_3=B_2$ and $B_4=B_1$ in the example. Second, these encryption matrices must be commutative with each other, i.e., $B_1B_2=B_2B_1$. Third, clients must use uniform inputs across the entire encryption process; more precisely, no client may alter the data owned by other clients during data encryption. To violate the third criterion, client 2 may selectively encrypt specific rows within the dataset $A_1X_1B_1$ or substitute $A_1X_1B_1$ with fake data prior to transmitting it to the server.

Fig. 4: An example of the proposed encryption procedures with two clients. $X_i$ denotes the data collected by client $i$ $(i=1,2)$. $Y_i^TX_i$ is integrated into $X_i$ prior to the data encryption. The internal encryption is performed by each client, which encrypts its own data internally before transmitting it to the other client. The external encryption is then conducted by the other client. Finally, the server decrypts $A_i$ from the received data and retrieves the encrypted outcome information.

Let $\tilde{X}_1^{enc}$ and $\tilde{X}_2^{enc}$ denote the ciphertexts transmitted to the server from client 2 and client 1, respectively. The server then decrypts $A_i$ and extracts the encrypted outcome $\tilde{Z}_i^{enc}$ $(i=1,2)$. If all the clients adhere to these three criteria, the server obtains the ciphertexts in Equations 15 and 16 (C-1, C-2, and C-3 refer to criteria 1 through 3).

$$X^{enc}=\begin{pmatrix}\tilde{X}_1^{enc}\\ \tilde{X}_2^{enc}\end{pmatrix}\overset{\text{C-3}}{=}\begin{pmatrix}A_1X_1B_1B_3\\ A_2X_2B_2B_4\end{pmatrix}\overset{\text{C-1}}{=}\begin{pmatrix}A_1X_1B_1B_2\\ A_2X_2B_2B_1\end{pmatrix}\overset{\text{C-2}}{=}\begin{pmatrix}A_1X_1B_1B_2\\ A_2X_2B_1B_2\end{pmatrix}=\begin{pmatrix}A_1X_1\\ A_2X_2\end{pmatrix}B_1B_2\xrightarrow{\text{Decryption}}\begin{pmatrix}X_1\\ X_2\end{pmatrix}B_1B_2. \quad (15)$$
$$Z^{enc}=\begin{pmatrix}\tilde{Z}_1^{enc}\\ \tilde{Z}_2^{enc}\end{pmatrix}\overset{\text{C-3}}{=}\begin{pmatrix}Y_1^TX_1B_1B_3\\ Y_2^TX_2B_2B_4\end{pmatrix}\overset{\text{C-1}}{=}\begin{pmatrix}Y_1^TX_1B_1B_2\\ Y_2^TX_2B_2B_1\end{pmatrix}\overset{\text{C-2}}{=}\begin{pmatrix}Y_1^TX_1B_1B_2\\ Y_2^TX_2B_1B_2\end{pmatrix}=\begin{pmatrix}Y_1^TX_1\\ Y_2^TX_2\end{pmatrix}B_1B_2. \quad (16)$$

To preserve data utility, 1) $X_i$ and $X_j$ $(i\neq j)$ need to be encrypted by the same encryption matrix $\prod_{k=1}^K B_k$, and 2) $X_i$ and $Y_i^TX_i$ $(i=1,\ldots,K)$ need to be encrypted by the same encryption matrix $\prod_{k=1}^K B_k$. Malicious behavior by any client during data encryption results in a breach of one or both of these two requirements, thereby destroying the utility of the data.

Theorem 2. The proposed verification algorithm can identify malicious behavior during the encryption process.

Proof. Firstly, the server verifies whether $X_i$ and $X_j$ $(i\neq j)$ are encrypted by the same encryption matrix by checking whether the first rows in $X_i^{enc}$ and $X_j^{enc}$ are identical. To meet this requirement, all clients must adhere to the three encryption criteria indicated in Equations 15 and 16. Secondly, the server checks whether $Z_i^{enc}\big((X_i^{enc})^TX_i^{enc}\big)^{-1}Y_{s_i}^{enc}$ equals $\tau_i$ to ascertain whether $X_i$ and $Y_i^TX_i$ have been encrypted using the same encryption matrix. Upon receiving $Z_i^{enc}=Y_i^TX_iB$ and $X_i^{enc}=X_iB'$, the server verifies whether $B=B'$ by calculating

$$Z_i^{enc}\big((X_i^{enc})^TX_i^{enc}\big)^{-1}Y_{s_i}^{enc}=Y_i^TX_iB\big((X_iB')^TX_iB'\big)^{-1}B'^TY_{s_i}=Y_i^TX_iBB'^{-1}(X_i^TX_i)^{-1}(B'^T)^{-1}B'^TX_i^TX_i\mathbf{1}\overset{B=B'}{=}Y_i^TX_i\mathbf{1}=\tau_i \quad (17)$$

for $i=1,\ldots,K$. The server affirms the absence of malicious activity by validating the fulfillment of the two conditions mentioned above. □

In the decryption procedure, client $i$ should use its encryption matrix $B_i$ to decrypt $\beta^{enc}$ (the output of Algorithm 3); we refer to this as the decryption criterion.

Theorem 3. The proposed verification algorithm can identify whether a client violates the decryption criterion.

Proof. In our scheme, both $\beta^{enc}$ and $\tilde{\beta}^{enc}$ need to be decrypted following the decryption procedures. Let $B'\beta^{enc}$ and $B''\tilde{\beta}^{enc}$ be the decrypted data of $\beta^{enc}$ and $\tilde{\beta}^{enc}$, respectively. Since

$$\tilde{\beta}^{enc}\equiv\alpha_1\beta^{enc}+\alpha_2\beta_s^{enc}, \quad (18)$$

we have

$$\tilde{\beta}\equiv B''\tilde{\beta}^{enc}=\alpha_1B''\beta^{enc}+\alpha_2B''\beta_s^{enc}. \quad (19)$$

Suppose client $i$ $(i=1,\ldots,K)$ follows the proposed decryption procedures to decrypt both $\beta^{enc}$ and $\tilde{\beta}^{enc}$; then $B'$ and $B''$ should be identical (i.e., $B'=B''=B$ where $B=\prod_{i=1}^K B_i$). So

$$\beta_s\equiv\frac{1}{\alpha_2}\big(\tilde{\beta}-\alpha_1B'\beta^{enc}\big)=\frac{1}{\alpha_2}\big(\alpha_1B''\beta^{enc}+\alpha_2B''\beta_s^{enc}-\alpha_1B'\beta^{enc}\big)\overset{B'=B''=B}{=}B\beta_s^{enc}. \quad (20)$$

According to Equation 14,

$$B\beta_s^{enc}=B(X_{enc}^TX_{enc})^{-1}Y_s^{enc}=\mathbf{1}. \quad (21)$$

Based on Equations 20 and 21, $\beta_s$ is expected to be a vector of ones if no malicious activity occurs in the decryption procedures. Therefore our verification stage can identify whether a client breaks the decryption criterion. □

VI. Security analysis

The encryption matrices $A_i$ and $B_i$ may be recovered if the ciphertexts of different plaintexts are distinguishable. Once an encryption matrix is recovered, a client can recover other clients' data. Potential attacks to achieve such goals include the CPA, KPA, COA and CCA [34]–[36], [40] described in Section III-C. It is impossible to perform a CCA against our scheme because adversaries cannot obtain the decryption of any ciphertext. Since the CPA is more threatening than the KPA and COA, a scheme that protects against the CPA is also resilient to the KPA and COA.

A CPA [37], [52]–[54] is realistic in our scheme. Consider a strong threat model in which all clients except one can be compromised in a collusion attack (Figure 5). Suppose client 1 is the only honest client. In the external encryption layer, client $i$ $(i\neq 1)$ generates fake data $X_i$ and sends it to client 1 for encryption. Client 1 uses $B_1$ to encrypt $X_i$ and returns the ciphertext to the other clients for further encryption. During this process, the colluding clients share the ciphertexts received from client 1 and are able to match each plaintext $X_i$ with its ciphertext $X_iB_1$. In a collusion attack, the colluding clients cooperate as a group to share plaintexts and ciphertexts with each other, so the colluding group is able to insert arbitrary plaintexts and obtain the corresponding ciphertexts for the purpose of a CPA. The colluding group will try to first recover the encryption matrix $B_1$ and then retrieve the plaintext $X_1$ owned by client 1.

Fig. 5: An example of the strong threat model in which all the clients except one can be compromised. Suppose client 1 is the only honest client and $B_1$ denotes its commutative matrix for data encryption. $X_i$ denotes the plaintext from client $i$ $(i=2,\ldots,K)$.

To be resilient to CPA, the encrypted data in our privacy-preserving model should have indistinguishability for any random plaintexts. In this section, we demonstrate that the ciphertexts of two arbitrary plaintexts are indistinguishable in our privacy-preserving scheme.

Define the encryption functions $f_{M,l}(X)\equiv MX$ and $f_{M,r}(X)\equiv XM$ where $M$ is a random Gaussian matrix, i.e., each element of $M$ follows a Gaussian distribution. Since $B_i=B_0\sum_{j=1}^q b_jB_0^{j-1}$, the internal encryption function $f(X_i)=A_iX_iB_i$ can be split into three sub-functions. Specifically, $f(X_i)=A_iX_iB_i=A_iX_iB_0\sum_{j=1}^q b_jB_0^{j-1}=f_r(f_{B_0,r}(f_{A_i,l}(X_i)))$ where $f_{A_i,l}(X)=A_iX$, $f_{B_0,r}(X)=XB_0$, and $f_r(X)=X\sum_{j=1}^q b_jB_0^{j-1}$.

The clients have access to the ciphertext $A_iX_iB$. In contrast, the server receives the encryption matrix $A_i$ from client $i$ $(i=1,\ldots,K)$ and derives the ciphertext $X_iB=A_i^{-1}(A_iX_iB)$. $A_i$ is only used in the internal encryption layer, and the function $f_{A_i,l}(X)=A_iX$ is employed to ensure that clients cannot conduct an effective CPA. We first prove that records in $A_iX_iB$ are indistinguishable to the clients (Section VI-A) and then demonstrate that records in $X_iB$ are indistinguishable to the server (Section VI-B).

A. Indistinguishability of $A_iX_iB$: security against clients

Theorem 4. Given the Gaussian matrix encryption function $f_{M,l}(X)=MX$, where $M$ is a random Gaussian matrix, the worst-case lower and upper bounds on $d_{TV}(p_1,p_2)$ are

$$d_{TV,low}=1-\left(\frac{2\|x_{c,1}\|\|x_{c,2}\|}{\|x_{c,1}\|^2+\|x_{c,2}\|^2}\right)^{q/2}, \quad (22)$$
$$d_{TV,up}=\sqrt{1-\left(\frac{2\|x_{c,1}\|\|x_{c,2}\|}{\|x_{c,1}\|^2+\|x_{c,2}\|^2}\right)^{q}}, \quad (23)$$

where $p_1=\Pr(z\mid x_{c,1})$, $p_2=\Pr(z\mid x_{c,2})$ and $x_{c,h}$ $(h=1,2)$ are two arbitrary columns in $X$.

Proof. Based on the proof of [[35], Lemma 1], the covariance matrix of $z_h=Mx_{c,h}$ conditioned on the plaintext $x_{c,h}$ is $C_h=\|x_{c,h}\|^2I$ where $\|x_{c,h}\|$ denotes the Euclidean norm of $x_{c,h}$ and $I$ is the identity matrix. Therefore, $C_3\equiv\frac{C_1+C_2}{2}=\frac{\|x_{c,1}\|^2+\|x_{c,2}\|^2}{2}I$. Because $C_1$, $C_2$ and $C_3$ are diagonal matrices, their determinants are

$$|C_h|=\|x_{c,h}\|^{2q},\quad h\in\{1,2\}, \quad (24)$$

and

$$|C_3|=\left(\frac{\|x_{c,1}\|^2+\|x_{c,2}\|^2}{2}\right)^{q}. \quad (25)$$

So

$$d_H(p_1,p_2)=\sqrt{1-\frac{|C_1|^{1/4}|C_2|^{1/4}}{|C_3|^{1/2}}}=\sqrt{1-\left(\frac{2\|x_{c,1}\|\|x_{c,2}\|}{\|x_{c,1}\|^2+\|x_{c,2}\|^2}\right)^{q/2}}. \quad (26)$$

According to the inequality relating the Hellinger distance and the TV distance (Equation 9), the lower and upper bounds of the TV distance follow as

$$d_{TV,low}=d_H^2(p_1,p_2)=1-\left(\frac{2\|x_{c,1}\|\|x_{c,2}\|}{\|x_{c,1}\|^2+\|x_{c,2}\|^2}\right)^{q/2},\quad d_{TV,up}=d_H(p_1,p_2)\sqrt{2-d_H^2(p_1,p_2)}=\sqrt{1-\left(\frac{2\|x_{c,1}\|\|x_{c,2}\|}{\|x_{c,1}\|^2+\|x_{c,2}\|^2}\right)^{q}}. \quad (27)$$

□

Theorem 5. Given the Gaussian matrix encryption function $f_{M,r}(X)=XM$, where $M$ is a random Gaussian matrix, the worst-case lower and upper bounds on $d_{TV}(p_1,p_2)$ are

$$d_{TV,low}=1-\left(\frac{2\|x_1\|\|x_2\|}{\|x_1\|^2+\|x_2\|^2}\right)^{q/2}, \quad (28)$$
$$d_{TV,up}=\sqrt{1-\left(\frac{2\|x_1\|\|x_2\|}{\|x_1\|^2+\|x_2\|^2}\right)^{q}}, \quad (29)$$

where $p_1=\Pr(z\mid x_1)$, $p_2=\Pr(z\mid x_2)$ and $x_h$ $(h=1,2)$ are two arbitrary rows in $X$.

Proof. The proof is identical to that of Theorem 4. □

Corollary 1. The success probability of an adversary in the indistinguishability experiment is bounded by

$$p_d\leq\frac{1}{2}+\frac{1}{2}\sqrt{1-\left(\frac{2\|x_1\|\|x_2\|}{\|x_1\|^2+\|x_2\|^2}\right)^{q}}. \quad (30)$$

If each plaintext has constant Euclidean norm (i.e., $\|x_1\|=\|x_2\|$ for two random records $x_1$ and $x_2$), the cryptosystem has indistinguishability since $p_d\leq 0.5$.

Corollary 1 ensures that no adversary can learn any partial information about the plaintext from a given ciphertext, as long as each plaintext has constant Euclidean norm. Because the Euclidean norms of all the columns in the plaintext $X_i$ are designed to be the same (Section V-A), any two arbitrary columns in $A_iX_i$ are indistinguishable.

Theorem 6. The TV distance does not increase under the encryption functions $f_{B_0,r}$ and $f_r$. In other words, $d_{TV,up}(\tilde{p}_1,\tilde{p}_2)\leq d_{TV,up}(p_1,p_2)$ where $\tilde{p}_h$ and $p_h$ denote the probability distributions of a random row in $A_iX_hB_i$ and $A_iX_h$, respectively $(h\in\{1,2\})$.

Proof. The Hellinger distance can be expressed as a function of the Rényi divergence [58], i.e.,

$$d_H(p_1,p_2)=\sqrt{1-e^{-\frac{1}{2}D_{1/2}(p_1\|p_2)}}, \quad (31)$$

where $D_{1/2}(p_1\|p_2)$ denotes the Rényi divergence of order $1/2$ between $p_1$ and $p_2$. Based on the data processing inequality [[58], Theorem 1], $D_{1/2}(\tilde{p}_1\|\tilde{p}_2)\leq D_{1/2}(p_1\|p_2)$, so $d_H(\tilde{p}_1,\tilde{p}_2)\leq d_H(p_1,p_2)$. Because $0\leq d_H(p_1,p_2)\leq 1$ and the function $u\sqrt{2-u^2}$ is monotonically increasing on $0\leq u\leq 1$,

$$d_{TV}(\tilde{p}_1,\tilde{p}_2)\leq d_H(\tilde{p}_1,\tilde{p}_2)\sqrt{2-d_H^2(\tilde{p}_1,\tilde{p}_2)}\leq d_H(p_1,p_2)\sqrt{2-d_H^2(p_1,p_2)}. \quad (32)$$

In other words, $d_{TV,up}(\tilde{p}_1,\tilde{p}_2)\leq d_{TV,up}(p_1,p_2)$. □

Corollary 2. Clients are unable to learn any partial information about $X_i$ in polynomial time within our privacy-preserving scheme, thereby ensuring that our scheme is resilient to the CPA conducted by colluding clients.

Proof. According to Theorems 4 and 6 and Corollary 1, the internal encryption function $f(X_i)=A_iX_iB_i$ is indistinguishable. Theorem 6 indicates that the TV distance does not increase under the external encryption layer, and thus $A_iX_iB$ is indistinguishable where $B=\prod_{i=1}^K B_i$. Therefore, clients cannot perform an effective CPA to learn the sensitive information of other clients. □

B. Indistinguishability of $X_iB$: security against the server

We further illustrate that any two arbitrary rows in $X_iB$ are indistinguishable. In the initial model training process (Algorithm 3), the server decrypts $A_i$ from $A_iX_iB$ and thus has access to $X_iB$. With the indistinguishability of the encryption function $f(X_i)=X_iB$, the server cannot learn any partial information about $X_i$ in polynomial time (Corollary 3).

Theorem 7. Given γ=x1x2(x1 and x2 are two arbitrary rows in Xi) and a negligible function ϵ(q),x1B and x2B are indistinguishable if γ satisfies γ+1γ21-4ϵ2(q)-1/q where q is the number of features.

Proof. The encryption function fXi=XiB=XiB0j=1qbjB0(j-1) can be split into 2 sub-functions, i.e., fB0,r(X)=XB0 and fr(X)=Xj=1qbjB0(j-1). According to Theorem 5 and the data processing inequality [[58], Theorem 1], the upper bound of the TV distance between Px1Bx1 and Px2Bx2 is dTV,up=1-2γγ2+1q. Indistinguishability requires that pd12+ϵ(q). To achieve this, we require that dTV,up2ϵ(q). So 1-4ϵ2(q)1/q2γγ2+1 for a given q. This leads to γ+1γ21-4ϵ2(q)-1/q. □

We propose a data transformation method, the "d-transformation", to ensure indistinguishability for arbitrary records, irrespective of whether they possess a consistent Euclidean norm. Assume $\|x_1\|<\|x_2\|$; then $\frac{2\gamma}{\gamma^2+1}$ is monotonically increasing on $0<\gamma<1$, so a decrease in $\gamma$ leads to an increase in $d_{TV,up}=\sqrt{1-\left(\frac{2\gamma}{\gamma^2+1}\right)^q}$. To keep $d_{TV,up}$ within an acceptable range, the Euclidean norms of any two arbitrary records need to be close to each other (i.e., $\gamma\approx 1$). To achieve this, we perform the d-transformation on each record. More precisely, we use a vector of the constant $d$ instead of a vector of ones as the intercept in $X_i$; the first element of $x_1$ and $x_2$ becomes $d$ instead of 1. As presented in Figure 6A, a large $d$ ensures that $d_{TV,up}$ is close to 0 for a fixed $q$. For fixed $d$, $d_{TV,up}$ increases as $q$ rises (Figure 6B).

Fig. 6: A: $d_{TV,up}$ given $q=100$ and different values of $d$. B: $d_{TV,up}$ given $d=2{,}000$ and different values of $q$. $\gamma=\frac{\|x_1\|}{\|x_2\|}$ and $0<\gamma<1$.

Theorem 8. The d-transformation ensures indistinguishability for any $q$ and $\gamma$.

Proof. As described in Theorem 7, our scheme achieves indistinguishability depending on $\gamma$. Since $\gamma+\frac{1}{\gamma}$ is monotonically decreasing on $0<\gamma<1$, there exists a minimum threshold $\underline{\gamma}$ such that $\gamma+\frac{1}{\gamma}\leq 2\big(1-4\epsilon^2(q)\big)^{-1/q}$ for any $\gamma>\underline{\gamma}$. Let $\gamma_{original}=\frac{\|x_1\|}{\|x_2\|}$ where $x_1$ and $x_2$ are two random records in the plaintext. Define $\gamma_{new}=\frac{\|\tilde{x}_1\|}{\|\tilde{x}_2\|}$ where $\tilde{x}_1$ and $\tilde{x}_2$ are the d-transformed $x_1$ and $x_2$, respectively. So we have $\gamma_{new}\geq\sqrt{\frac{\gamma_{original}^2+d^2}{1+d^2}}$. For $0<\gamma<1$, $\gamma_{new}$ increases as $d$ goes up. So there exists a constant $d$ such that $\gamma_{new}\approx 1$, implying that $\gamma_{new}+\frac{1}{\gamma_{new}}\leq 2\big(1-4\epsilon^2(q)\big)^{-1/q}$ for any given $q$ and $\gamma$. To conclude, the d-transformation guarantees indistinguishability for any $q$ and $\gamma$. □

Considering that $\epsilon(q)$ (e.g., $\epsilon(q)\equiv 2^{-q}$) can be large for small $q$, we set an upper bound for $\epsilon(q)$ (e.g., $\epsilon(q)=\min(10^{-5},2^{-q})$). As shown in Figure 6B, $d_{TV,up}$ increases as $q$ goes up. Given $\epsilon(q)=\min(10^{-5},2^{-q})$, $d=2{,}000$ is sufficient to ensure indistinguishability when $q\leq 20$ (Figure 6B). For a larger $q$, clients follow Algorithm 1 to select a constant $d$ such that the encryption function in our scheme achieves indistinguishability.
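Algorithm 1 is not reproduced here, but the selection logic implied by Theorem 7 can be sketched as follows; the doubling search and the default cap on $\epsilon(q)$ are our own assumptions.

```python
# Sketch of d-selection per Theorem 7: grow d until
# gamma + 1/gamma <= 2 (1 - 4 eps(q)^2)^(-1/q) holds for the two most
# dissimilar record norms, which guarantees d_TV,up <= 2 eps(q).
import numpy as np

def select_d(row_norms, q, eps=None, d0=1.0):
    eps = eps if eps is not None else min(1e-5, 2.0 ** (-q))
    bound = 2.0 * (1.0 - 4.0 * eps**2) ** (-1.0 / q)
    lo, hi = min(row_norms), max(row_norms)
    d = d0
    while True:
        gamma = np.sqrt((lo**2 + d**2) / (hi**2 + d**2))   # worst-case ratio
        if gamma + 1.0 / gamma <= bound:
            return d
        d *= 2.0   # enlarge d; gamma -> 1, so the loop terminates
```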

Corollary 3. With the d-transformation (Algorithm 1), the encryption function $f(X_i)=X_iB$ ensures indistinguishability among any arbitrary records in $X_i$. This demonstrates that the server cannot learn any partial information about $X_i$ in polynomial time.

Proof. Given any negligible function $\epsilon(q)$ and data $X_i$, Algorithm 1 selects a constant $d$ such that $\gamma+\frac{1}{\gamma}\leq 2\big(1-4\epsilon^2(q)\big)^{-1/q}$ where $\gamma=\min_{j,k}\sqrt{\frac{\|x_j\|^2+d^2}{\|x_k\|^2+d^2}}$ ($x_j$ and $x_k$ are two arbitrary records in $X_i$). With the d-transformation, any two arbitrary records in $X_i$ are indistinguishable (Theorem 8). □

The d-transformation (i.e., multiplication of the intercept by a constant $d$) does not alter the logistic model estimates, except for the intercept estimate. After multiplying the intercept by $d$, the dataset becomes $X_{new}=XI_D$, where $I_D$ represents a diagonal matrix with diagonal elements $\{d,1,1,\ldots,1\}$. Based on Equation 6, the model estimate for $X_{new}$ becomes $\beta_{new}=I_D^{-1}\beta$ where $I_D^{-1}$ is a diagonal matrix with diagonal elements $\{1/d,1,1,\ldots,1\}$. So only the intercept estimate is altered by multiplication with the constant $d$, while the estimates for the remaining features stay unchanged.

VII. Performance evaluation

We perform experiments using the MNIST dataset from the UCI repository [59]. The MNIST dataset consists of hand-written digit images, each comprising a 28 by 28 pixel grid. Each image is associated with an integer label ranging from 0 to 9. The dataset consists of 60,000 images for training and 10,000 images for testing. In our privacy-preserving learning, we assume samples in each dataset are evenly distributed among $K$ clients, with each subset encompassing all the features. All experiments are performed in Matlab on the University of Florida HiPerGator 3.0 high-performance computing cluster with 1 CPU and 40 GB of RAM.

Our privacy-preserving logistic model is applied to differentiate label 9 from the remaining labels (0 to 8). To evaluate model performance on large-scale datasets, we apply the bootstrap method [60] to create three datasets with sample sizes of n=100,000, n=500,000, and n=1,000,000, respectively. To minimize communication cost, we employ SZ for lossy compression, ensuring a relative error below 0.01 (i.e., the difference between the raw data and the compressed data is less than 0.01 × (maximum value − minimum value)). As shown in Table IV, our scheme with SZ compression achieves, after 20 iterations, an accuracy level equivalent to that of the non-secure model. The non-secure model is constructed using aggregated data from all clients without incorporating privacy protection. The utilization of SZ compression leads to a notable decrease in communication costs. In scenarios where fewer than 10 clients are involved in the privacy-preserving learning framework, our scheme with SZ compression incurs lower communication costs than the non-secure model (Figure 7A). Figure 7B shows that our scheme has high computational efficiency in analyzing large-scale datasets.

Table IV:

Model accuracy of the proposed privacy-preserving logistic model (relative error bound < 0.01 in the SZ compression)

Dataset Non-secure model Our scheme w/o SZ Our scheme with SZ (Iter=5) (Iter=10) (Iter=20) (Iter=50)
n=60,000 97.27% 97.27% 96.73% 97.22% 97.27% 97.27%
n=100,000 97.0% 97.0% 96.46% 96.97% 97.0% 97.0%
n=500,000 97.25% 97.25% 96.69% 97.18% 97.24% 97.25%
n=1,000,000 97.25% 97.25% 96.70% 97.20% 97.23% 97.25%

Non-secure model: the model built on the aggregated data from all the clients without considering privacy protection. w/o SZ: without SZ compression. Iter: iteration times. n: No. of samples in the aggregated data.

Fig. 7: A: communication cost of the proposed scheme. B: computation time of 20 iterations in the proposed scheme. w/o SZ: without SZ compression. n: No. of samples in the aggregated data. K: No. of clients.

We further compare the performance of our scheme with four state-of-the-art frameworks that provide malicious security. First, we evaluate the performance of our scheme for the binary classification problem and compare with two maliciously secure frameworks, SWIFT [12] and Fantastic4 [13]. Specifically, we construct the secure logistic model to distinguish between the digits 4 and 9 in the MNIST dataset (binary classification). A total of 11,791 samples are included in the training set, while the testing set comprises 1,991 samples. Moreover, we apply our scheme to solve the multiclass classification problem (Section V-B) and compare our model with state-of-the-art privacy-preserving neural networks, SecureNN [6] and Falcon [7]. The multiclass classification utilizes 60,000 images from the MNIST dataset for training and the remaining 10,000 images for testing.

Table V summarizes the results of our scheme and the other privacy-preserving frameworks. The computation and communication costs of our scheme grow with an increasing number of clients. For the comparison, we consider three scenarios, with the number of clients being 10, 20, or 50. Compared with SWIFT [12], our scheme is computationally faster for up to 50 clients and has competitive communication cost for up to 20 clients. In contrast, our scheme has improved communication efficiency but higher computation cost compared with Fantastic4 [13]. Our model has higher accuracy than both of these maliciously secure frameworks. Compared with their respective 3-server variants, the 4-server frameworks in [12], [13] perform better in both computation and communication. Given that the secure frameworks in [12], [13] rely on the honest-majority setting (where a malicious adversary can corrupt at most one server), the inclusion of an extra server imposes a more stringent security prerequisite for the successful execution of these frameworks. In contrast, our scheme conducts a secure logistic model that is resilient to malicious clients while utilizing only a single semi-honest server. For multiclass classification, our scheme has better computation and communication performance than the maliciously secure neural networks [6], [7], which assume an honest majority of semi-honest servers. The accuracy of our secure scheme is also comparable to these two neural network models with malicious security.

Table V:

Comparison between our model and other maliciously secure frameworks using MNIST dataset

Data Framework Comp. Comm. Accuracy
MNIST (4 vs. 9) SWIFT [12] (3PC) 12 mins 96.5 Mb
SWIFT [12] (4PC) 8.6 mins 44 Mb
Fantastic4 [13] (3PC) 8.5 s 2.8 Gb 96.5%
Fantastic4 [13] (4PC) 3 s 167 Mb 96.5%
Our (w/o SZ, K=10) 17.5 s 204 Mb 98.9%
Our (SZ, K=10) 17.7 s 29.2 Mb 98.9%
Our (SZ, K=20) 29.6 s 58.4 Mb 98.9%
Our (SZ, K=50) 65 s 146 Mb 98.9%
MNIST SecureNN [6] (3PC) 1.03 hrs 110 Gb 93.4%
Falcon [7] (3PC) 33.6 mins 88 Gb 97.4%
Our (w/o SZ, K=10) 17.6 mins 1.1 Gb 98.0%
Our (SZ, K=10) 17.8 mins 164 Mb 98.0%
Our (SZ, K=20) 18.5 mins 328 Mb 98.0%
Our (SZ, K=50) 20.7 mins 821 Mb 98.0%

The performance of binary classification is compared with that of SWIFT [12] and Fantastic4 [13], while the performance of multiclass classification is compared with that of SecureNN [6] and Falcon [7].

The performance statistics of SWIFT [12] are sourced from [13], whereas the statistics for the other three SMC frameworks are obtained from their respective publications.

Comp.: computation cost; Comm.: communication cost.

3PC: 3-party computation (i.e., 3 servers); 4PC: 4-party computation (i.e., 4 servers).

K: No. of clients participating in our privacy-preserving scheme.

w/o SZ: without SZ compression.

VIII. Conclusion

In this paper, we introduce a maliciously secure logistic model for horizontally distributed data, utilizing a novel matrix encryption technique. Unlike state-of-the-art secure frameworks that require the participation of two or more servers in an honest-majority setting, our scheme utilizes only a single semi-honest server. Our scheme ensures that any two arbitrary records are indistinguishable through the d-transformation. A verification stage can detect any deviations from the proposed scheme among malicious data providers. Lossy compression is employed to minimize the communication cost while ensuring negligible degradation in accuracy. Compared with other maliciously secure models, our scheme has higher computational and communication efficiency. One prospective avenue for future research involves extending our secure scheme to other nonlinear models, such as support vector machines and neural networks.

Acknowledgments

This work was supported by the National Institutes of Health [R01 LM014027, U24 AA029959-01]. The authors would like to thank anonymous reviewers for many helpful comments.

Appendix A. Commutative matrix

Matrices $B_1$ and $B_2$ are commutative if $B_1B_2=B_2B_1$. To ensure negligible degradation in accuracy, the proposed privacy-preserving scheme generates commutative matrices to encrypt the plaintexts. The commutative encryption matrix is constructed as a matrix polynomial (i.e., a polynomial with a matrix as the variable) [61]. For instance, assume there are 2 clients in the collaborative learning. The clients first share a common encryption key $B_0$ (a random nonsingular matrix). Client 1 then generates a vector of random coefficients $(b_{11},b_{12},\ldots,b_{1q})$ and an encryption matrix $B_1=\sum_{j=1}^q b_{1j}B_0^j=b_{11}B_0+b_{12}B_0^2+b_{13}B_0^3+\cdots+b_{1q}B_0^q$. Similarly, client 2 generates a vector of random coefficients $(b_{21},b_{22},\ldots,b_{2q})$ and an encryption matrix $B_2=\sum_{j=1}^q b_{2j}B_0^j=b_{21}B_0+b_{22}B_0^2+b_{23}B_0^3+\cdots+b_{2q}B_0^q$. $B_1$ and $B_2$ are commutative (i.e., $B_1B_2=B_2B_1$) because both are matrix polynomials in the common matrix $B_0$.
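A quick numeric check of this construction (sizes and seed are arbitrary choices of ours):

```python
# Two matrix polynomials in the same B0 commute: B1 B2 = B2 B1.
import numpy as np

rng = np.random.default_rng(2)
q = 4
B0 = rng.standard_normal((q, q))
powers = [np.linalg.matrix_power(B0, j) for j in range(1, q + 1)]
B1 = sum(b * P for b, P in zip(rng.standard_normal(q), powers))
B2 = sum(b * P for b, P in zip(rng.standard_normal(q), powers))
assert np.allclose(B1 @ B2, B2 @ B1)
```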

Appendix B. Pre-processing and internal encryption for data with large sample size

To enhance the computational efficiency of encryption for clients with large sample sizes, we construct the encryption matrix $A_i$ as a block-diagonal matrix. Specifically, client $i$ generates $A_i$ as

$$A_i = \begin{pmatrix} A_{0i} & & & \\ & A_{0i} & & \\ & & \ddots & \\ & & & A_{0i} \end{pmatrix} \tag{33}$$

where $A_{0i}$ is a $100 \times 100$ random Gaussian matrix. $X_i$ is also partitioned into sub-matrices. Following the pre-processing procedure, a pseudo record is added to each sub-matrix to ensure indistinguishability. Specifically, $X_i$ (assuming that a vector of 1s is already included as the first row) is partitioned into sub-matrices $X_{il}$, $l = 1, 2, \ldots$, with each sub-matrix containing 99 samples and all the features. Before the sub-matrices are encrypted by $A_{0i}$, client $i$ computes the Euclidean norm of each column in $X_{il}$. Let $c_{lj}$ be the Euclidean norm of the $j$-th column ($j = 1, \ldots, q$) in $X_{il}$ and $c_{lm} = \max_j c_{lj}$. Client $i$ generates a vector $c_{lv}$ whose $j$-th element is $\sqrt{c_{lm}^2 - c_{lj}^2}$, and appends $c_{lv}$ to $X_{il}$ as the last row. Subsequently, the $l$-th sub-matrix consists of 100 samples, and the Euclidean norm of every column equals $c_{lm}$. As the total number of samples in $X_i$ may not be divisible by 99, client $i$ generates a random set of pseudo records to be vertically concatenated with the original matrix so that each sub-matrix contains 99 samples. For example, consider a dataset $X_i$ containing 1,070 samples. Client $i$ generates 19 pseudo records, which are vertically concatenated with $X_i$; the result can then be split into 11 sub-matrices $X_{il}$ ($l = 1, \ldots, 11$). Client $i$ first appends the pseudo record $c_{lv}$ to each $X_{il}$ and then encrypts each $X_{il}$ with $A_{0i}$. After data encryption, client $i$ sends $A_i$ and the row indices of the pseudo records to the server.
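The padding and norm-equalization steps can be summarized in a short numpy sketch (ours, under the assumptions above: rows are samples, columns are features, and the pseudo records are drawn from a standard normal; the function name pad_and_equalize is hypothetical):

```python
import numpy as np

def pad_and_equalize(X: np.ndarray, block: int = 99, rng=None) -> list:
    """Pad X with random pseudo records until its row count is a multiple
    of `block`, split it into sub-matrices of `block` rows, and append to
    each a pseudo record that raises every column's Euclidean norm to the
    sub-matrix's maximum column norm."""
    if rng is None:
        rng = np.random.default_rng()
    n, p = X.shape
    n_pad = (-n) % block                       # e.g., 19 for n = 1070
    Xp = np.vstack([X, rng.standard_normal((n_pad, p))])
    subs = []
    for Xl in Xp.reshape(-1, block, p):        # consecutive 99-row blocks
        norms = np.linalg.norm(Xl, axis=0)     # per-column Euclidean norms
        extra = np.sqrt(norms.max() ** 2 - norms ** 2)
        subs.append(np.vstack([Xl, extra]))    # 100 rows, equal column norms
    return subs
```

Each returned 100-row sub-matrix can then be encrypted by left-multiplying it with the $100 \times 100$ block $A_{0i}$.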

Appendix C. Logistic model estimate using encrypted and original data

Theorem 9. Data matrices and model estimates of the secure and non-secure computations have the following properties.

  1. $p(x_i^{enc}; \beta^{enc}) = p(x_i; \beta)$;

  2. $W^{enc} = W$;

  3. $\beta^{enc} = B^{-1}\beta$, where $B = \prod_{i=1}^{K} B_i$.

Proof. Let

$$P_1 \triangleq \operatorname{diag}\big(p(x_1;\beta),\, p(x_2;\beta),\, \ldots,\, p(x_n;\beta)\big) \tag{34}$$

and

$$P_2 \triangleq \operatorname{diag}\big(1-p(x_1;\beta),\, 1-p(x_2;\beta),\, \ldots,\, 1-p(x_n;\beta)\big). \tag{35}$$

$P_1$ and $P_2$ are diagonal matrices. Let

$$P_1^{enc} \triangleq P_1\big(X^{enc}, \beta^{enc}\big) \quad \text{and} \quad P_2^{enc} \triangleq P_2\big(X^{enc}, \beta^{enc}\big) \tag{36}$$

be the corresponding matrices in the privacy-preserving model.

In iteration $m = 0$ (initial setup), set the starting point as $\beta_{(0)}^{enc}$ for the privacy-preserving model and $\beta_{(0)} \triangleq B\beta_{(0)}^{enc}$ for the non-secure model. Because $x_i^{enc} = B^T x_i$, we have

$$p\big(x_i^{enc}; \beta_{(0)}^{enc}\big) = \Pr\big(y_i = 1 \mid x_i^{enc}; \beta_{(0)}^{enc}\big) = \frac{\exp\big(x_i^{enc\,T}\beta_{(0)}^{enc}\big)}{1+\exp\big(x_i^{enc\,T}\beta_{(0)}^{enc}\big)} = \frac{\exp\big(x_i^T B\beta_{(0)}^{enc}\big)}{1+\exp\big(x_i^T B\beta_{(0)}^{enc}\big)} = p\big(x_i; B\beta_{(0)}^{enc}\big) = p\big(x_i; \beta_{(0)}\big). \tag{37}$$

According to Equations 34, 35 and 37, we have $P_1^{enc} = P_1$ and $P_2^{enc} = P_2$. $W$ (Equation 4) can be expressed as $W = P_1 P_2$. Thus we have

$$W^{enc} = (P_1 P_2)^{enc} = P_1^{enc} P_2^{enc} = P_1 P_2 = W. \tag{38}$$

Therefore properties 1–3 hold in the initial step.

Next we prove that properties 1–3 hold in the $(m+1)$-th iteration assuming that they hold in the $m$-th iteration. Let notations with superscript or subscript $(m)$ denote parameters derived during the $m$-th iteration. Given $\beta_{(m)}$, $p(x_i; \beta)$ is updated as

$$p\big(x_i; \beta_{(m)}\big)_{(m+1)} = \frac{\exp\big(x_i^T\beta_{(m)}\big)}{1+\exp\big(x_i^T\beta_{(m)}\big)}. \tag{39}$$

Assuming that properties 1–3 hold in the $m$-th iteration, we have $\beta_{(m)}^{enc} = B^{-1}\beta_{(m)}$. Since $x_i^{enc} = B^T x_i$, we have

$$p\big(x_i^{enc}; \beta_{(m)}^{enc}\big)_{(m+1)} = \frac{\exp\big(x_i^{enc\,T}\beta_{(m)}^{enc}\big)}{1+\exp\big(x_i^{enc\,T}\beta_{(m)}^{enc}\big)} = \frac{\exp\big(x_i^T B B^{-1}\beta_{(m)}\big)}{1+\exp\big(x_i^T B B^{-1}\beta_{(m)}\big)} = \frac{\exp\big(x_i^T\beta_{(m)}\big)}{1+\exp\big(x_i^T\beta_{(m)}\big)} = p\big(x_i; \beta_{(m)}\big)_{(m+1)}. \tag{40}$$

Similar to the proof for iteration $m = 0$, we have $W^{enc} = W$, so properties 1–2 hold in the $(m+1)$-th iteration. Moreover, based on Equation 11, we have

$$\beta_{(m+1)}^{enc} = \big(X^{enc\,T} W_{(m)}^{enc} X^{enc} + \Lambda B^T B\big)^{-1} \big(X^{enc\,T} W_{(m)}^{enc} X^{enc} \beta_{(m)}^{enc} + Z^{enc\,T} - X^{enc\,T} p_{(m)}^{enc}\big) = \big(B^T X^T W_{(m)} X B + \Lambda B^T B\big)^{-1} \big(B^T X^T W_{(m)} X B B^{-1}\beta_{(m)} + B^T Z^T - B^T X^T p_{(m)}\big) = B^{-1}\beta_{(m+1)}. \tag{41}$$

To conclude, properties 1–3 hold for all iterations. □
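The invariances in Theorem 9 are easy to check numerically. The following minimal numpy sketch (ours; variable names are illustrative) verifies property 1 for a random composite key $B$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.standard_normal((n, p))        # rows are records x_i^T
B = rng.standard_normal((p, p))        # composite key (nonsingular w.h.p.)
beta = rng.standard_normal(p)

X_enc = X @ B                          # x_i_enc = B^T x_i, applied row-wise
beta_enc = np.linalg.solve(B, beta)    # beta_enc = B^{-1} beta

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
# Property 1: encrypted and plaintext predicted probabilities coincide.
assert np.allclose(sigmoid(X @ beta), sigmoid(X_enc @ beta_enc))
```

Since $X^{enc}\beta^{enc} = XBB^{-1}\beta = X\beta$, the two probability vectors agree up to floating-point error.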

References

  • [1]. Zhang Y, Yu R, Nekovee M, Liu Y, Xie S, and Gjessing S, "Cognitive machine-to-machine communications: visions and potentials for the smart grid," IEEE Network, vol. 26, no. 3, pp. 6–13, 2012.
  • [2]. Zhao C, Zhao S, Zhao M, Chen Z, Gao C-Z, Li H, and Tan Y-a, "Secure multi-party computation: Theory, practice and applications," Information Sciences, vol. 476, pp. 357–372, 2019.
  • [3]. Li Q, Wen Z, Wu Z, Hu S, Wang N, Li Y, Liu X, and He B, "A survey on federated learning systems: Vision, hype and reality for data privacy and protection," IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 4, pp. 3347–3366, 2023.
  • [4]. Lindell Y, "Secure multiparty computation," Commun. ACM, vol. 64, no. 1, pp. 86–96, 2020.
  • [5]. Thapa C and Camtepe S, "Precision health data: Requirements, challenges and existing techniques for data security and privacy," Computers in Biology and Medicine, vol. 129, p. 104130, 2021.
  • [6]. Wagh S, Gupta D, and Chandran N, "SecureNN: 3-party secure computation for neural network training," Proc. Priv. Enhancing Technol., vol. 2019, no. 3, pp. 26–49, 2019.
  • [7]. Wagh S, Tople S, Benhamouda F, Kushilevitz E, Mittal P, and Rabin T, "Falcon: Honest-majority maliciously secure framework for private deep learning," Proceedings on Privacy Enhancing Technologies, vol. 2021, pp. 188–208, January 2021.
  • [8]. Gordon SD, Ranellucci S, and Wang X, "Secure computation with low communication from cross-checking," in Advances in Cryptology - ASIACRYPT 2018: 24th International Conference on the Theory and Application of Cryptology and Information Security, Brisbane, QLD, Australia, December 2–6, 2018, Proceedings, Part III. Springer-Verlag, 2018, pp. 59–85.
  • [9]. Patra A and Suresh A, "BLAZE: Blazing fast privacy-preserving machine learning," in 27th Annual Network and Distributed System Security Symposium, NDSS 2020, San Diego, California, USA, February 23–26, 2020. The Internet Society, 2020.
  • [10]. Mohassel P and Rindal P, "ABY3: A mixed protocol framework for machine learning," in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery, 2018, pp. 35–52.
  • [11]. Lehmkuhl R, Mishra P, Srinivasan A, and Popa RA, "Muse: Secure inference resilient to malicious clients," in 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, Aug. 2021, pp. 2201–2218.
  • [12]. Koti N, Pancholi M, Patra A, and Suresh A, "SWIFT: Super-fast and robust privacy-preserving machine learning," in 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 2021, pp. 2651–2668.
  • [13]. Dalskov A, Escudero D, and Keller M, "Fantastic four: Honest-majority four-party secure computation with malicious security," in 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 2021, pp. 2183–2200.
  • [14]. Byali M, Chaudhari H, Patra A, and Suresh A, "FLASH: Fast and robust framework for privacy-preserving machine learning," Proceedings on Privacy Enhancing Technologies, vol. 2020, pp. 459–480, April 2020.
  • [15]. Wang X, Ranellucci S, and Katz J, "Authenticated garbling and efficient maliciously secure two-party computation," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '17. Association for Computing Machinery, 2017, pp. 21–37.
  • [16]. Wang X, Ranellucci S, and Katz J, "Global-scale secure multiparty computation," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '17. Association for Computing Machinery, 2017, pp. 39–56.
  • [17]. Zhu R, Cassel D, Sabry A, and Huang Y, "NANOPI: Extreme-scale actively-secure multi-party computation," in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '18. Association for Computing Machinery, 2018, pp. 862–879.
  • [18]. Zheng W, Popa RA, Gonzalez JE, and Stoica I, "Helen: Maliciously secure coopetitive learning for linear models," in 2019 IEEE Symposium on Security and Privacy (SP), 2019, pp. 724–738.
  • [19]. Fan Y, Bai J, Lei X, Zhang Y, Zhang B, Li K-C, and Tan G, "Privacy preserving based logistic regression on big data," Journal of Network and Computer Applications, vol. 171, p. 102769, 2020.
  • [20]. Mohassel P and Zhang Y, "SecureML: A system for scalable privacy-preserving machine learning," in 2017 IEEE Symposium on Security and Privacy (SP). Los Alamitos, CA, USA: IEEE Computer Society, 2017, pp. 19–38.
  • [21]. Patra A, Schneider T, Suresh A, and Yalame H, "ABY2.0: Improved mixed-protocol secure two-party computation," in 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 2165–2182.
  • [22]. Rivest RL, Adleman L, and Dertouzos ML, "On data banks and privacy homomorphisms," Foundations of Secure Computation, vol. 4, no. 11, pp. 169–180, 1978.
  • [23]. Acar A, Aksu H, Uluagac AS, and Conti M, "A survey on homomorphic encryption schemes: Theory and implementation," ACM Comput. Surv., vol. 51, no. 4, pp. 1–35, 2018.
  • [24]. Yao AC-C, "How to generate and exchange secrets," in 27th Annual Symposium on Foundations of Computer Science (SFCS 1986). IEEE, 1986, pp. 162–167.
  • [25]. Goldreich O, Micali S, and Wigderson A, "How to play ANY mental game," in Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing. Association for Computing Machinery, 1987, pp. 218–229.
  • [26]. Dwork C, "Differential privacy," in Proceedings of the 33rd International Conference on Automata, Languages and Programming - Volume Part II. Springer-Verlag, 2006, pp. 1–12.
  • [27]. Zhu T, Ye D, Wang W, Zhou W, and Yu PS, "More than privacy: Applying differential privacy in key areas of artificial intelligence," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 6, pp. 2824–2843, 2022.
  • [28]. Zhao L, Wang Q, Zou Q, Zhang Y, and Chen Y, "Privacy-preserving collaborative deep learning with unreliable participants," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1486–1500, 2020.
  • [29]. Phan N, Vu MN, Liu Y, Jin R, Dou D, Wu X, and Thai MT, "Heterogeneous Gaussian mechanism: Preserving differential privacy in deep learning with provable robustness," in Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2019, pp. 4753–4759.
  • [30]. Li W, Milletarì F, Xu D, Rieke N, Hancox J, Zhu W, Baust M, Cheng Y, Ourselin S, Cardoso MJ, et al., "Privacy-preserving federated brain tumour segmentation," in Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings 10. Springer, 2019, pp. 133–141.
  • [31]. Wei K, Li J, Ding M, Ma C, Yang HH, Farokhi F, Jin S, Quek TQS, and Poor HV, "Federated learning with differential privacy: Algorithms and performance analysis," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 3454–3469, 2020.
  • [32]. Truex S, Baracaldo N, Anwar A, Steinke T, Ludwig H, Zhang R, and Zhou Y, "A hybrid approach to privacy-preserving federated learning," in Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, ser. AISec'19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 1–11.
  • [33]. Jayaraman B and Evans D, "Evaluating differentially private machine learning in practice," in 28th USENIX Security Symposium (USENIX Security 19). Santa Clara, CA: USENIX Association, 2019, pp. 1895–1912.
  • [34]. Bianchi T, Bioglio V, and Magli E, "Analysis of one-time random projections for privacy preserving compressed sensing," IEEE Transactions on Information Forensics and Security, vol. 11, no. 2, pp. 313–327, 2016.
  • [35]. Yu NY, "Indistinguishability and energy sensitivity of Gaussian and Bernoulli compressed encryption," IEEE Transactions on Information Forensics and Security, vol. 13, no. 7, pp. 1722–1735, 2018.
  • [36]. Cho W and Yu NY, "Secure and efficient compressed sensing-based encryption with sparse matrices," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1999–2011, 2020.
  • [37]. Kuldeep G and Zhang Q, "Design prototype and security analysis of a lightweight joint compression and encryption scheme for resource-constrained IoT devices," IEEE Internet of Things Journal, vol. 9, no. 1, pp. 165–181, 2022.
  • [38]. Hastie T, Tibshirani R, and Friedman JH, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009, vol. 2.
  • [39]. Maalouf M, "Logistic regression in data analysis: An overview," Int. J. Data Anal. Tech. Strateg., vol. 3, no. 3, pp. 281–299, 2011.
  • [40]. Katz J and Lindell Y, Introduction to Modern Cryptography, 2nd ed. Chapman & Hall/CRC, 2014.
  • [41]. He X, Machanavajjhala A, Flynn C, and Srivastava D, "Composing differential privacy and secure computation: A case study on scaling private record linkage," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '17. Association for Computing Machinery, 2017, pp. 1389–1406.
  • [42]. Wang W, Ying L, and Zhang J, "On the relation between identifiability, differential privacy, and mutual-information privacy," IEEE Transactions on Information Theory, vol. 62, no. 9, pp. 5018–5029, 2016.
  • [43]. Bellare M, Hoang VT, and Rogaway P, "Foundations of garbled circuits," in Proceedings of the 2012 ACM Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery, 2012, pp. 784–796.
  • [44]. Liu C, Hu X, Chen X, Wei J, and Liu W, "SDIM: A subtly designed invertible matrix for enhanced privacy-preserving outsourcing matrix multiplication and related tasks," IEEE Transactions on Dependable and Secure Computing, pp. 1–18, 2023.
  • [45]. Canetti R, "Universally composable security," J. ACM, vol. 67, no. 5, 2020.
  • [46]. Gibbs AL and Su FE, "On choosing and bounding probability metrics," International Statistical Review / Revue Internationale de Statistique, vol. 70, no. 3, pp. 419–435, 2002.
  • [47]. Le Cam L, Asymptotic Methods in Statistical Decision Theory. Springer, New York, NY, 1986.
  • [48]. DasGupta A, Asymptotic Theory of Statistics and Probability. Springer, New York, NY, 2008.
  • [49]. Guntuboyina A, Saha S, and Schiebinger G, "Sharp inequalities for f-divergences," IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 104–121, 2014.
  • [50]. Kailath T, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52–60, 1967.
  • [51]. Abou-Moustafa KT and Ferrie FP, "A note on metric properties for some divergence measures: The Gaussian case," in Proceedings of the Asian Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 25. Singapore Management University, Singapore: PMLR, 2012, pp. 1–15.
  • [52]. Sun X, Tian C, Hu C, Tian W, Zhang H, and Yu J, "Privacy-preserving and verifiable SRC-based face recognition with cloud/edge server assistance," Computers & Security, vol. 118, p. 102740, 2022.
  • [53]. Liu C, Hu X, Zhang Q, Wei J, and Liu W, "An efficient biometric identification in cloud computing with enhanced privacy security," IEEE Access, vol. 7, pp. 105363–105375, 2019.
  • [54]. Jasmine RM and Jasper J, "A privacy preserving based multi-biometric system for secure identification in cloud environment," Neural Processing Letters, vol. 54, no. 1, pp. 303–325, 2022.
  • [55]. Di S and Cappello F, "Fast error-bounded lossy HPC data compression with SZ," in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016, pp. 730–739.
  • [56]. Cappello F, Di S, Li S, Liang X, Gok AM, Tao D, Yoon CH, Wu X-C, Alexeev Y, and Chong FT, "Use cases of lossy compression for floating-point data in scientific data sets," The International Journal of High Performance Computing Applications, vol. 33, no. 6, pp. 1201–1220, 2019.
  • [57]. Zhao K, Di S, Lian X, Li S, Tao D, Bessac J, Chen Z, and Cappello F, "SDRBench: Scientific data reduction benchmark for lossy compressors," in 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 2716–2724.
  • [58]. van Erven T and Harremos P, "Rényi divergence and Kullback-Leibler divergence," IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
  • [59]. Dua D and Graff C, "UCI Machine Learning Repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  • [60]. Kulesa A, Krzywinski M, Blainey P, and Altman N, "Sampling distributions and the bootstrap," Nature Methods, vol. 12, pp. 477–478, 2015.
  • [61]. Hou S, Uehara T, Yiu S, Hui LC, and Chow K, "Privacy preserving confidential forensic investigation for shared or remote servers," in 2011 Seventh International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2011, pp. 378–383.
