Published in final edited form as: J Priv Confid. 2020 Jun;10(2). doi: 10.29012/jpc.674

ON THE PRIVACY AND UTILITY PROPERTIES OF TRIPLE MATRIX-MASKING

A ADAM DING 1, GUANHONG MIAO 2, SAMUEL S WU 3

Abstract

Privacy protection is an important requirement in many statistical studies. A recently proposed data collection method, triple matrix-masking, retains exact summary statistics without exposing the raw data at any point in the process. In this paper, we provide theoretical formulation and proofs showing that a modified version of the procedure is strong collection obfuscating: no party in the data collection process is able to gain knowledge of the individual level data, even with some partially masked data information in addition to the publicly published data. This provides a theoretical foundation for the usage of such a procedure to collect masked data that allows exact statistical inference for linear models, while preserving a well-defined notion of privacy protection for each individual participant in the study. This paper fits into a line of work tackling the problem of how to create useful synthetic data without having a trustworthy data aggregator. We achieve this by splitting the trust between two parties, the “masking service provider” and the “data collector.”

Keywords: Data masking, privacy protection, matrix masking

1. Introduction

In the digital age, vast amounts of data become available for research. At the same time, there is increasing pressure to protect the privacy of study subjects when their data are used. For medical research, the Health Insurance Portability and Accountability Act of 1996 and subsequent rulings have imposed legal requirements for privacy protection on the collection and handling of health data. Among other things, basic privacy protection measures include the removal of all personal identifiers when releasing data for use. However, simply removing the personal identifier variables does not prevent possible identification of the individual from other variables. To prevent the identification of an individual record, researchers have shown that released data should be aggregated to satisfy privacy conditions such as k-anonymity [Sweeney, 2002], l-diversity [Machanavajjhala et al., 2007] and t-closeness [Li et al., 2007].

However, releasing data only at aggregated levels severely restricts its usefulness in many research studies. Alternatively, methods have been designed to release obfuscated micro-data that allow the usual statistical analyses while preserving privacy at the individual level. Some examples of such obfuscated micro-data publishing are: noise addition [Brand, 2002], multiple imputation [Rubin, 1993, Drechsler and Reiter, 2010], information preserving statistical obfuscation [Burridge, 2003], random projection based perturbation [Liu et al., 2006], and random orthogonal matrix masking [Ting et al., 2008]. In particular, in the random orthogonal matrix masking scheme, a masked data set AX is published, where X denotes the data matrix of real responses and A is a random orthogonal matrix. The published data AX keeps the exact values of the sufficient statistics of linear models, thus allowing exact statistical inference for many standard data analysis methods [Ting et al., 2008, Wu et al., 2017b] while protecting privacy by denying the user direct access to the raw data X. While the above methods all protect the privacy of individual entries by publishing only randomly perturbed micro-data, the privacy protection can be lost when multiple micro-data sets from multiple inquiries to the same database are combined. Differential privacy was proposed to quantify the effectiveness of privacy protection of random noise addition/perturbation schemes [Dwork et al., 2006, Dwork, 2006, Dwork and Naor, 2008] against multiple inquiries to the database. The noise level can then be adjusted to achieve a quantified tradeoff between inference efficacy and privacy preservation (measured by the differential privacy metric).

Traditionally, a trustworthy data collector/manager collects the raw data and ensures privacy protection by releasing data sets with random perturbations. Such procedures, however, do not protect against attacks where an unscrupulous party gains unauthorized access to the raw data set X kept by these centers. Such security breaches are becoming more common, as shown by recent well-publicized incidents involving hacking of databases at major retailers, banks and credit bureaus [Huffington Post, 2011, Reuters, 2015, 2017].

This paper fits into a line of work tackling the problem of how to create useful synthetic data without having a trustworthy data aggregator, and provides a theoretical study of the triple matrix-masking (TM2) procedure [Wu et al., 2017b], which does not assume such a trustworthy data collector/manager. The TM2 procedure is a multi-party collection and masking system that aims to collect and publish the random orthogonal masked data set AX. We prove that, assuming no collusion between parties, no party learns more than the orbit of the data matrix under the action of the orthogonal group. More specifically, given the view of a particular party, let S be the set of data matrices that could possibly have resulted in that view. We show that S contains the full orbit of the data matrix and that, given any prior on the data matrix, the party's posterior is simply the prior restricted to S. We call data collection procedures with this property strong collection obfuscating, since any extra information beyond AX available to a party does not help in further identifying the individual level data.

In the differential privacy literature, the issue of an untrustworthy data collector can be dealt with using local differential privacy procedures [Kasiviswanathan et al., 2011], where noise is added to the individual data before it is passed to the data collector. The resulting synthetic data from differential privacy procedures, however, do not preserve exact statistics and hence require special inference procedures designed to achieve optimal statistical inference [Duchi et al., 2017]. Our TM2 procedure provides an alternative where the published masked data exactly preserve any statistic of the data that is invariant under the action of the orthogonal group. This provides a useful utility: exact statistical inference for linear models is preserved, so standard linear statistical inference procedures can be applied directly to the synthetic data resulting from the TM2 procedure. On the other hand, the TM2 procedure is only for a one-shot collection of each individual's data. When the individual data providers are sampled in multiple independent collections by different data collectors, differential privacy procedures can measure and limit the privacy leakage for the composition of the multiple collections. The TM2 procedure does not consider the privacy leakage for the composition of multiple collections.

Section 2 describes the TM2 procedure and two new modifications that make it strong obfuscating. The theoretical analysis is provided in Section 3. Section 4 provides a summary and a more detailed discussion of the relationship of the TM2 procedure to differential privacy and multi-party computation methods.

2. The Masked Data Collection Procedure TM2 and Its Modification

The privacy-preserving data collection scheme TM2 was proposed first in Wu et al. [2017b] and later expanded by Wu et al. [2017a]. We describe our modified basic version of the TM2 method here; a numerical sketch of the four steps follows the list:

Step 1. The data collectors plan the data collection, create the database structure, and program the data collection system. They randomly generate a p × p random orthogonal matrix B, which is distributed to the participants' data collection devices.

Step 2. Each participant’s data x1 (a vector of dimension p1) is collected and merged with Gaussian noise x2 (of dimension p2) into a vector x = (x1, x2) of dimension p = p1 + p2. Then x is right multiplied by B on the participant’s device, and only the resulting masked data xB leaves the device and is sent to the masking service provider.

Step 3. The masking service provider generates another n × n random orthogonal matrix A2. After receiving data from all participants, it combines the individual data xB into an n × p matrix XB, left-multiplies it by A2, and sends the doubly masked data A2XB to the data collectors.

Step 4. The data collectors multiply A2XB by B−1 to get back A2X and take the first p1 columns to get A2X1. Then the data collectors generate another n × n random orthogonal matrix A1, left multiply it to A2X1, and publish AX1 (where A = A1A2) which is accessible by all data users.
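To make the data flow concrete, here is a minimal numerical sketch of the four steps using NumPy. The dimensions, the QR-based generator of random orthogonal matrices, and all variable names are illustrative assumptions, not part of the published procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_orthogonal(k, rng):
    """Draw a k x k orthogonal matrix approximately uniformly (Haar) via QR."""
    q, r = np.linalg.qr(rng.standard_normal((k, k)))
    return q * np.sign(np.diag(r))  # fix column signs so the draw is uniform

n, p1, p2 = 50, 3, 60           # assumed sizes with p1 < n <= p = p1 + p2
p = p1 + p2
sigma = 10.0                    # assumed noise scale for the padding columns

# Step 1: data collectors generate the right mask B and distribute it.
B = haar_orthogonal(p, rng)

# Step 2: each participant pads x1 with Gaussian noise x2 and sends x @ B.
X1 = rng.integers(0, 2, size=(n, p1)).astype(float)   # toy sensitive data
X2 = sigma * rng.standard_normal((n, p2))
XB = np.hstack([X1, X2]) @ B                           # only XB leaves the devices

# Step 3: the masking service provider left-multiplies by A2.
A2 = haar_orthogonal(n, rng)
A2XB = A2 @ XB

# Step 4: data collectors remove B, drop the noise columns, apply A1, publish.
A2X = A2XB @ B.T                # B is orthogonal, so B^{-1} = B^T
A2X1 = A2X[:, :p1]
A1 = haar_orthogonal(n, rng)
AX1 = A1 @ A2X1                 # published masked data, A = A1 A2

# Utility check: the published data preserves X1^T X1 exactly (up to rounding).
assert np.allclose(AX1.T @ AX1, X1.T @ X1)
```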

A detailed theoretical analysis of the privacy guarantee of the TM2 method has been missing. This paper fills that gap by proving theoretically that this modified version of the TM2 method is strong obfuscating. We prove the strong obfuscating guarantee by showing that: (A) the extra information that any party in the process owns will not allow that party to narrow the data domain (the set of possible data values) enough to identify individual level data; and (B) there is no statistical information leakage beyond the domain restriction considered in (A).

Compared to the original TM2 scheme in Wu et al. [2017b], we make two modifications to the TM2 procedure. For the first modification, we add random Gaussian noise in Step 2. The data collector wants to collect p1 variables on n individuals, so the real response matrix X1 has dimensions n × p1. We ask each participant to generate p2 pure Gaussian noise variables on his/her device according to a fixed variance parameter σ2. Hence, the full data matrix is X = (X1, X2). For the privacy protections proved in later sections, we require that $p_1 < n \le p = p_1 + p_2$. In Step 2 of this modified procedure, the Gaussian noise x2 is mingled with the real response x1 to provide protection in addition to the random mask B. In Step 4, after the collectors get back A2X = (A2X1, A2X2), they separate the matrix and discard the noise columns. Therefore the published data set AX1 with A = A1A2 still gives the exact summary statistics, as it is only masked by A and does not contain the added noise.

For the second modification, instead of using a random invertible matrix for the right mask B as originally proposed by Wu et al. [2017b], we use a random orthogonal matrix for the right mask B. As we will see in the privacy analysis in the next section, using an invertible matrix does make one part of the privacy proof easier. However, the other part of the privacy proof depends on using a uniformly distributed random matrix to avoid information leakage that can lead to probabilistic attacks. While there is a natural uniform distribution on all orthogonal matrices that is well studied in the literature, there is no natural uniform distribution on the set of all invertible matrices. The uniformly distributed orthogonal matrix B does provide sufficient privacy protection when combined with the addition of the noise X2.

2.1. Privacy Analysis of the TM2 Setup.

To rigorously study the privacy protection issues in this data collection process, we analyze the information that can be accessed by each party and analyze whether such information allows inference of the individual level data.

First, we illustrate how to analyze the privacy protection assuming that the adversary only has access to the publicly published left-masked data set ML = AX1 where A is a random n × n orthogonal matrix. The issue becomes whether an adversary can identify individual level data knowing only ML = mL.

We consider the analysis in two stages. Knowing that ML = mL restricts the possible values of X1 and can thus reveal information. In the first stage, we consider whether this support restriction on X1 (due to ML = mL) enables the identification of individual data. Let $S_{X_1}$ denote the support of X1, and let $S_{X_1}(m_L)$ denote the restricted support of X1 given that ML = mL. The privacy preservation depends on the size of $S_{X_1}(m_L)$. For example, in an extreme case, if $S_{X_1}(m_L)$ contains only one matrix, then X1 is known to everyone and data privacy cannot be protected. Generally, we show in the next section that this restricted support $S_{X_1}(m_L)$ is big enough that identification of individual data is impossible.

In the second stage, we consider whether the adversary can learn any information beyond the restriction on support which was analyzed in the first stage. Such information can enable adversaries to launch probabilistic attacks [Machanavajjhala et al., 2007, Fung et al., 2010]. Fortunately, due to the independence between the mask A and the raw data X1, we can show that the posterior density of X1 given ML = mL is the same as the prior density of X1 restricted to the support SX1(mL). Thus any loss of privacy is through the support restriction already studied in stage one. Therefore, knowing ML = mL does not identify individual level data.

Next, we consider the privacy protection for all parties involved in the whole TM2 data collection process. That is, we conduct the above two-stage privacy protection analysis given all information available to one party in the process. The data collector and the masking service provider each have access to some intermediate masked data in addition to the public data. Hence, we need to analyze privacy protection for an adversary knowing this intermediate masked data together with the public data set ML = AX1.

The data collector knows, in addition to ML = AX1, the doubly masked data A2XB. Since the data collector knows the masks A1 and B, knowing A1A2X1 and A2XB is simply equivalent to knowing A2X. Because X2 is pure noise, independent of the raw data X1, the theoretical privacy analysis for the data collector knowing A2X = (A2X1, A2X2) gives essentially the same results as the analysis for a user with access only to ML = AX1.

The masking service provider has access to the right-masked data MR = XB in addition to the public left-masked data ML = AX1. This information results in the most severe restriction on the support compared with the restrictions resulting from the other parties' knowledge. Thus, this is the weakest link for privacy preservation in the whole TM2 data collection scheme. In Section 3, we present details of the two-stage privacy protection analysis when both ML and MR are known.

3. Theoretical Analysis of Privacy Preservation of TM2

3.1. Notations, Formalizations and Technical Preliminaries.

We denote the probability densities of random matrices X1, X2, A and B as πX1(x1), πX2(x2), πA(a) and πB(b) respectively. The supports of these distributions are denoted respectively as SX1, SX2, SA and SB.

We want to study, based on the information INFO available to one party, what this party can infer about the individual level data. Here INFO includes the publicly available final left-masked data ML = AX1 and some extra information available to the particular party. The restricted support of X1 given INFO is denoted as $S_{X_1}(\text{INFO})$, which consists of all n × p1 matrices that are possible values of X1 compatible with INFO.

For example, given only the public masked data INFO = ML, the restricted support is

$$S_{X_1}(M_L) = \{U : \exists \tilde{A} \in S_A \text{ such that } \tilde{A}U = M_L\}.$$

Let $\mathcal{O}_n$ denote the set of all $n \times n$ orthogonal matrices. In the case of left masking with a random orthogonal matrix A, for any orthogonal $\bar{A} \in \mathcal{O}_n$, $U = \bar{A}X_1$ is compatible with INFO = ML. That is, $S_{X_1}(M_L) = \mathcal{O}_n X_1$. To see this, let $\tilde{A} = A\bar{A}^T$; then $\tilde{A} \in \mathcal{O}_n$ and $\tilde{A}(\bar{A}X_1) = AX_1 = M_L$. Here and throughout this paper, we use $T$ to denote the transpose of a matrix.

For the strong obfuscating guarantee, we wish to show that the extra information available to the parties in the process does not cause any privacy loss beyond the publicly released final left-masked data ML = AX1. We want to show that: (i) (stage one) the restricted support $S_{X_1}(\text{INFO})$ is the same as $S_{X_1}(M_L) = \mathcal{O}_n X_1$; and (ii) (stage two) the conditional distribution of X1 given INFO is the same as the prior distribution of X1 restricted to the support $S_{X_1}(\text{INFO})$, so that there is no privacy loss through probabilistic attacks beyond the loss from the support restriction considered in stage one.

We now formalize the precise mathematical statements to be proved in stages one and two. More precisely, for stage one, we require that the restricted support be the same as if only the public left-masked data were available:

$$S_{X_1}(\text{INFO}) = \mathcal{O}_n X_1, \tag{i}$$

for the INFO available to any one party in the process. For the second stage, we denote by $\pi_{X_1 \mid \text{INFO}}(x_1 \mid \text{INFO})$ the posterior density of X1 given INFO. The prior density $\pi_{X_1}$ restricted to the support $S_{X_1}(\text{INFO})$ is

$$\pi_{X_1 \mid S_{X_1}(\text{INFO})}(x_1) = \frac{\pi_{X_1}(x_1)}{\int_{S_{X_1}(\text{INFO})} \pi_{X_1}(x_1^*)\, dx_1^*}.$$

To show that there is no extra privacy loss beyond the support restriction considered in stage one, we prove that these two probability densities agree with each other. That is, we wish to prove

$$\pi_{X_1 \mid \text{INFO}}(x_1 \mid \text{INFO}) = \pi_{X_1 \mid S_{X_1}(\text{INFO})}(x_1). \tag{ii}$$

Definition 3.1.

A data collection process is strong collection obfuscating if conditions (i) and (ii) hold for the information INFO available to any party in this process.

A slightly weaker version is that the above property holds with high probability. Notice that the INFO available to any party in this process can be determined from the values of X1, X2, A and B, which are generated respectively from distributions with densities πX1(x1), πX2(x2), πA(a) and πB(b). Thus such INFO is generated from a probability distribution defined by πX1(x1), πX2(x2), πA(a) and πB(b). We require that, with high probability under this distribution, the generated value of INFO satisfies conditions (i) and (ii).

Definition 3.2.

A data collection process is ϵ-strong collection obfuscating if, with probability at least 1 − ϵ, conditions (i) and (ii) hold for the information INFO available to any party in this process.

Our definition of the strong collection obfuscating procedure ensures that there is no privacy loss due to observations by any party in the process beyond what is contained in the publicly released final data. This definition delineates the privacy protection in the collection process from the privacy protection in the public release of the final data ML = AX1. The theoretical analysis concentrates on the soundness of the collection process.

Given the public left-masked data ML = AX1, the statistic $X_1^TX_1$ is released to the user, since $(AX_1)^T(AX_1) = X_1^TA^TAX_1 = X_1^TX_1$. The user therefore has the exact first two statistical moments, and statistical models such as linear regression can be fitted exactly as if the user had the raw data set X1. The residuals are known up to an orthogonal matrix multiplication, so the usual statistical model diagnostic methods can also be carried out as if done on the raw data set.
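As an illustration of this utility, the following sketch (with an assumed toy data-generating model; all variable names are ours) fits the same linear regression from the raw data and from the left-masked data and obtains identical coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Toy raw data: intercept column, one covariate, and a response, all in X1.
z = rng.standard_normal(n)
y = 1.5 + 2.0 * z + 0.3 * rng.standard_normal(n)
X1 = np.column_stack([np.ones(n), z, y])           # p1 = 3 collected columns

# Published masked data: a random orthogonal left mask A (orthogonality alone
# is what matters for exactness, so no Haar sign correction is needed here).
A, _ = np.linalg.qr(rng.standard_normal((n, n)))
AX1 = A @ X1

def ols(design, response):
    """Least-squares coefficients of response on design."""
    return np.linalg.lstsq(design, response, rcond=None)[0]

beta_raw = ols(X1[:, :2], X1[:, 2])
beta_masked = ols(AX1[:, :2], AX1[:, 2])
print(beta_raw, beta_masked)                        # identical up to rounding
assert np.allclose(beta_raw, beta_masked)
```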

For continuous data, the user cannot recover the individual level data since the user only sees a linear combination of all individuals’ data, and there is no utilizable statistical distributional information other than the prior (population) density πX1. This ensures the privacy of individual data.

In practice, the types of the elements in X1 may also be known to the user. This can further restrict the support. We assume that the elements in the data matrix X1 are all encoded as numerical values (e.g., a "yes/no" answer to a question may be encoded as 1 and 0). We consider the type of data in each column, whether continuous, discrete or binary, to be public knowledge. Let $S_j$ denote the support of the type of data in the j-th column of X1. For example, if the data are continuous, then $S_j = \mathbb{R}$; if the data are binary, then $S_j = \{0, 1\}$; if the data are positive integers, then $S_j = \{1, 2, \ldots\} = \mathbb{N}^+$. Knowing the type of data in each column would restrict the support of X1 to

$$\tilde{S}_{X_1} = \{U : (\text{all entries of the } j\text{-th column of } U) \in S_j,\ j = 1, \ldots, p_1\}.$$

Then, with knowledge of both INFO and the types of data, the restricted support becomes the intersection of $\tilde{S}_{X_1}$ and $S_{X_1}(\text{INFO})$,

$$S_{X_1}(\text{INFO; TYPE}) = \tilde{S}_{X_1} \cap S_{X_1}(\text{INFO}).$$

Let $\mathcal{P}_n$ denote the set of all $n \times n$ permutation matrices. Since all permutation matrices are orthogonal and a permutation does not change the type of the elements, we have the following lemma:

Lemma 3.3.

For any strong obfuscating data collection process,

$$\mathcal{P}_n X_1 \subseteq [\tilde{S}_{X_1} \cap \mathcal{O}_n X_1] = S_{X_1}(\text{INFO; TYPE}).$$

Lemma 3.3 indicates that a strong collection obfuscating data collection process offers some privacy protection even when the data types are known. Since all row permutations of X1 are in $S_{X_1}(\text{INFO; TYPE})$, no individual can be identified without extra side information. It is not clear whether the type information can be combined with some side information (such as knowledge that a particular individual is a smoker) to reveal other individual level data. However, notice that any weakness in this respect is inherently due to releasing the public data AX1. Our strong collection obfuscating procedure ensures that no extra privacy loss is added during the process beyond the privacy loss in releasing AX1.
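A small sketch of what Lemma 3.3 asserts, using an assumed toy data matrix: a row permutation of X1 stays in the restricted support because it preserves both the published summary statistics and the column data types.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 6
# Toy X1 with one binary column and one continuous column.
X1 = np.column_stack([rng.integers(0, 2, n), rng.standard_normal(n)]).astype(float)

P = np.eye(n)[rng.permutation(n)]     # a random n x n permutation matrix

# P X1 is an orthogonal transform of X1 (same Gram matrix, hence the same
# published summary statistics), and each column keeps its data type.
assert np.allclose((P @ X1).T @ (P @ X1), X1.T @ X1)
assert set(np.unique((P @ X1)[:, 0])) <= {0.0, 1.0}
```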

As we discussed in the previous section, the party with the most information during the TM2 process is the masking service provider who knows INFO = (ML, MR). Therefore, in the next section, we study when (i) and (ii) hold for INFO = (ML, MR). Here we first state some technical preliminary results on the characterization of the uniform distribution on orthogonal matrices. These preliminary results are used in studying the second stage condition (ii) later.

Under the matrix multiplication, the orthogonal matrices form a compact Hausdorff topological group On. Therefore, there is a unique Haar measure μ(·) on On such that the measure of the whole sample space On equals one. Then this Haar measure induces a natural uniform distribution on On. See Chapter 2 of Zhang [2014] for a detailed technical equivalent characterization of the uniform distribution on On. Since a Haar measure μ(·) is invariant under the matrix multiplication, the uniform distribution is also invariant under the matrix multiplication.

Lemma 3.4.

Let π0(·) denote the probability density function of the uniform distribution on $\mathcal{O}_n$. Then for any orthogonal matrix $A_0 \in \mathcal{O}_n$,

$$\pi_0(a) = \pi_0(A_0 a) = \pi_0(a A_0), \quad \text{for all } a \in \mathcal{O}_n. \tag{3.1}$$

Also, the product of two independent uniformly distributed orthogonal matrices is again uniformly distributed.

Lemma 3.5.

If A1 ~ π0 and A2 ~ π0 are independent of each other, then their product A = A1A2 also follows the uniform distribution π0 on On.

The proof is straightforward and can be found in Chapter 2 of Zhang [2014].

In the TM2 scheme, when the data collector and the masking service provider generate the random orthogonal matrices A1 and A2, respectively, according to π0, the mask A = A1A2 for the publicly released data set is also uniformly distributed. In practice, uniformly distributed random orthogonal matrices can be generated using the algorithms described in Heiberger [1978], Anderson et al. [1987], Wu et al. [2017b].
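One common construction consistent with these references (a sketch under our own choices of dimension and sample size) is to take the QR factorization of a Gaussian matrix and correct the column signs; the snippet below also gives a rough empirical illustration of the invariance property in Lemma 3.4, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)

def haar_orthogonal(k, rng):
    """QR of a Gaussian matrix, with column signs fixed by the diagonal of R,
    gives a draw from the uniform (Haar) distribution on the orthogonal group."""
    q, r = np.linalg.qr(rng.standard_normal((k, k)))
    return q * np.sign(np.diag(r))

k = 4
A0 = haar_orthogonal(k, rng)          # a fixed orthogonal matrix

# Empirical check of the invariance in Lemma 3.4: the (0, 0) entry of a Haar
# draw Q and of A0 @ Q should have the same distribution.
draws = np.array([haar_orthogonal(k, rng) for _ in range(10000)])
q00 = draws[:, 0, 0]
aq00 = np.einsum('ij,njk->nik', A0, draws)[:, 0, 0]
print(np.quantile(q00, [0.25, 0.5, 0.75]))
print(np.quantile(aq00, [0.25, 0.5, 0.75]))   # close to the quantiles above
```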

3.2. Restricted Support Given Knowledge of Masked Data Sets.

We first prove that condition (i) holds for INFO = (ML, MR) when invertible matrices are used for the right mask B, as originally proposed by Wu et al. [2017b]. That is, $B \in \mathcal{I}_p$, where $\mathcal{I}_p$ denotes the set of all p × p invertible matrices. Condition (i) then becomes the statement that all orthogonal transformations of X1 are contained in the restricted support

$$S_{X_1}(M_L, M_R) = \{U : \exists \tilde{A} \in S_A,\ \tilde{B} \in S_B \text{ and } \tilde{U} \in S_{X_2} \text{ such that } \tilde{A}U = M_L \text{ and } (U, \tilde{U})\tilde{B} = M_R\}. \tag{3.2}$$

Theorem 3.6.

Suppose $S_A = \mathcal{O}_n$ and $S_B = \mathcal{I}_p$, $p_1 < n \le p$, and X is full rank (i.e., rank(X) = n). Then for any $P \in \mathcal{O}_n$, $PX_1 \in S_{X_1}(M_L, M_R)$. In other words, $S_{X_1}(M_L, M_R) = \mathcal{O}_n X_1 = \mathcal{O}_n M_L$.

Proof.

We need to show that, for any $P \in \mathcal{O}_n$, $U = PX_1 \in S_{X_1}(M_L, M_R)$.

Since $P \in \mathcal{O}_n$ and $A \in \mathcal{O}_n$, we have $\tilde{A} = AP^T \in \mathcal{O}_n$. Then

$$\tilde{A}U = AP^TPX_1 = AX_1 = M_L. \tag{3.3}$$

Since X is full rank and $n \le p$, there exists a $(p-n) \times p$ matrix $X^*$ such that $\begin{pmatrix} X \\ X^* \end{pmatrix}$ is full rank and thus invertible. Since $P \in \mathcal{O}_n$, $\begin{pmatrix} P^TX \\ X^* \end{pmatrix}$ is also full rank and invertible. Hence we can define an invertible matrix

$$\tilde{B} = \begin{pmatrix} X \\ X^* \end{pmatrix}^{-1} \begin{pmatrix} P^TX \\ X^* \end{pmatrix} B.$$

Also let $\tilde{U} = PX_2$, so that $(U, \tilde{U}) = PX$. We have

$$\begin{pmatrix} (U, \tilde{U}) \\ X^* \end{pmatrix} \tilde{B} = \begin{pmatrix} P & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} X \\ X^* \end{pmatrix} \tilde{B} = \begin{pmatrix} P & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} P^TX \\ X^* \end{pmatrix} B = \begin{pmatrix} X \\ X^* \end{pmatrix} B,$$

where I is the identity matrix of size $(p-n) \times (p-n)$. The first n rows of the last equation give

$$(U, \tilde{U})\tilde{B} = XB = M_R. \tag{3.4}$$

Equations (3.3) and (3.4) together mean that U satisfies (3.2), and thus U belongs to $S_{X_1}(M_L, M_R)$. □

Theorem 3.6 states that condition (i) is satisfied when X is full rank. In the original TM2 scheme proposal, the full rank condition may or may not be satisfied because it is determined by the underlying probability distribution of X1, which is outside the control of the designer of this procedure. With the modification of the extra noise matrix X2, we can ensure the full rank condition by specifying the noise generation mechanism. In particular, we specify that each individual data provider generates a p2-dimensional noise vector with i.i.d. elements from a Gaussian distribution, with $p_2 \ge n$. This ensures, with probability one, that X is indeed full rank.

Remark 1. (Size of the right mask)

For privacy preservation, the size p of the right mask has to be at least as large as the data size n, as assumed in Theorem 3.6. When p < n, some rows of MR are linearly dependent on others, which provides a further restriction on the support. We provide a counterexample in Appendix A to illustrate that such a restriction, together with knowledge of the data type, can reveal individual level data.

Above we considered the support restriction under the original TM2 scheme proposal [Wu et al., 2017b] of an invertible right mask, $S_B = \mathcal{I}_p$. However, unlike $\mathcal{O}_p$, $\mathcal{I}_p$ does not form a compact Hausdorff topological group. Therefore, there exists no uniform distribution on $\mathcal{I}_p$. Due to the non-uniformity of B, the posterior distribution of X1 given (ML, MR) leaks information beyond the support restriction, and thus the second stage condition (ii) no longer holds. This makes the usage of random invertible right masks in the TM2 scheme very tricky. It is unclear what distribution on $\mathcal{I}_p$ should be used to generate the random invertible B.

Here, we consider the modification of the TM2 scheme where the right mask B is a random orthogonal matrix generated from the uniform distribution π0 on Op. We show that if the random noise X2 is large enough, then condition (i) still holds when the orthogonal right mask B is used.

Let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote the minimum and the maximum eigenvalues of a positive semi-definite matrix M. The restricted support will remain big if the noise is large enough:

$$\lambda_{\min}(M_RM_R^T - X_1X_1^T) = \lambda_{\min}(X_2X_2^T) > \lambda_{\max}(X_1X_1^T). \tag{3.5}$$

Now we have a result similar to Theorem 3.6.

Theorem 3.7.

Suppose $S_A = \mathcal{O}_n$ and $S_B = \mathcal{O}_p$, $p_1 < n \le p$. If condition (3.5) holds, then $S_{X_1}(M_L, M_R) = \mathcal{O}_n X_1$.

The proof is provided in Appendix B.

Next, we show that condition (ii) also holds under condition (3.5). We then discuss how achievable the technical condition (3.5) is in practice.

3.3. Information Leakage Beyond the Support Restriction.

We now study the second stage condition (ii) by checking the amount of information an adversary can get from the posterior distribution of X1 given (ML, MR) beyond its restriction on the support of X1. Given INFO = (ML, MR), the posterior density is denoted as $\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R)$. The prior density $\pi_{X_1}$ restricted to the support $S_{X_1}(m_L, m_R)$ is denoted as $\pi_{X_1 \mid S_{X_1}(m_L, m_R)}$.

Theorem 3.8.

Let X1 be a random matrix with probability density πX1. We assume that the elements in X2 are generated i.i.d. from a Gaussian distribution with mean zero. When condition (3.5) holds, given ML and MR, the posterior density of X1 is the same as the prior density restricted on SX1(ML,MR). That is,

$$\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R) = \pi_{X_1 \mid S_{X_1}(m_L, m_R)}(x_1). \tag{3.6}$$

The proof of Theorem 3.8 is provided in Appendix D.

3.4. ϵ-strong obfuscating TM2.

Theorem 3.7 and Theorem 3.8 state, respectively, that conditions (i) and (ii) hold under condition (3.5). Combining them, we have the following theorem.

Theorem 3.9.

If

$$\Pr\left[\lambda_{\min}(X_2X_2^T) > \lambda_{\max}(X_1X_1^T)\right] \ge 1 - \epsilon, \tag{3.7}$$

then the proposed TM2 procedure is ϵ-strong collection obfuscating by Definition 3.2.

The ϵ-strong collection obfuscating property ensures that there is at most probability ϵ that the process leaks any private information beyond the publicly released data AX1. TM2 achieves this property when the technical condition (3.7) holds. To achieve condition (3.7), we generate the p2-dimensional noise vector x2 with i.i.d. Gaussian elements of mean zero and a sufficiently large variance σ2. We present a technical probability bound in Appendix C, where the probability of violating condition (3.5) decreases exponentially, and we specify a σ2 value which ensures condition (3.7). A larger variance σ2 always increases the probability that condition (3.5) holds. In practice, the variance σ2 is limited only by computational accuracy. That is, σ should not exceed the raw data values by more orders of magnitude than the machine precision allows.
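For a given design, a rough Monte Carlo check of condition (3.7) can guide the choice of σ; the dimensions, the bounded toy data, and the candidate σ values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p1, p2 = 50, 3, 100          # assumed sizes with gamma = p2 / n = 2
x_max = 1.0                      # assumed bound on the |entries| of X1

def violation_rate(sigma, reps=500):
    """Estimate Pr[lambda_min(X2 X2^T) <= lambda_max(X1 X1^T)] by simulation."""
    count = 0
    for _ in range(reps):
        X1 = rng.uniform(-x_max, x_max, size=(n, p1))
        X2 = sigma * rng.standard_normal((n, p2))
        lam_max = np.linalg.eigvalsh(X1 @ X1.T)[-1]   # largest eigenvalue
        lam_min = np.linalg.eigvalsh(X2 @ X2.T)[0]    # smallest eigenvalue
        count += lam_min <= lam_max
    return count / reps

for sigma in (1.0, 2.0, 4.0):
    print(sigma, violation_rate(sigma))   # the rate should drop as sigma grows
```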

3.5. Extension to Alleviate Collusion Risks.

We have shown that the privacy of individual data can be protected when no party in the TM2 scheme knows all the masks. However, there are also risks of collusion among different parties in the procedure. Since the right mask B is known to the data collector and all individual data providers, if one of them shares this information with the masking service provider, then the privacy protection can be broken.

Wu et al. [2017a] proposed ideas to protect against this collusion risk using ideas from multiparty computation. For each individual, the data vector x can be broken into $K_1$ random components $x^{(1)}, \ldots, x^{(K_1)}$ with $x = x^{(1)} + \cdots + x^{(K_1)}$. These components are sent to $K_1$ right masking service providers, one to each. The resulting masked data $X^{(i)}B_i$, $i = 1, \ldots, K_1$, are then sent to the left masking service provider to be merged into the doubly masked data $AX^{(i)}B_i$, $i = 1, \ldots, K_1$. For further protection, they can be passed through $K_2$ left masking service providers to generate $A_{K_2}\cdots A_1X^{(i)}B_i$. Let $A = A_{K_2}A_{K_2-1}\cdots A_1$. Then the masked data $AX^{(i)}B_i$, $i = 1, \ldots, K_1$, are sent back to the corresponding right masking service providers to remove the right masking. The resulting $AX^{(i)}$, $i = 1, \ldots, K_1$, are sent to the data collector to generate $AX = AX^{(1)} + \cdots + AX^{(K_1)}$. Unless all $K_1$ right (or all $K_2$ left) masking service providers collude, they cannot find the values of all the components $X^{(1)}, \ldots, X^{(K_1)}$. A sketch of the additive splitting appears below.
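A minimal sketch of the additive-splitting idea (simplified to a single left mask standing in for $A_{K_2}\cdots A_1$, with Gaussian shares as an arbitrary assumption; as discussed next, the proper share distribution is an open question):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p1, K1 = 20, 3, 3

X1 = rng.standard_normal((n, p1))

# Split X1 into K1 additive shares: the first K1 - 1 are random, the last one
# makes the shares sum back to X1.
shares = [rng.standard_normal((n, p1)) for _ in range(K1 - 1)]
shares.append(X1 - sum(shares))
assert np.allclose(sum(shares), X1)

# A single left mask A applied to every share.
A, _ = np.linalg.qr(rng.standard_normal((n, n)))
masked_shares = [A @ s for s in shares]

# The data collector only ever sees the masked shares and their sum A X1.
AX1 = sum(masked_shares)
assert np.allclose(AX1, A @ X1)
```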

The stage one analysis of this extended TM2 scheme proceeds similarly to the analysis above, and the restricted support condition (i) holds given condition (3.5). The stage two analysis is more involved, since the posterior distribution of X1, given some of the shares, depends on the distribution of the shares. Which random distributions the shares should be generated from so that no additional privacy loss occurs remains an open question and will be investigated in future work.

4. Discussions and Conclusions

This paper conducts a theoretical analysis of privacy preservation in a modified TM2 scheme. Random noise is used together with uniformly distributed orthogonal matrix masks to hide individual data during the data collection process. The noise addition in the first step of the TM2 scheme is similar to the idea of noise-perturbed response schemes. However, the critical difference is that our noise addition is only intended to help mask data in transit, and it is in fact removed after the right mask is removed. The resulting published data set is a left-masked data set with exact summary statistics, unlike many other noise addition schemes where the summary statistics are only randomly approximated.

This work aims to protect against unscrupulous access to the raw data X1, which is traditionally held by a trusted operator. We would like to further clarify the relationship to differential privacy methods [Dwork et al., 2006], which aim to provide strong privacy protection and closure under composition of multiple accesses to the database. There are two types of differential privacy models. In the central model, a trusted database operator holds the raw data and releases noise-perturbed summary statistics for inquiries. In the local model [Evfimievski et al., 2003, Dwork et al., 2006, Kasiviswanathan et al., 2011, Cormode et al., 2018], noise is added at the individual level based on the idea of randomized response methods [Warner, 1965, Blair et al., 2015]. The local differential privacy procedures similarly address the issue of an untrustworthy central database operator. In recent years, Google [Erlingsson et al., 2014], Apple [Thakurta et al., 2017] and Microsoft [Ding et al., 2017] have all developed and deployed local differential privacy procedures in data collection.

There are two types of possible unscrupulous access to the raw data X1 to be addressed. The first is that the data collector is untrustworthy. The second is that an unscrupulous party might break into the server containing data collected by an honest data collector. In the differential privacy literature, the first type is handled by using local differential privacy procedures, while the second type is addressed via pan-private data analysis [Dwork et al., 2010]. Our TM2 scheme protects against both types of unscrupulous access, but only allows for a one-shot collection of each individual's data.

While both the local differential privacy procedures and the TM2 scheme can provide protection against unscrupulous access, the goals are somewhat different. The TM2 scheme aims to collect a masked data set that preserves the first two statistical moments of the variables (note that $X_1^TX_1$ is computable from the publicly available AX1). This allows exact statistical inference on quantities depending on these statistical moments. The local differential privacy methods, on the other hand, aim to provide a stronger privacy protection under composition of multiple data collections/accesses.

The idea of the TM2 scheme is similar to secure multi-party computation (SMC) procedures, in that this scheme tries to distribute information among parties so that each party does not get access to individual level data other than its own. There are also important differences between TM2 and SMC. They differ in their designed purposes even though both want parties to cooperate in a joint task while keeping privacy. SMC is designed to conduct joint statistical analysis without the parties revealing their data to each other. TM2 wants to collect the masked data set, which enables statistical analysis, without parties revealing the actual data to the data collector. Operationally, SMC requires distributed storage of data as well as distributed computation. Specifically, if we require that the private data of parties never leave their devices, then SMC needs the parties to stand by ready for any statistical analysis that may occur much later in the future. In contrast, the TM2 method is only distributive in the data collection stage. The private data leaves the parties’ devices in a masked form, and later is centrally stored in masked form AX1. Since all future statistical analysis is conducted on the publicly released AX1, there is no need for the parties to be available for future analysis.

In this paper, we presented a privacy analysis that clearly separates the risks coming from the support restriction and the risks of probabilistic attacks beyond the support restriction. With this analysis, we show that the TM2 scheme can safely collect a synthetic data set AX1, which is a random orthogonal transformation of the raw data set X1. All information during the data collection procedure is masked, and no one during the procedure can access the raw data set. This removes the issue of trusting a data record keeper and provides a new tool for researchers to collect data allowing exact statistical inference for linear models while providing a well-defined privacy protection: no hacking attack against a party in the data collection procedure can access real individual level data, since no party has enough information to infer the private individual data.

Figure 1: This diagram shows each party's knowledge about the data and the masking matrices in the modified TM2 method. Each party knows some masked version of the data: XB for the masking service provider, A2X for the data collector, and A1A2X1 for everybody including the public. Nobody knows the original data X1, with each data provider (participant) knowing only his/her own row x1.

Acknowledgment

Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM118737, and by the National Science Foundation under Award Number CNS-1563697.

Appendix A. A Counter Example

We illustrate that $p \ge n$ and the full rank condition on X are needed for privacy preservation in the TM2 scheme through a simple counterexample. We consider a 3 × 2 matrix X, where the first column X1 contains binary sensitive information and the second column X2 contains continuous random noise. Suppose that only one of the three individuals answered "1" on the sensitive question, so that the data matrix is

$$X = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{pmatrix} = \begin{pmatrix} 1 & a_1 \\ 0 & a_2 \\ 0 & a_3 \end{pmatrix}. \tag{A.1}$$

We decompose $X = \begin{pmatrix} X_a \\ X_b \end{pmatrix}$ with the first two rows as $X_a$ and the last row as $X_b$. Without loss of generality, we assume that $a_2 \ne 0$, so that $X_a = \begin{pmatrix} 1 & a_1 \\ 0 & a_2 \end{pmatrix}$ is non-singular, and we assume that $a_3/a_2$ is not an integer. Then the first column $X_1 = (1, 0, 0)^T$ can be uniquely determined from the masked data $M_R$.

To see this, we decompose $M_R = \begin{pmatrix} M_a \\ M_b \end{pmatrix}$ analogously to the decomposition of $X = \begin{pmatrix} X_a \\ X_b \end{pmatrix}$. Then $M_bM_a^{-1} = (X_bB)(X_aB)^{-1} = X_bBB^{-1}X_a^{-1} = X_bX_a^{-1}$ is known to anyone with access to $M_R$. Using (A.1), this means $X_bX_a^{-1} = (0, a_3/a_2)$ is determined from the masked data $M_R$. Then the first column of the identity $(X_bX_a^{-1})X_a = X_b$ indicates that $(0, a_3/a_2)(x_{11}, x_{21})^T = x_{31}$. That is, $(a_3/a_2)x_{21} = x_{31}$.

Since $x_{21}$ and $x_{31}$ are binary entries in X1 and $a_3/a_2$ is not an integer, the attacker can infer from $(a_3/a_2)x_{21} = x_{31}$ that $x_{21} = x_{31} = 0$. We must then have $x_{11} = 1$, because $M_R$ (and thus X) is full rank. That is, every entry of $X_1 = (1, 0, 0)^T$ is now known from the masked data $M_R$.
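The following sketch reproduces this attack numerically; the particular values of $a_1, a_2, a_3$ and the random invertible mask B are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Data matrix from (A.1): binary sensitive column plus one noise column.
a1, a2, a3 = 0.7, 1.3, 2.0                 # a3 / a2 is not an integer
X = np.array([[1.0, a1],
              [0.0, a2],
              [0.0, a3]])
B = rng.standard_normal((2, 2))            # a 2 x 2 invertible right mask
M_R = X @ B                                 # what the masking provider sees

# Attack: with p = 2 < n = 3, the last row of M_R is a linear combination of
# the first two, and the combination coefficients do not depend on B.
M_a, M_b = M_R[:2], M_R[2:]
coef = M_b @ np.linalg.inv(M_a)            # equals X_b X_a^{-1} = (0, a3/a2)
print(coef)

# Since coef @ (x11, x21)^T = x31 with binary x11, x21, x31 and coef[0, 1]
# not an integer, the only consistent binary solution is x21 = x31 = 0,
# and full rank forces x11 = 1, recovering the sensitive column exactly.
```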

Notice that, according to Lemma 3.3, a strong collection obfuscating procedure would not have allowed this identification of individual data, since all row permutations of X1 would remain in the restricted support. There is indeed additional privacy loss when the assumption $n \le p$ is violated. In general, when p < n and $M_R$ is full rank, the linear dependencies among the rows of X revealed by $M_bM_a^{-1} = X_bX_a^{-1}$, along with knowledge of the data type, may leak sensitive information about the original data X.

Appendix B. Proof of Theorem 3.7.

Proof.

The same argument as in the proof of Theorem 3.6 shows that (3.3) holds. Therefore we only need to show that there exist $\tilde{U}$ and $\tilde{B}$ satisfying (3.4): $(PX_1, \tilde{U})\tilde{B} = XB = M_R$.

Using condition (3.5), we have

$$\lambda_{\min}(X_1X_1^T + X_2X_2^T - PX_1X_1^TP^T) \ge \lambda_{\min}(X_1X_1^T) + \lambda_{\min}(X_2X_2^T) - \lambda_{\max}(PX_1X_1^TP^T) \ge \lambda_{\min}(X_2X_2^T) - \lambda_{\max}(X_1X_1^T) > 0.$$

Hence $X_1X_1^T + X_2X_2^T - PX_1X_1^TP^T$ is a positive definite matrix. Therefore, there exists a matrix $\tilde{U}$ such that

$$\tilde{U}\tilde{U}^T = X_1X_1^T + X_2X_2^T - PX_1X_1^TP^T. \tag{B.1}$$

This is equivalent to

$$(PX_1, \tilde{U})(PX_1, \tilde{U})^T = \tilde{U}\tilde{U}^T + PX_1X_1^TP^T = X_1X_1^T + X_2X_2^T = XX^T = M_RM_R^T. \tag{B.2}$$

Now we apply a singular value decomposition $M_R = SDV$, where $S \in \mathcal{O}_n$, $V \in \mathcal{O}_p$ and D is a diagonal matrix with nonincreasing nonnegative diagonal elements. Then, due to (B.2), the singular value decomposition of $(PX_1, \tilde{U})$ is $SD\tilde{V}$ for some $\tilde{V} \in \mathcal{O}_p$. Therefore $\tilde{B} = \tilde{V}^TV$ is an orthogonal matrix satisfying (3.4). □

Appendix C. Bound on condition (3.7).

To achieve condition (3.7), we can specify a noise distribution with sufficiently large values. Let $x_{\max}$ denote the largest possible absolute value of the entries in X1. Then

$$\lambda_{\max}(X_1X_1^T) = \|X_1\|_2^2 \le \|X_1\|_F^2 = \sum_{i=1}^{n}\sum_{j=1}^{p_1} x_{1,ij}^2 \le np_1x_{\max}^2 = C_n,$$

where $\|\cdot\|_2$ and $\|\cdot\|_F$ are the operator norm and the Frobenius norm, respectively. Note that $C_n = np_1x_{\max}^2$ is a constant known to the designer of the TM2 scheme. Condition (3.5) holds when $\lambda_{\min}(X_2X_2^T)$ exceeds this constant.

We require each data provider to generate a $p_2$-dimensional random noise vector with i.i.d. Gaussian elements of mean zero and variance $\sigma^2$. Assuming that $\gamma = p_2/n > 1$, as $n \to \infty$, Corollary 13 in Ledoux et al. [2010] provides a probability bound on $\lambda_{\min}(X_2X_2^T)$ for any $\delta > 0$:

$$\Pr\left[\lambda_{\min}(X_2X_2^T) \le (\sqrt{\gamma}-1)^2 n\sigma^2(1-\delta)\right] \le C_0 e^{-n\delta^{3/2}/C_0},$$

for some constant $C_0$. The bound on the right side decreases exponentially in n, so that, for large n, it can be made smaller than $\epsilon$ for some $\delta < 1$. Choosing $\sigma^2 > C_n/[(\sqrt{\gamma}-1)^2 n(1-\delta)]$ will then ensure that (3.7) holds.
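As a worked example of this choice, treating $x_{\max}$, n, $p_1$, $\gamma$ and $\delta$ as assumed design inputs, the implied lower bound on $\sigma^2$ can be computed directly from the expressions above.

```python
import numpy as np

n, p1 = 1000, 10
x_max = 5.0                     # assumed bound on the |entries| of X1
gamma, delta = 2.0, 0.5         # assumed p2 = gamma * n and slack delta < 1

C_n = n * p1 * x_max**2                          # C_n = n p1 x_max^2
sigma2_min = C_n / ((np.sqrt(gamma) - 1)**2 * n * (1 - delta))
print(sigma2_min)               # any sigma^2 above this satisfies the bound
```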

Appendix D. Proof of Theorem 3.8.

Proof.

We study the posterior density

$$\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R) = \frac{\pi_{(X_1, M_L, M_R)}(x_1, m_L, m_R)}{\int_{S_{X_1}(m_L, m_R)} \pi_{(X_1, M_L, M_R)}(x_1^*, m_L, m_R)\, dx_1^*}, \tag{D.1}$$

and compare it with the prior density $\pi_{X_1}(x_1)$ restricted to the support $S_{X_1}(m_L, m_R)$.

Recall that the probability densities for X1, X2, A and B at values X1 = x1, X2 = x2, A = a and B = b are denoted respectively as πX1(x1), πX2(x2), πA(a) and πB(b). Due to the independence of the generation mechanism of these quantities, their joint density is

$$\pi_{(X_1, X_2, A, B)}(x_1, x_2, a, b) = \pi_{X_1}(x_1)\,\pi_{X_2}(x_2)\,\pi_A(a)\,\pi_B(b), \tag{D.2}$$

for $(x_1, x_2, a, b) \in S_{X_1} \times S_{X_2} \times S_A \times S_B$.

Since the elements of X2 are i.i.d. from the Gaussian distribution N(0, σ2),

$$\pi_{X_2}(x_2) = \frac{1}{(\sqrt{2\pi}\,\sigma)^{np_2}}\, e^{-\frac{\sum_{1 \le i \le n,\, 1 \le j \le p_2} x_{2,ij}^2}{2\sigma^2}} = f(\|x_2\|_F^2),$$

where $f(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^{np_2}} e^{-\frac{x}{2\sigma^2}}$ and $\|x_2\|_F^2 = \sum_{1 \le i \le n,\, 1 \le j \le p_2} x_{2,ij}^2$, with $\|\cdot\|_F$ denoting the Frobenius norm. Thus the joint density becomes

$$\pi_{(X_1, X_2, A, B)}(x_1, x_2, a, b) = \pi_{X_1}(x_1)\, f(\|x_2\|_F^2)\, \pi_A(a)\, \pi_B(b). \tag{D.3}$$

We now plug (D.3) into (D.1) for calculation.

First, we calculate the numerator $\pi_{(X_1, M_L, M_R)}(x_1, m_L, m_R)$ in (D.1). We denote the restricted sample spaces of the random variables A and B, given knowledge of some other quantities, as

$$S_A(x_1, m_L) = \{a : ax_1 = m_L\}, \qquad S_B(x_1, m_R) = \{b : \exists x_2 \text{ such that } (x_1, x_2)b = m_R\}. \tag{D.4}$$

Notice that given A = a and $M_L = m_L$, we have $X_1 = a^Tm_L$. Also, given B = b and $M_R = m_R$, we have $(X_1, X_2) = m_Rb^T$, so that

$$\|X_2\|_F^2 = \operatorname{trace}(X_2X_2^T) = \operatorname{trace}[m_Rb^Tbm_R^T - X_1X_1^T] = \operatorname{trace}(m_Rm_R^T) - \operatorname{trace}(X_1X_1^T).$$

Hence, given A = a, $M_L = m_L$ and B = b, we have

$$\|X_2\|_F^2 = \operatorname{trace}(m_Rm_R^T) - \operatorname{trace}(a^Tm_Lm_L^Ta) = \operatorname{trace}(m_Rm_R^T) - \operatorname{trace}(m_Lm_L^T).$$

Then using this and equation (D.3), we have

$$\begin{aligned}
\pi_{(X_1, M_L, M_R)}(x_1, m_L, m_R) &= \int_{S_A(x_1, m_L)}\left\{\int_{S_B(x_1, m_R)} \pi_{X_1}(x_1)\, f[\operatorname{trace}(m_Rm_R^T)-\operatorname{trace}(m_Lm_L^T)]\, \pi_A(a)\,\pi_B(b)\, db\right\} da\\
&= \pi_{X_1}(x_1)\, f[\operatorname{trace}(m_Rm_R^T)-\operatorname{trace}(m_Lm_L^T)] \int_{S_A(x_1, m_L)}\left\{\int_{S_B(x_1, m_R)} \pi_A(a)\,\pi_B(b)\, db\right\} da\\
&= \pi_{X_1}(x_1)\, f[\operatorname{trace}(m_Rm_R^T)-\operatorname{trace}(m_Lm_L^T)] \int_{S_A(x_1, m_L)}\left[\int_{S_B(x_1, m_R)} \pi_B(b)\, db\right] \pi_A(a)\, da\\
&= \pi_{X_1}(x_1)\, f[\operatorname{trace}(m_Rm_R^T)-\operatorname{trace}(m_Lm_L^T)] \left[\int_{S_A(x_1, m_L)} \pi_A(a)\, da\right]\left[\int_{S_B(x_1, m_R)} \pi_B(b)\, db\right].
\end{aligned} \tag{D.5}$$

Now, for any pair $x_1$ and $x_1^*$ that both belong to $S_{X_1}(m_L, m_R)$, there exist $(a, a^*)$ such that $ax_1 = m_L = a^*x_1^*$. Denote $A_0 = a^{-1}a^*$. Then $A_0x_1^* = a^{-1}m_L = x_1$ and $A_0^{-1}x_1 = x_1^*$. Hence for any $\tilde{a} \in S_A(x_1, m_L)$ we have

$$\tilde{a}A_0x_1^* = \tilde{a}x_1 = m_L,$$

i.e., $\tilde{a}A_0 \in S_A(x_1^*, m_L)$. On the other hand, for any $\bar{a} \in S_A(x_1^*, m_L)$, $\bar{a}A_0^{-1}x_1 = \bar{a}x_1^* = m_L$, i.e., $\bar{a}A_0^{-1} \in S_A(x_1, m_L)$. Taken together, we have a one-to-one mapping between the two sets $S_A(x_1, m_L)$ and $S_A(x_1^*, m_L)$. In particular,

$$S_A(x_1^*, m_L) = S_A(x_1, m_L)A_0. \tag{D.6}$$

Hence, for the uniform density $\pi_A = \pi_0$, (D.6) and Lemma 3.4 imply that

$$\int_{S_A(x_1, m_L)} \pi_A(a)\, da = \int_{S_A(x_1^*, m_L)} \pi_A(a)\, da. \tag{D.7}$$

Plugging (D.5) and (D.7) into (D.1) and canceling the common factors, we get

$$\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R) = \frac{\pi_{X_1}(x_1)\left[\int_{S_B(x_1, m_R)} \pi_B(b)\, db\right]}{\int_{S_{X_1}(m_L, m_R)} \pi_{X_1}(x_1^*)\left[\int_{S_B(x_1^*, m_R)} \pi_B(b)\, db\right] dx_1^*}. \tag{D.8}$$

Next, for any pair $x_1$ and $x_1^*$ that both belong to $S_{X_1}(m_L, m_R)$, there exist $(x_2, b)$ and $(x_2^*, b^*)$ such that $(x_1, x_2)b = m_R = (x_1^*, x_2^*)b^*$. Let $B_0 = b(b^*)^{-1}$. Then $(x_1, x_2)B_0 = (x_1^*, x_2^*)$. Similarly to (D.6), we have

$$S_B(x_1, m_R) = B_0S_B(x_1^*, m_R). \tag{D.9}$$

For the uniform density $\pi_B = \pi_0$, using (D.9), Lemma 3.4 implies that

$$\int_{S_B(x_1, m_R)} \pi_B(b)\, db = \int_{S_B(x_1^*, m_R)} \pi_B(b)\, db. \tag{D.10}$$

Plugging (D.10) into equation (D.8), we have

$$\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R) = \frac{\pi_{X_1}(x_1)}{\int_{S_{X_1}(m_L, m_R)} \pi_{X_1}(x_1^*)\, dx_1^*},$$

which is exactly $\pi_{X_1 \mid S_{X_1}(m_L, m_R)}(x_1)$, proving (3.6). □

Contributor Information

A. ADAM DING, Department of Mathematics, Northeastern University, Boston, MA.

GUANHONG MIAO, Department of Biostatistics, University of Florida, Gainesville, FL.

SAMUEL S. WU, Department of Biostatistics, University of Florida, Gainesville, FL.

References

  1. Anderson TW, Olkin I, and Underhill LG. Generation of random orthogonal matrices. SIAM Journal on Scientific and Statistical Computing, 8(4):625–629, 1987. doi: 10.1137/0908055.
  2. Blair G, Imai K, and Zhou Y. Design and analysis of the randomized response technique. Journal of the American Statistical Association, 110(511):1304–1319, 2015.
  3. Brand R. Microdata protection through noise addition. In Inference Control in Statistical Databases, pages 97–116. Springer, 2002.
  4. Burridge J. Information preserving statistical obfuscation. Statistics and Computing, 13(4):321–327, 2003.
  5. Cormode G, Jha S, Kulkarni T, Li N, Srivastava D, and Wang T. Privacy at scale: Local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data, pages 1655–1658. ACM, 2018.
  6. Ding B, Kulkarni J, and Yekhanin S. Collecting telemetry data privately. In Advances in Neural Information Processing Systems, pages 3571–3580, 2017.
  7. Drechsler J and Reiter JP. Sampling with synthesis: A new approach for releasing public use census microdata. Journal of the American Statistical Association, 105(492):1347–1357, 2010.
  8. Duchi JC, Jordan MI, and Wainwright MJ. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, (accepted), 2017.
  9. Dwork C. Differential privacy. In Bugliesi M, Preneel B, Sassone V, and Wegener I, editors, ICALP (2), Lecture Notes in Computer Science, 4052:1–12, 2006.
  10. Dwork C and Naor M. On the difficulties of disclosure prevention in statistical databases or the case for differential privacy. Journal of Privacy and Confidentiality, 2(1):8, 2008.
  11. Dwork C, McSherry F, Nissim K, and Smith A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag. doi: 10.1007/11681878_14.
  12. Dwork C, Naor M, Pitassi T, Rothblum GN, and Yekhanin S. Pan-private streaming algorithms. In ICS, pages 66–80, 2010.
  13. Erlingsson Ú, Pihur V, and Korolova A. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067. ACM, 2014.
  14. Evfimievski A, Gehrke J, and Srikant R. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '03, pages 211–222, New York, NY, USA, 2003. ACM. doi: 10.1145/773153.773174.
  15. Fung BCM, Wang K, Chen R, and Yu PS. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42(4):14:1–14:53, June 2010. doi: 10.1145/1749603.1749605.
  16. Heiberger RM. Algorithm AS 127: Generation of random orthogonal matrices. Journal of the Royal Statistical Society, Series C (Applied Statistics), 27(2):199–206, 1978. URL http://www.jstor.org/stable/2346957.
  17. Huffington Post. Citigroup: $2.7 Million Stolen From Customers As Result Of Hacking. http://www.huffingtonpost.com/2011/06/27/citigroup-hack_n_885045.html. 2011.
  18. Kasiviswanathan SP, Lee HK, Nissim K, Raskhodnikova S, and Smith A. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, June 2011. doi: 10.1137/090756090.
  19. Ledoux M, Rider B, et al. Small deviations for beta ensembles. Electronic Journal of Probability, 15:1319–1343, 2010.
  20. Li N, Li T, and Venkatasubramanian S. t-closeness: Privacy beyond k-anonymity and l-diversity. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 106–115. IEEE, 2007.
  21. Liu K, Kargupta H, and Ryan J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering, 18(1):92–106, 2006.
  22. Machanavajjhala A, Kifer D, Gehrke J, and Venkitasubramaniam M. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3, 2007.
  23. Reuters. Target To Pay $10 Million To Settle Lawsuit From Massive Data Breach. http://www.huffingtonpost.com/2015/03/18/target-hack-settlement_n_6899290.html. 2015.
  24. Reuters. Equifax Says Hack Potentially Exposed Details Of 143 Million Consumers. https://www.huffingtonpost.com/entry/equifax-says-hack-potentially-exposed-details-of-143-million-consumers_us_59b1bc2de4b0354e4410b33e. 2017.
  25. Rubin DB. Satisfying confidentiality constraints through the use of synthetic multiply imputed microdata. Journal of Official Statistics, 9:461–468, 1993.
  26. Sweeney L. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
  27. Thakurta AG, Vyrros AH, Vaishampayan US, Kapoor G, Freudiger J, Sridhar VR, and Davidson D. Learning new words. US Patent 9594741B1, March 2017.
  28. Ting D, Fienberg SE, and Trottini M. Random orthogonal matrix masking methodology for microdata release. International Journal of Information and Computer Security, 2(1):86–105, 2008.
  29. Warner SL. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
  30. Wu SS, Chen S, Bhattacharjee A, and He Y. Collusion resistant multi-matrix masking for privacy-preserving data collection. In IEEE 3rd International Conference on Big Data Security on Cloud (BigDataSecurity), pages 1–7. IEEE, 2017a.
  31. Wu SS, Chen S, Burr DL, and Zhang L. A new data collection technique for preserving privacy. Journal of Privacy and Confidentiality, 7(3):99–129, 2017b.
  32. Zhang L. On Security Properties of Random Matrix Masking. PhD thesis, University of Florida, January 2014. URL http://search.proquest.com/docview/1876889174/.
