Published in final edited form as: J Priv Confid. 2020 Jun;10(2). doi: 10.29012/jpc.674

ON THE PRIVACY AND UTILITY PROPERTIES OF TRIPLE MATRIX-MASKING

A ADAM DING 1, GUANHONG MIAO 2, SAMUEL S WU 3

Abstract

Privacy protection is an important requirement in many statistical studies. A recently proposed data collection method, triple matrix-masking, retains exact summary statistics without exposing the raw data at any point in the process. In this paper, we provide theoretical formulation and proofs showing that a modified version of the procedure is strong collection obfuscating: no party in the data collection process is able to gain knowledge of the individual level data, even with some partially masked data information in addition to the publicly published data. This provides a theoretical foundation for the usage of such a procedure to collect masked data that allows exact statistical inference for linear models, while preserving a well-defined notion of privacy protection for each individual participant in the study. This paper fits into a line of work tackling the problem of how to create useful synthetic data without having a trustworthy data aggregator. We achieve this by splitting the trust between two parties, the “masking service provider” and the “data collector.”

Keywords: Data masking, privacy protection, matrix masking

1. Introduction

In the digital age, vast amounts of data become available for research. At the same time, there is increasing pressure to protect the privacy of study subjects when their data are used. For medical research, the Health Insurance Portability and Accountability Act of 1996 and subsequent rulings have imposed legal requirements for privacy protection on the collection and handling of health data. Among other things, basic privacy protection measures include the removal of all personal identifiers when releasing data for use. However, simply removing the personal identifier variables does not prevent possible identification of the individual from other variables. To prevent the identification of an individual record, researchers have shown that released data should be aggregated to satisfy privacy conditions such as k-anonymity [Sweeney, 2002], l-diversity [Machanavajjhala et al., 2007] and t-closeness [Li et al., 2007].

However, releasing data only at aggregated levels severely restricts its usefulness in many research studies. Alternatively, methods have been designed to release obfuscated micro-data that allow the usual statistical analyses while preserving privacy at the individual level. Some examples of such obfuscated micro-data publishing are: noise addition [Brand, 2002], multiple imputation [Rubin, 1993, Drechsler and Reiter, 2010], information preserving statistical obfuscation [Burridge, 2003], random projection based perturbation [Liu et al., 2006], and random orthogonal matrix masking [Ting et al., 2008]. In particular, in the random orthogonal matrix masking scheme, a masked data set AX is published, where X denotes the data matrix of real responses and A is a random orthogonal matrix. The published data AX keeps the exact values of the sufficient statistics of linear models, thus allowing exact statistical inference for many standard data analysis methods [Ting et al., 2008, Wu et al., 2017b] while protecting privacy by denying the user direct access to the raw data X. While the above methods all protect the privacy of individual entries by publishing only randomly perturbed micro-data, the privacy protection can be lost when multiple micro-data sets from multiple inquiries to the same database are combined. Differential privacy was proposed to quantify the effectiveness of privacy protection of random noise addition/perturbation schemes [Dwork et al., 2006, Dwork, 2006, Dwork and Naor, 2008] against multiple inquiries to the database. The noise level can then be adjusted to achieve a quantified tradeoff between inference efficacy and privacy preservation (measured by the differential privacy metric).

Traditionally, a trustworthy data collector/manager collects the raw data and ensures privacy protection by releasing data sets with random perturbations. Such procedures, however, do not protect against attacks where an unscrupulous party gains unauthorized access to the raw data set X kept by these centers. Such security breaches are becoming more common, as shown by recent well-publicized incidents involving hacking of databases at major retailers, banks and credit bureaus [Huffington Post, 2011, Reuters, 2015, 2017].

This paper fits into a line of work tackling the problem of how to create useful synthetic data without having a trustworthy data aggregator, and provides a theoretical study of the triple matrix-masking (TM2) procedure [Wu et al., 2017b], which does not assume such a trustworthy data collector/manager. The TM2 procedure is a multi-party collection and masking system that aims to collect and publish the random orthogonal masked data set AX. We prove that, assuming no collusion between parties, no party learns more than the orbit of the data matrix under the action of the orthogonal group. More specifically, given the view of a particular party, let S be the set of data matrices that could possibly have resulted in that view. We show that S contains the full orbit of the data matrix and that, given any prior on the data matrix, the party's posterior is simply the prior restricted to S. We call data collection procedures with this property strong collection obfuscating, since any extra information beyond AX available to a party does not help in further identifying the individual level data.

In the differential privacy literature, the issue of an untrustworthy data collector can be dealt with using local differential privacy procedures [Kasiviswanathan et al., 2011], where noise is added to the individual data before it is passed to the data collector. The resulting synthetic data from differential privacy procedures, however, do not preserve exact statistics and hence require special inference procedures designed to achieve optimal statistical inference [Duchi et al., 2017]. Our TM2 procedure provides an alternative where the published masked data exactly preserve any statistic of the data that is invariant under the action of the orthogonal group. This provides a useful utility: exact statistical inference for linear models is preserved, so standard linear statistical inference procedures can be applied directly to the synthetic data resulting from the TM2 procedure. On the other hand, the TM2 procedure is only for a one-shot collection of each individual's data. When the individual data providers are sampled in multiple independent collections by different data collectors, differential privacy procedures can measure and limit the privacy leakage for the composition of the multiple collections. The TM2 procedure does not consider the privacy leakage for the composition of multiple collections.

Section 2 describes the TM2 procedure and two new modifications that make it strong obfuscating. The theoretical analysis is provided in Section 3. Section 4 provides a summary and a more detailed discussion of the relationship of the TM2 procedure to differential privacy and multi-party computation methods.

2. The Masked Data Collection Procedure TM2 and Its Modification

The privacy-preserving data collection scheme TM2 was proposed first in Wu et al. [2017b] and later expanded by Wu et al. [2017a]. We describe our modified basic version of the TM2 method here; a numerical sketch of the four steps follows the list:

Step 1. The data collectors plan the data collection, create the database structure, and program the data collection system. They randomly generate a p × p random orthogonal matrix B, which is distributed to the participants' data collection devices.

Step 2. Each participant’s data x1 (a vector of dimension p1) is collected and merged with Gaussian noise x2 (of dimension p2) into a vector x = (x1, x2) of dimension p = p1 + p2. Then x is right multiplied by B on the participant’s device, and only the resulting masked data xB leaves the device and is sent to the masking service provider.

Step 3. The masking service provider generates another n × n random orthogonal matrix A2. After receiving data from all participants, it combines the individual data xB into an n × p matrix XB, left-multiplies it by A2, and sends the doubly masked data A2XB to the data collectors.

Step 4. The data collectors multiply A2XB by B−1 to get back A2X and take the first p1 columns to get A2X1. Then the data collectors generate another n × n random orthogonal matrix A1, left multiply it to A2X1, and publish AX1 (where A = A1A2) which is accessible by all data users.
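To make the data flow concrete, here is a minimal numerical sketch of the four steps using NumPy. The dimensions, the QR-based generator of random orthogonal matrices, and all variable names are illustrative assumptions, not part of the published procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_orthogonal(k, rng):
    """Draw a k x k orthogonal matrix approximately uniformly (Haar) via QR."""
    q, r = np.linalg.qr(rng.standard_normal((k, k)))
    return q * np.sign(np.diag(r))  # fix column signs so the draw is uniform

n, p1, p2 = 50, 3, 60           # assumed sizes with p1 < n <= p = p1 + p2
p = p1 + p2
sigma = 10.0                    # assumed noise scale for the padding columns

# Step 1: data collectors generate the right mask B and distribute it.
B = haar_orthogonal(p, rng)

# Step 2: each participant pads x1 with Gaussian noise x2 and sends x @ B.
X1 = rng.integers(0, 2, size=(n, p1)).astype(float)   # toy sensitive data
X2 = sigma * rng.standard_normal((n, p2))
XB = np.hstack([X1, X2]) @ B                           # only XB leaves the devices

# Step 3: the masking service provider left-multiplies by A2.
A2 = haar_orthogonal(n, rng)
A2XB = A2 @ XB

# Step 4: data collectors remove B, drop the noise columns, apply A1, publish.
A2X = A2XB @ B.T                # B is orthogonal, so B^{-1} = B^T
A2X1 = A2X[:, :p1]
A1 = haar_orthogonal(n, rng)
AX1 = A1 @ A2X1                 # published masked data, A = A1 A2

# Utility check: the published data preserves X1^T X1 exactly (up to rounding).
assert np.allclose(AX1.T @ AX1, X1.T @ X1)
```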

A detailed theoretical analysis of the privacy guarantee of the TM2 method has been missing. This paper fills that gap by proving theoretically that this modified version of the TM2 method is strong obfuscating. We prove the strong obfuscating guarantee by showing that: (A) the extra information that any party in the process owns will not allow that party to narrow the data domain (the set of possible data values) enough to identify individual level data; and (B) there is no statistical information leakage beyond the domain restriction considered in (A).

Compared to the original TM2 scheme in Wu et al. [2017b], we make two modifications to the TM2 procedure. For the first modification, we add random Gaussian noise in Step 2. The data collector wants to collect p1 variables on n individuals, so the real response matrix X1 has dimensions n × p1. We ask each participant to generate p2 pure Gaussian noise variables on his/her device according to a fixed variance parameter σ2. Hence, the full data matrix is X = (X1, X2). For the privacy protections proved in later sections, we require that $p_1 < n \le p = p_1 + p_2$. In Step 2 of this modified procedure, the Gaussian noise x2 is mingled with the real response x1 to provide protection in addition to the random mask B. In Step 4, after the collectors get back A2X = (A2X1, A2X2), they separate the matrix and discard the noise columns. Therefore the published data set AX1 with A = A1A2 still gives the exact summary statistics, as it is only masked by A and does not contain the added noise.

For the second modification, instead of using a random invertible matrix for the right mask B as originally proposed by Wu et al. [2017b], we use a random orthogonal matrix for the right mask B. As we will see in the privacy analysis in the next section, using an invertible matrix does make one part of the privacy proof easier. However, the other part of the privacy proof depends on using a uniformly distributed random matrix to avoid information leakage that can lead to probabilistic attacks. While there is a natural uniform distribution on all orthogonal matrices that is well studied in the literature, there is no natural uniform distribution on the set of all invertible matrices. The uniformly distributed orthogonal matrix B does provide sufficient privacy protection when combined with the addition of the noise X2.

2.1. Privacy Analysis of the TM2 Setup.

To rigorously study the privacy protection issues in this data collection process, we analyze the information that can be accessed by each party and analyze whether such information allows inference of the individual level data.

First, we illustrate how to analyze the privacy protection assuming that the adversary only has access to the publicly published left-masked data set ML = AX1 where A is a random n × n orthogonal matrix. The issue becomes whether an adversary can identify individual level data knowing only ML = mL.

We consider the analysis in two stages. Knowing that ML = mL restricts the possible values of X1 and can thus reveal information. In the first stage, we consider whether this support restriction on X1 (due to ML = mL) enables the identification of individual data. Let $S_{X_1}$ denote the support of X1, and let $S_{X_1}(m_L)$ denote the restricted support of X1 given that ML = mL. The privacy preservation depends on the size of $S_{X_1}(m_L)$. For example, in an extreme case, if $S_{X_1}(m_L)$ contains only one matrix, then X1 is known to everyone and data privacy cannot be protected. Generally, we show in the next section that this restricted support $S_{X_1}(m_L)$ is big enough that identification of individual data is impossible.

In the second stage, we consider whether the adversary can learn any information beyond the restriction on support which was analyzed in the first stage. Such information can enable adversaries to launch probabilistic attacks [Machanavajjhala et al., 2007, Fung et al., 2010]. Fortunately, due to the independence between the mask A and the raw data X1, we can show that the posterior density of X1 given ML = mL is the same as the prior density of X1 restricted to the support SX1(mL). Thus any loss of privacy is through the support restriction already studied in stage one. Therefore, knowing ML = mL does not identify individual level data.

Next, we consider the privacy protection for all parties involved in the whole TM2 data collection process. That is, we conduct the above two-stage privacy protection analysis given all information available to one party in the process. The data collector and the masking service provider each have access to some intermediate masked data in addition to the public data. Hence, we need to analyze privacy protection for an adversary knowing this intermediate masked data together with the public data set ML = AX1.

The data collector knows, in addition to ML = AX1, the doubly masked data A2XB. Since the data collector knows the masks A1 and B, knowing A1A2X1 and A2XB is simply equivalent to knowing A2X. Because X2 is pure noise, independent of the raw data X1, the theoretical privacy analysis for the data collector knowing A2X = (A2X1, A2X2) gives essentially the same results as the analysis for a user with access only to ML = AX1.

The masking service provider has access to the right-masked data MR = XB in addition to the public left-masked data ML = AX1. This information results in the most severe restriction on the support compared with the restrictions resulting from the other parties' knowledge. Thus, this is the weakest link for privacy preservation in the whole TM2 data collection scheme. In Section 3, we present details of the two-stage privacy protection analysis when both ML and MR are known.

3. Theoretical Analysis of Privacy Preservation of TM2

3.1. Notations, Formalizations and Technical Preliminaries.

We denote the probability densities of random matrices X1, X2, A and B as πX1(x1), πX2(x2), πA(a) and πB(b) respectively. The supports of these distributions are denoted respectively as SX1, SX2, SA and SB.

We want to study, based on the information INFO available to one party, what this party can infer about the individual level data. Here INFO includes the publicly available final left-masked data ML = AX1 and some extra information available to the particular party. The restricted support of X1 given INFO is denoted as $S_{X_1}(\text{INFO})$, which consists of all n × p1 matrices that are possible values of X1 compatible with INFO.

For example, given only the public masked data INFO = ML, the restricted support is

$$S_{X_1}(M_L) = \{U : \exists \tilde{A} \in S_A \text{ such that } \tilde{A}U = M_L\}.$$

Let $\mathcal{O}_n$ denote the set of all $n \times n$ orthogonal matrices. In the case of left masking with a random orthogonal matrix A, for any orthogonal $\bar{A} \in \mathcal{O}_n$, $U = \bar{A}X_1$ is compatible with INFO = ML. That is, $S_{X_1}(M_L) = \mathcal{O}_n X_1$. To see this, let $\tilde{A} = A\bar{A}^T$; then $\tilde{A} \in \mathcal{O}_n$ and $\tilde{A}(\bar{A}X_1) = AX_1 = M_L$. Here and throughout this paper, we use $T$ to denote the transpose of a matrix.

For the strong obfuscating guarantee, we wish to show that the extra information available to the parties in the process does not cause any privacy loss beyond the publicly released final left-masked data ML = AX1. We want to show that: (i) (stage one) the restricted support $S_{X_1}(\text{INFO})$ is the same as $S_{X_1}(M_L) = \mathcal{O}_n X_1$; and (ii) (stage two) the conditional distribution of X1 given INFO is the same as the prior distribution of X1 restricted to the support $S_{X_1}(\text{INFO})$, so that there is no privacy loss through probabilistic attacks beyond the loss from the support restriction considered in stage one.

We now formalize the precise mathematical statements to be proved in stages one and two. More precisely, for stage one, we require that the restricted support be the same as if only the public left-masked data were available:

$$S_{X_1}(\text{INFO}) = \mathcal{O}_n X_1, \tag{i}$$

for the INFO available to any one party in the process. For the second stage, we denote by $\pi_{X_1 \mid \text{INFO}}(x_1 \mid \text{INFO})$ the posterior density of X1 given INFO. The prior density $\pi_{X_1}$ restricted to the support $S_{X_1}(\text{INFO})$ is

$$\pi_{X_1 \mid S_{X_1}(\text{INFO})}(x_1) = \frac{\pi_{X_1}(x_1)}{\int_{S_{X_1}(\text{INFO})} \pi_{X_1}(x_1^*)\, dx_1^*}.$$

To show that there is no extra privacy loss beyond the support restriction considered in stage one, we prove that these two probability densities agree with each other. That is, we wish to prove

$$\pi_{X_1 \mid \text{INFO}}(x_1 \mid \text{INFO}) = \pi_{X_1 \mid S_{X_1}(\text{INFO})}(x_1). \tag{ii}$$

Definition 3.1.

A data collection process is strong collection obfuscating if conditions (i) and (ii) hold for the information INFO available to any party in this process.

A slightly weaker version is that the above property holds with high probability. Notice that the INFO available to any party in this process can be determined from the values of X1, X2, A and B, which are generated respectively from distributions with densities πX1(x1), πX2(x2), πA(a) and πB(b). Thus such INFO is generated from a probability distribution defined by πX1(x1), πX2(x2), πA(a) and πB(b). We require that, with high probability under this distribution, the generated value of INFO satisfies conditions (i) and (ii).

Definition 3.2.

A data collection process is ϵ-strong collection obfuscating if, with probability at least 1 − ϵ, conditions (i) and (ii) hold for the information INFO available to any party in this process.

Our definition of the strong collection obfuscating procedure ensures that there is no privacy loss due to observations by any party in the process beyond what is contained in the publicly released final data. This definition delineates the privacy protection in the collection process from the privacy protection in the public release of the final data ML = AX1. The theoretical analysis concentrates on the soundness of the collection process.

Given the public left-masked data ML = AX1, the statistic $X_1^TX_1$ is released to the user, since $(AX_1)^T(AX_1) = X_1^TA^TAX_1 = X_1^TX_1$. The user therefore has the exact first two statistical moments, and statistical models such as linear regression can be fitted exactly as if the user had the raw data set X1. The residuals are known up to an orthogonal matrix multiplication, so the usual statistical model diagnostic methods can also be carried out as if done on the raw data set.
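As an illustration of this utility, the following sketch (with an assumed toy data-generating model; all variable names are ours) fits the same linear regression from the raw data and from the left-masked data and obtains identical coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Toy raw data: intercept column, one covariate, and a response, all in X1.
z = rng.standard_normal(n)
y = 1.5 + 2.0 * z + 0.3 * rng.standard_normal(n)
X1 = np.column_stack([np.ones(n), z, y])           # p1 = 3 collected columns

# Published masked data: a random orthogonal left mask A (orthogonality alone
# is what matters for exactness, so no Haar sign correction is needed here).
A, _ = np.linalg.qr(rng.standard_normal((n, n)))
AX1 = A @ X1

def ols(design, response):
    """Least-squares coefficients of response on design."""
    return np.linalg.lstsq(design, response, rcond=None)[0]

beta_raw = ols(X1[:, :2], X1[:, 2])
beta_masked = ols(AX1[:, :2], AX1[:, 2])
print(beta_raw, beta_masked)                        # identical up to rounding
assert np.allclose(beta_raw, beta_masked)
```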

For continuous data, the user cannot recover the individual level data since the user only sees a linear combination of all individuals’ data, and there is no utilizable statistical distributional information other than the prior (population) density πX1. This ensures the privacy of individual data.

In practice, the types of the elements in X1 may also be known to the user. This can further restrict the support. We assume that the elements in the data matrix X1 are all encoded as numerical values (e.g., a "yes/no" answer to a question may be encoded as 1 and 0). We consider the type of data in each column, whether continuous, discrete or binary, to be public knowledge. Let $S_j$ denote the support of the type of data in the j-th column of X1. For example, if the data are continuous, then $S_j = \mathbb{R}$; if the data are binary, then $S_j = \{0, 1\}$; if the data are positive integers, then $S_j = \{1, 2, \ldots\} = \mathbb{N}^+$. Knowing the type of data in each column would restrict the support of X1 to

$$\tilde{S}_{X_1} = \{U : (\text{all entries of the } j\text{-th column of } U) \in S_j,\ j = 1, \ldots, p_1\}.$$

Then, with knowledge of both INFO and the types of data, the restricted support becomes the intersection of $\tilde{S}_{X_1}$ and $S_{X_1}(\text{INFO})$,

$$S_{X_1}(\text{INFO; TYPE}) = \tilde{S}_{X_1} \cap S_{X_1}(\text{INFO}).$$

Let $\mathcal{P}_n$ denote the set of all $n \times n$ permutation matrices. Since all permutation matrices are orthogonal and a permutation does not change the type of the elements, we have the following lemma:

Lemma 3.3.

For any strong obfuscating data collection process,

$$\mathcal{P}_n X_1 \subseteq [\tilde{S}_{X_1} \cap \mathcal{O}_n X_1] = S_{X_1}(\text{INFO; TYPE}).$$

Lemma 3.3 indicates that a strong collection obfuscating data collection process offers some privacy protection even when the data types are known. Since all row permutations of X1 are in $S_{X_1}(\text{INFO; TYPE})$, no individual can be identified without extra side information. It is not clear whether the type information can be combined with some side information (such as knowledge that a particular individual is a smoker) to reveal other individual level data. However, notice that any weakness in this respect is inherently due to releasing the public data AX1. Our strong collection obfuscating procedure ensures that no extra privacy loss is added during the process beyond the privacy loss in releasing AX1.
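A small sketch of what Lemma 3.3 asserts, using an assumed toy data matrix: a row permutation of X1 stays in the restricted support because it preserves both the published summary statistics and the column data types.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 6
# Toy X1 with one binary column and one continuous column.
X1 = np.column_stack([rng.integers(0, 2, n), rng.standard_normal(n)]).astype(float)

P = np.eye(n)[rng.permutation(n)]     # a random n x n permutation matrix

# P X1 is an orthogonal transform of X1 (same Gram matrix, hence the same
# published summary statistics), and each column keeps its data type.
assert np.allclose((P @ X1).T @ (P @ X1), X1.T @ X1)
assert set(np.unique((P @ X1)[:, 0])) <= {0.0, 1.0}
```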

As we discussed in the previous section, the party with the most information during the TM2 process is the masking service provider who knows INFO = (ML, MR). Therefore, in the next section, we study when (i) and (ii) hold for INFO = (ML, MR). Here we first state some technical preliminary results on the characterization of the uniform distribution on orthogonal matrices. These preliminary results are used in studying the second stage condition (ii) later.

Under the matrix multiplication, the orthogonal matrices form a compact Hausdorff topological group On. Therefore, there is a unique Haar measure μ(·) on On such that the measure of the whole sample space On equals one. Then this Haar measure induces a natural uniform distribution on On. See Chapter 2 of Zhang [2014] for a detailed technical equivalent characterization of the uniform distribution on On. Since a Haar measure μ(·) is invariant under the matrix multiplication, the uniform distribution is also invariant under the matrix multiplication.

Lemma 3.4.

Let π0(·) denote the probability density function of the uniform distribution on $\mathcal{O}_n$. Then for any orthogonal matrix $A_0 \in \mathcal{O}_n$,

$$\pi_0(a) = \pi_0(A_0 a) = \pi_0(a A_0), \quad \text{for all } a \in \mathcal{O}_n. \tag{3.1}$$

Also, the product of two independent uniformly distributed orthogonal matrices is again uniformly distributed.

Lemma 3.5.

If A1 ~ π0 and A2 ~ π0 are independent of each other, then their product A = A1A2 also follows the uniform distribution π0 on On.

The proof is straightforward and can be found in Chapter 2 of Zhang [2014].

In the TM2 scheme, when the data collector and the masking service provider generate the random orthogonal matrices A1 and A2, respectively, according to π0, the mask A = A1A2 for the publicly released data set is also uniformly distributed. In practice, uniformly distributed random orthogonal matrices can be generated using the algorithms described in Heiberger [1978], Anderson et al. [1987], Wu et al. [2017b].
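One common construction consistent with these references (a sketch under our own choices of dimension and sample size) is to take the QR factorization of a Gaussian matrix and correct the column signs; the snippet below also gives a rough empirical illustration of the invariance property in Lemma 3.4, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)

def haar_orthogonal(k, rng):
    """QR of a Gaussian matrix, with column signs fixed by the diagonal of R,
    gives a draw from the uniform (Haar) distribution on the orthogonal group."""
    q, r = np.linalg.qr(rng.standard_normal((k, k)))
    return q * np.sign(np.diag(r))

k = 4
A0 = haar_orthogonal(k, rng)          # a fixed orthogonal matrix

# Empirical check of the invariance in Lemma 3.4: the (0, 0) entry of a Haar
# draw Q and of A0 @ Q should have the same distribution.
draws = np.array([haar_orthogonal(k, rng) for _ in range(10000)])
q00 = draws[:, 0, 0]
aq00 = np.einsum('ij,njk->nik', A0, draws)[:, 0, 0]
print(np.quantile(q00, [0.25, 0.5, 0.75]))
print(np.quantile(aq00, [0.25, 0.5, 0.75]))   # close to the quantiles above
```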

3.2. Restricted Support Given Knowledge of Masked Data Sets.

We first prove that condition (i) holds for INFO = (ML, MR) when invertible matrices are used for the right mask B, as originally proposed by Wu et al. [2017b]. That is, $B \in \mathcal{I}_p$, where $\mathcal{I}_p$ denotes the set of all p × p invertible matrices. Condition (i) then becomes the statement that all orthogonal transformations of X1 are contained in the restricted support

$$S_{X_1}(M_L, M_R) = \{U : \exists \tilde{A} \in S_A,\ \tilde{B} \in S_B \text{ and } \tilde{U} \in S_{X_2} \text{ such that } \tilde{A}U = M_L \text{ and } (U, \tilde{U})\tilde{B} = M_R\}. \tag{3.2}$$

Theorem 3.6.

Suppose $S_A = \mathcal{O}_n$ and $S_B = \mathcal{I}_p$, $p_1 < n \le p$, and X is full rank (i.e., rank(X) = n). Then for any $P \in \mathcal{O}_n$, $PX_1 \in S_{X_1}(M_L, M_R)$. In other words, $S_{X_1}(M_L, M_R) = \mathcal{O}_n X_1 = \mathcal{O}_n M_L$.

Proof.

We need to show that, for any $P \in \mathcal{O}_n$, $U = PX_1 \in S_{X_1}(M_L, M_R)$.

Since $P \in \mathcal{O}_n$ and $A \in \mathcal{O}_n$, we have $\tilde{A} = AP^T \in \mathcal{O}_n$. Then

$$\tilde{A}U = AP^TPX_1 = AX_1 = M_L. \tag{3.3}$$

Since X is full rank and $n \le p$, there exists a $(p-n) \times p$ matrix $X^*$ such that $\begin{pmatrix} X \\ X^* \end{pmatrix}$ is full rank and thus invertible. Since $P \in \mathcal{O}_n$, $\begin{pmatrix} P^TX \\ X^* \end{pmatrix}$ is also full rank and invertible. Hence we can define an invertible matrix

$$\tilde{B} = \begin{pmatrix} X \\ X^* \end{pmatrix}^{-1} \begin{pmatrix} P^TX \\ X^* \end{pmatrix} B.$$

Also let $\tilde{U} = PX_2$, so that $(U, \tilde{U}) = PX$. We have

$$\begin{pmatrix} (U, \tilde{U}) \\ X^* \end{pmatrix} \tilde{B} = \begin{pmatrix} P & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} X \\ X^* \end{pmatrix} \tilde{B} = \begin{pmatrix} P & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} P^TX \\ X^* \end{pmatrix} B = \begin{pmatrix} X \\ X^* \end{pmatrix} B,$$

where I is the identity matrix of size $(p-n) \times (p-n)$. The first n rows of the last equation give

$$(U, \tilde{U})\tilde{B} = XB = M_R. \tag{3.4}$$

Equations (3.3) and (3.4) together mean that U satisfies (3.2), and thus U belongs to $S_{X_1}(M_L, M_R)$. □

Theorem 3.6 states that condition (i) is satisfied when X is full rank. In the original TM2 scheme proposal, the full rank condition may or may not be satisfied because it is determined by the underlying probability distribution of X1, which is outside the control of the designer of this procedure. With the modification of the extra noise matrix X2, we can ensure the full rank condition by specifying the noise generation mechanism. In particular, we specify that each individual data provider generates a p2-dimensional noise vector with i.i.d. elements from a Gaussian distribution, with $p_2 \ge n$. This ensures, with probability one, that X is indeed full rank.

Remark 1. (Size of the right mask)

For privacy preservation, the size p of the right mask has to be at least as large as the data size n, as assumed in Theorem 3.6. When p < n, some rows of MR are linearly dependent on others, which provides a further restriction on the support. We provide a counterexample in Appendix A to illustrate that such a restriction, together with knowledge of the data type, can reveal individual level data.

Above we considered the support restriction under the original TM2 scheme proposal [Wu et al., 2017b] of an invertible right mask, $S_B = \mathcal{I}_p$. However, unlike $\mathcal{O}_p$, $\mathcal{I}_p$ does not form a compact Hausdorff topological group. Therefore, there exists no uniform distribution on $\mathcal{I}_p$. Due to the non-uniformity of B, the posterior distribution of X1 given (ML, MR) leaks information beyond the support restriction, and thus the second stage condition (ii) no longer holds. This makes the usage of random invertible right masks in the TM2 scheme very tricky. It is unclear what distribution on $\mathcal{I}_p$ should be used to generate the random invertible B.

Here, we consider the modification of the TM2 scheme where the right mask B is a random orthogonal matrix generated from the uniform distribution π0 on Op. We show that if the random noise X2 is large enough, then condition (i) still holds when the orthogonal right mask B is used.

Let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote the minimum and the maximum eigenvalues of a positive semi-definite matrix M. The restricted support will remain big if the noise is large enough:

$$\lambda_{\min}(M_RM_R^T - X_1X_1^T) = \lambda_{\min}(X_2X_2^T) > \lambda_{\max}(X_1X_1^T). \tag{3.5}$$

Now we have a result similar to Theorem 3.6.

Theorem 3.7.

Suppose $S_A = \mathcal{O}_n$ and $S_B = \mathcal{O}_p$, $p_1 < n \le p$. If condition (3.5) holds, then $S_{X_1}(M_L, M_R) = \mathcal{O}_n X_1$.

The proof is provided in Appendix B.

Next, we show that condition (ii) also holds under condition (3.5). We then discuss how achievable the technical condition (3.5) is in practice.

3.3. Information Leakage Beyond the Support Restriction.

We now study the second stage condition (ii) by checking the amount of information an adversary can get from the posterior distribution of X1 given (ML, MR) beyond its restriction on the support of X1. Given INFO = (ML, MR), the posterior density is denoted as $\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R)$. The prior density $\pi_{X_1}$ restricted to the support $S_{X_1}(m_L, m_R)$ is denoted as $\pi_{X_1 \mid S_{X_1}(m_L, m_R)}$.

Theorem 3.8.

Let X1 be a random matrix with probability density πX1. We assume that the elements in X2 are generated i.i.d. from a Gaussian distribution with mean zero. When condition (3.5) holds, given ML and MR, the posterior density of X1 is the same as the prior density restricted on SX1(ML,MR). That is,

$$\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R) = \pi_{X_1 \mid S_{X_1}(m_L, m_R)}(x_1). \tag{3.6}$$

The proof of Theorem 3.8 is provided in Appendix D.

3.4. ϵ-strong obfuscating TM2.

Theorem 3.7 and Theorem 3.8 state, respectively, that conditions (i) and (ii) hold under condition (3.5). Combining them, we have the following theorem.

Theorem 3.9.

If

$$\Pr\left[\lambda_{\min}(X_2X_2^T) > \lambda_{\max}(X_1X_1^T)\right] \ge 1 - \epsilon, \tag{3.7}$$

then the proposed TM2 procedure is ϵ-strong collection obfuscating by Definition 3.2.

The ϵ-strong collection obfuscating property ensures that there is at most probability ϵ that the process leaks any private information beyond the publicly released data AX1. TM2 achieves this property when the technical condition (3.7) holds. To achieve condition (3.7), we generate the p2-dimensional noise vector x2 with i.i.d. Gaussian elements of mean zero and a sufficiently large variance σ2. We present a technical probability bound in Appendix C, where the probability of violating condition (3.5) decreases exponentially, and we specify a σ2 value which ensures condition (3.7). A larger variance σ2 always increases the probability that condition (3.5) holds. In practice, the variance σ2 is limited only by computational accuracy. That is, σ should not exceed the raw data values by more orders of magnitude than the machine precision allows.
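For a given design, a rough Monte Carlo check of condition (3.7) can guide the choice of σ; the dimensions, the bounded toy data, and the candidate σ values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p1, p2 = 50, 3, 100          # assumed sizes with gamma = p2 / n = 2
x_max = 1.0                      # assumed bound on the |entries| of X1

def violation_rate(sigma, reps=500):
    """Estimate Pr[lambda_min(X2 X2^T) <= lambda_max(X1 X1^T)] by simulation."""
    count = 0
    for _ in range(reps):
        X1 = rng.uniform(-x_max, x_max, size=(n, p1))
        X2 = sigma * rng.standard_normal((n, p2))
        lam_max = np.linalg.eigvalsh(X1 @ X1.T)[-1]   # largest eigenvalue
        lam_min = np.linalg.eigvalsh(X2 @ X2.T)[0]    # smallest eigenvalue
        count += lam_min <= lam_max
    return count / reps

for sigma in (1.0, 2.0, 4.0):
    print(sigma, violation_rate(sigma))   # the rate should drop as sigma grows
```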

3.5. Extension to Alleviate Collusion Risks.

We have shown that the privacy of individual data can be protected when no party in the TM2 scheme knows all the masks. However, there are also risks of collusion among different parties in the procedure. Since the right mask B is known to the data collector and all individual data providers, if one of them shares this information with the masking service provider, then the privacy protection can be broken.

Wu et al. [2017a] proposed ideas to protect against this collusion risk using ideas from multiparty computation. For each individual, the data vector x can be broken into $K_1$ random components $x^{(1)}, \ldots, x^{(K_1)}$ with $x = x^{(1)} + \cdots + x^{(K_1)}$. These components are sent to $K_1$ right masking service providers, one to each. The resulting masked data $X^{(i)}B_i$, $i = 1, \ldots, K_1$, are then sent to the left masking service provider to be merged into the doubly masked data $AX^{(i)}B_i$, $i = 1, \ldots, K_1$. For further protection, they can be passed through $K_2$ left masking service providers to generate $A_{K_2}\cdots A_1X^{(i)}B_i$. Let $A = A_{K_2}A_{K_2-1}\cdots A_1$. Then the masked data $AX^{(i)}B_i$, $i = 1, \ldots, K_1$, are sent back to the corresponding right masking service providers to remove the right masking. The resulting $AX^{(i)}$, $i = 1, \ldots, K_1$, are sent to the data collector to generate $AX = AX^{(1)} + \cdots + AX^{(K_1)}$. Unless all $K_1$ right (or all $K_2$ left) masking service providers collude, they cannot find the values of all the components $X^{(1)}, \ldots, X^{(K_1)}$. A sketch of the additive splitting appears below.
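A minimal sketch of the additive-splitting idea (simplified to a single left mask standing in for $A_{K_2}\cdots A_1$, with Gaussian shares as an arbitrary assumption; as discussed next, the proper share distribution is an open question):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p1, K1 = 20, 3, 3

X1 = rng.standard_normal((n, p1))

# Split X1 into K1 additive shares: the first K1 - 1 are random, the last one
# makes the shares sum back to X1.
shares = [rng.standard_normal((n, p1)) for _ in range(K1 - 1)]
shares.append(X1 - sum(shares))
assert np.allclose(sum(shares), X1)

# A single left mask A applied to every share.
A, _ = np.linalg.qr(rng.standard_normal((n, n)))
masked_shares = [A @ s for s in shares]

# The data collector only ever sees the masked shares and their sum A X1.
AX1 = sum(masked_shares)
assert np.allclose(AX1, A @ X1)
```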

The stage one analysis of this extended TM2 scheme proceeds similarly to the analysis above, and the restricted support condition (i) holds given condition (3.5). The stage two analysis is more involved, since the posterior distribution of X1, given some of the shares, depends on the distribution of the shares. Which random distributions the shares should be generated from so that no additional privacy loss occurs remains an open question and will be investigated in future work.

4. Discussions and Conclusions

This paper conducts a theoretical analysis of privacy preservation in a modified TM2 scheme. Random noise is used together with uniformly distributed orthogonal matrix masks to hide individual data during the data collection process. The noise addition in the first step of the TM2 scheme is similar to the idea of noise-perturbed response schemes. However, the critical difference is that our noise addition is only intended to help mask data in transit, and it is in fact removed after the right mask is removed. The resulting published data set is a left-masked data set with exact summary statistics, unlike many other noise addition schemes where the summary statistics are only randomly approximated.

This work aims to protect against unscrupulous access to the raw data X1, which is traditionally held by a trusted operator. We would like to further clarify the relationship to differential privacy methods [Dwork et al., 2006], which aim to provide strong privacy protection and closure under composition of multiple accesses to the database. There are two types of differential privacy models. In the central model, a trusted database operator holds the raw data and releases noise-perturbed summary statistics for inquiries. In the local model [Evfimievski et al., 2003, Dwork et al., 2006, Kasiviswanathan et al., 2011, Cormode et al., 2018], noise is added at the individual level based on the idea of randomized response methods [Warner, 1965, Blair et al., 2015]. The local differential privacy procedures similarly address the issue of an untrustworthy central database operator. In recent years, Google [Erlingsson et al., 2014], Apple [Thakurta et al., 2017] and Microsoft [Ding et al., 2017] have all developed and deployed local differential privacy procedures in data collection.

There are two types of possible unscrupulous access to the raw data X1 to be addressed. The first is that the data collector is untrustworthy. The second is that an unscrupulous party might break into the server containing data collected by an honest data collector. In the differential privacy literature, the first type is handled by using local differential privacy procedures, while the second type is addressed via pan-private data analysis [Dwork et al., 2010]. Our TM2 scheme protects against both types of unscrupulous access, but only allows for a one-shot collection of each individual's data.

While both the local differential privacy procedures and the TM2 scheme can provide protection against unscrupulous access, the goals are somewhat different. The TM2 scheme aims to collect a masked data set that preserves the first two statistical moments of the variables (note that $X_1^TX_1$ is computable from the publicly available AX1). This allows exact statistical inference on quantities depending on these statistical moments. The local differential privacy methods, on the other hand, aim to provide a stronger privacy protection under composition of multiple data collections/accesses.

The idea of the TM2 scheme is similar to secure multi-party computation (SMC) procedures, in that this scheme tries to distribute information among parties so that each party does not get access to individual level data other than its own. There are also important differences between TM2 and SMC. They differ in their designed purposes even though both want parties to cooperate in a joint task while keeping privacy. SMC is designed to conduct joint statistical analysis without the parties revealing their data to each other. TM2 wants to collect the masked data set, which enables statistical analysis, without parties revealing the actual data to the data collector. Operationally, SMC requires distributed storage of data as well as distributed computation. Specifically, if we require that the private data of parties never leave their devices, then SMC needs the parties to stand by ready for any statistical analysis that may occur much later in the future. In contrast, the TM2 method is only distributive in the data collection stage. The private data leaves the parties’ devices in a masked form, and later is centrally stored in masked form AX1. Since all future statistical analysis is conducted on the publicly released AX1, there is no need for the parties to be available for future analysis.

In this paper, we presented a privacy analysis that clearly separates the risks coming from the support restriction and the risks of probabilistic attacks beyond the support restriction. With this analysis, we show that the TM2 scheme can safely collect a synthetic data set AX1, which is a random orthogonal transformation of the raw data set X1. All information during the data collection procedure is masked, and no one during the procedure can access the raw data set. This removes the issue of trusting a data record keeper and provides a new tool for researchers to collect data allowing exact statistical inference for linear models while providing a well-defined privacy protection: no hacking attack against a party in the data collection procedure can access real individual level data, since no party has enough information to infer the private individual data.

Figure 1: This diagram shows each party's knowledge about the data and the masking matrices in the modified TM2 method. Each party knows some masked version of the data: XB for the masking service provider, A2X for the data collector, and A1A2X1 for everybody including the public. Nobody knows the original data X1, with each data provider (participant) knowing only his/her own row x1.

Acknowledgment

Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM118737, and by the National Science Foundation under Award Number CNS-1563697.

Appendix A. A Counter Example

We illustrate that $p \ge n$ and the full rank condition on X are needed for privacy preservation in the TM2 scheme through a simple counterexample. We consider a 3 × 2 matrix X, where the first column X1 contains binary sensitive information and the second column X2 contains continuous random noise. Suppose that only one of the three individuals answered "1" on the sensitive question, so that the data matrix is

$$X = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{pmatrix} = \begin{pmatrix} 1 & a_1 \\ 0 & a_2 \\ 0 & a_3 \end{pmatrix}. \tag{A.1}$$

We decompose $X = \begin{pmatrix} X_a \\ X_b \end{pmatrix}$ with the first two rows as $X_a$ and the last row as $X_b$. Without loss of generality, we assume that $a_2 \ne 0$, so that $X_a = \begin{pmatrix} 1 & a_1 \\ 0 & a_2 \end{pmatrix}$ is non-singular, and we assume that $a_3/a_2$ is not an integer. Then the first column $X_1 = (1, 0, 0)^T$ can be uniquely determined from the masked data $M_R$.

To see this, we decompose $M_R = \begin{pmatrix} M_a \\ M_b \end{pmatrix}$ analogously to the decomposition of $X = \begin{pmatrix} X_a \\ X_b \end{pmatrix}$. Then $M_bM_a^{-1} = (X_bB)(X_aB)^{-1} = X_bBB^{-1}X_a^{-1} = X_bX_a^{-1}$ is known to anyone with access to $M_R$. Using (A.1), this means $X_bX_a^{-1} = (0, a_3/a_2)$ is determined from the masked data $M_R$. Then the first column of the identity $(X_bX_a^{-1})X_a = X_b$ indicates that $(0, a_3/a_2)(x_{11}, x_{21})^T = x_{31}$. That is, $(a_3/a_2)x_{21} = x_{31}$.

Since $x_{21}$ and $x_{31}$ are binary entries in X1 and $a_3/a_2$ is not an integer, the attacker can infer from $(a_3/a_2)x_{21} = x_{31}$ that $x_{21} = x_{31} = 0$. We must then have $x_{11} = 1$, because $M_R$ (and thus X) is full rank. That is, every entry of $X_1 = (1, 0, 0)^T$ is now known from the masked data $M_R$.
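The following sketch reproduces this attack numerically; the particular values of $a_1, a_2, a_3$ and the random invertible mask B are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Data matrix from (A.1): binary sensitive column plus one noise column.
a1, a2, a3 = 0.7, 1.3, 2.0                 # a3 / a2 is not an integer
X = np.array([[1.0, a1],
              [0.0, a2],
              [0.0, a3]])
B = rng.standard_normal((2, 2))            # a 2 x 2 invertible right mask
M_R = X @ B                                 # what the masking provider sees

# Attack: with p = 2 < n = 3, the last row of M_R is a linear combination of
# the first two, and the combination coefficients do not depend on B.
M_a, M_b = M_R[:2], M_R[2:]
coef = M_b @ np.linalg.inv(M_a)            # equals X_b X_a^{-1} = (0, a3/a2)
print(coef)

# Since coef @ (x11, x21)^T = x31 with binary x11, x21, x31 and coef[0, 1]
# not an integer, the only consistent binary solution is x21 = x31 = 0,
# and full rank forces x11 = 1, recovering the sensitive column exactly.
```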

Notice that, according to Lemma 3.3, a strong collection obfuscating procedure would not have allowed this identification of individual data, since all row permutations of X1 would remain in the restricted support. There is indeed additional privacy loss when the assumption $n \le p$ is violated. In general, when p < n and $M_R$ is full rank, the linear dependencies among the rows of X revealed by $M_bM_a^{-1} = X_bX_a^{-1}$, along with knowledge of the data type, may leak sensitive information about the original data X.

Appendix B. Proof of Theorem 3.7.

Proof.

The same argument as in the proof of Theorem 3.6 shows that (3.3) holds. Therefore we only need to show that there exist $\tilde{U}$ and $\tilde{B}$ satisfying (3.4): $(PX_1, \tilde{U})\tilde{B} = XB = M_R$.

Using condition (3.5), we have

$$\lambda_{\min}(X_1X_1^T + X_2X_2^T - PX_1X_1^TP^T) \ge \lambda_{\min}(X_1X_1^T) + \lambda_{\min}(X_2X_2^T) - \lambda_{\max}(PX_1X_1^TP^T) \ge \lambda_{\min}(X_2X_2^T) - \lambda_{\max}(X_1X_1^T) > 0.$$

Hence $X_1X_1^T + X_2X_2^T - PX_1X_1^TP^T$ is a positive definite matrix. Therefore, there exists a matrix $\tilde{U}$ such that

$$\tilde{U}\tilde{U}^T = X_1X_1^T + X_2X_2^T - PX_1X_1^TP^T. \tag{B.1}$$

This is equivalent to

$$(PX_1, \tilde{U})(PX_1, \tilde{U})^T = \tilde{U}\tilde{U}^T + PX_1X_1^TP^T = X_1X_1^T + X_2X_2^T = XX^T = M_RM_R^T. \tag{B.2}$$

Now we apply a singular value decomposition $M_R = SDV$, where $S \in \mathcal{O}_n$, $V \in \mathcal{O}_p$ and D is a diagonal matrix with nonincreasing nonnegative diagonal elements. Then, due to (B.2), the singular value decomposition of $(PX_1, \tilde{U})$ is $SD\tilde{V}$ for some $\tilde{V} \in \mathcal{O}_p$. Therefore $\tilde{B} = \tilde{V}^TV$ is an orthogonal matrix satisfying (3.4). □

Appendix C. Bound on condition (3.7).

To achieve condition (3.7), we can specify a noise distribution with sufficiently large values. Let $x_{\max}$ denote the largest possible absolute value of the entries in X1. Then

$$\lambda_{\max}(X_1X_1^T) = \|X_1\|_2^2 \le \|X_1\|_F^2 = \sum_{i=1}^{n}\sum_{j=1}^{p_1} x_{1,ij}^2 \le np_1x_{\max}^2 = C_n,$$

where $\|\cdot\|_2$ and $\|\cdot\|_F$ are the operator norm and the Frobenius norm, respectively. Note that $C_n = np_1x_{\max}^2$ is a constant known to the designer of the TM2 scheme. Condition (3.5) holds when $\lambda_{\min}(X_2X_2^T)$ exceeds this constant.

We require each data provider to generate a $p_2$-dimensional random noise vector with i.i.d. Gaussian elements of mean zero and variance $\sigma^2$. Assuming that $\gamma = p_2/n > 1$, as $n \to \infty$, Corollary 13 in Ledoux et al. [2010] provides a probability bound on $\lambda_{\min}(X_2X_2^T)$ for any $\delta > 0$:

$$\Pr\left[\lambda_{\min}(X_2X_2^T) \le (\sqrt{\gamma}-1)^2 n\sigma^2(1-\delta)\right] \le C_0 e^{-n\delta^{3/2}/C_0},$$

for some constant $C_0$. The bound on the right side decreases exponentially in n, so that, for large n, it can be made smaller than $\epsilon$ for some $\delta < 1$. Choosing $\sigma^2 > C_n/[(\sqrt{\gamma}-1)^2 n(1-\delta)]$ will then ensure that (3.7) holds.
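As a worked example of this choice, treating $x_{\max}$, n, $p_1$, $\gamma$ and $\delta$ as assumed design inputs, the implied lower bound on $\sigma^2$ can be computed directly from the expressions above.

```python
import numpy as np

n, p1 = 1000, 10
x_max = 5.0                     # assumed bound on the |entries| of X1
gamma, delta = 2.0, 0.5         # assumed p2 = gamma * n and slack delta < 1

C_n = n * p1 * x_max**2                          # C_n = n p1 x_max^2
sigma2_min = C_n / ((np.sqrt(gamma) - 1)**2 * n * (1 - delta))
print(sigma2_min)               # any sigma^2 above this satisfies the bound
```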

Appendix D. Proof of Theorem 3.8.

Proof.

We study the posterior density

$$\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R) = \frac{\pi_{(X_1, M_L, M_R)}(x_1, m_L, m_R)}{\int_{S_{X_1}(m_L, m_R)} \pi_{(X_1, M_L, M_R)}(x_1^*, m_L, m_R)\, dx_1^*}, \tag{D.1}$$

and compare it with the prior density $\pi_{X_1}(x_1)$ restricted to the support $S_{X_1}(m_L, m_R)$.

Recall that the probability densities for X1, X2, A and B at values X1 = x1, X2 = x2, A = a and B = b are denoted respectively as πX1(x1), πX2(x2), πA(a) and πB(b). Due to the independence of the generation mechanism of these quantities, their joint density is

$$\pi_{(X_1, X_2, A, B)}(x_1, x_2, a, b) = \pi_{X_1}(x_1)\,\pi_{X_2}(x_2)\,\pi_A(a)\,\pi_B(b), \tag{D.2}$$

for $(x_1, x_2, a, b) \in S_{X_1} \times S_{X_2} \times S_A \times S_B$.

Since the elements of X2 are i.i.d. from the Gaussian distribution N(0, σ2),

$$\pi_{X_2}(x_2) = \frac{1}{(\sqrt{2\pi}\,\sigma)^{np_2}}\, e^{-\frac{\sum_{1 \le i \le n,\, 1 \le j \le p_2} x_{2,ij}^2}{2\sigma^2}} = f(\|x_2\|_F^2),$$

where $f(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^{np_2}} e^{-\frac{x}{2\sigma^2}}$ and $\|x_2\|_F^2 = \sum_{1 \le i \le n,\, 1 \le j \le p_2} x_{2,ij}^2$, with $\|\cdot\|_F$ denoting the Frobenius norm. Thus the joint density becomes

$$\pi_{(X_1, X_2, A, B)}(x_1, x_2, a, b) = \pi_{X_1}(x_1)\, f(\|x_2\|_F^2)\, \pi_A(a)\, \pi_B(b). \tag{D.3}$$

We now plug (D.3) into (D.1) for calculation.

First, we calculate the numerator $\pi_{(X_1, M_L, M_R)}(x_1, m_L, m_R)$ in (D.1). We denote the restricted sample spaces of the random variables A and B, given knowledge of some other quantities, as

$$S_A(x_1, m_L) = \{a : ax_1 = m_L\}, \qquad S_B(x_1, m_R) = \{b : \exists x_2 \text{ such that } (x_1, x_2)b = m_R\}. \tag{D.4}$$

Notice that given A = a and $M_L = m_L$, we have $X_1 = a^Tm_L$. Also, given B = b and $M_R = m_R$, we have $(X_1, X_2) = m_Rb^T$, so that

$$\|X_2\|_F^2 = \operatorname{trace}(X_2X_2^T) = \operatorname{trace}[m_Rb^Tbm_R^T - X_1X_1^T] = \operatorname{trace}(m_Rm_R^T) - \operatorname{trace}(X_1X_1^T).$$

Hence, given A = a, $M_L = m_L$ and B = b, we have

$$\|X_2\|_F^2 = \operatorname{trace}(m_Rm_R^T) - \operatorname{trace}(a^Tm_Lm_L^Ta) = \operatorname{trace}(m_Rm_R^T) - \operatorname{trace}(m_Lm_L^T).$$

Then using this and equation (D.3), we have

$$\begin{aligned}
\pi_{(X_1, M_L, M_R)}(x_1, m_L, m_R) &= \int_{S_A(x_1, m_L)}\left\{\int_{S_B(x_1, m_R)} \pi_{X_1}(x_1)\, f[\operatorname{trace}(m_Rm_R^T)-\operatorname{trace}(m_Lm_L^T)]\, \pi_A(a)\,\pi_B(b)\, db\right\} da\\
&= \pi_{X_1}(x_1)\, f[\operatorname{trace}(m_Rm_R^T)-\operatorname{trace}(m_Lm_L^T)] \int_{S_A(x_1, m_L)}\left\{\int_{S_B(x_1, m_R)} \pi_A(a)\,\pi_B(b)\, db\right\} da\\
&= \pi_{X_1}(x_1)\, f[\operatorname{trace}(m_Rm_R^T)-\operatorname{trace}(m_Lm_L^T)] \int_{S_A(x_1, m_L)}\left[\int_{S_B(x_1, m_R)} \pi_B(b)\, db\right] \pi_A(a)\, da\\
&= \pi_{X_1}(x_1)\, f[\operatorname{trace}(m_Rm_R^T)-\operatorname{trace}(m_Lm_L^T)] \left[\int_{S_A(x_1, m_L)} \pi_A(a)\, da\right]\left[\int_{S_B(x_1, m_R)} \pi_B(b)\, db\right].
\end{aligned} \tag{D.5}$$

Now, for any pair $x_1$ and $x_1^*$ that both belong to $S_{X_1}(m_L, m_R)$, there exist $(a, a^*)$ such that $ax_1 = m_L = a^*x_1^*$. Denote $A_0 = a^{-1}a^*$. Then $A_0x_1^* = a^{-1}m_L = x_1$ and $A_0^{-1}x_1 = x_1^*$. Hence for any $\tilde{a} \in S_A(x_1, m_L)$ we have

$$\tilde{a}A_0x_1^* = \tilde{a}x_1 = m_L,$$

i.e., $\tilde{a}A_0 \in S_A(x_1^*, m_L)$. On the other hand, for any $\bar{a} \in S_A(x_1^*, m_L)$, $\bar{a}A_0^{-1}x_1 = \bar{a}x_1^* = m_L$, i.e., $\bar{a}A_0^{-1} \in S_A(x_1, m_L)$. Taken together, we have a one-to-one mapping between the two sets $S_A(x_1, m_L)$ and $S_A(x_1^*, m_L)$. In particular,

$$S_A(x_1^*, m_L) = S_A(x_1, m_L)A_0. \tag{D.6}$$

Hence, for the uniform density $\pi_A = \pi_0$, (D.6) and Lemma 3.4 imply that

$$\int_{S_A(x_1, m_L)} \pi_A(a)\, da = \int_{S_A(x_1^*, m_L)} \pi_A(a)\, da. \tag{D.7}$$

Plugging (D.5) and (D.7) into (D.1) and canceling the common factors, we get

$$\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R) = \frac{\pi_{X_1}(x_1)\left[\int_{S_B(x_1, m_R)} \pi_B(b)\, db\right]}{\int_{S_{X_1}(m_L, m_R)} \pi_{X_1}(x_1^*)\left[\int_{S_B(x_1^*, m_R)} \pi_B(b)\, db\right] dx_1^*}. \tag{D.8}$$

Next, for any pair $x_1$ and $x_1^*$ that both belong to $S_{X_1}(m_L, m_R)$, there exist $(x_2, b)$ and $(x_2^*, b^*)$ such that $(x_1, x_2)b = m_R = (x_1^*, x_2^*)b^*$. Let $B_0 = b(b^*)^{-1}$. Then $(x_1, x_2)B_0 = (x_1^*, x_2^*)$. Similarly to (D.6), we have

$$S_B(x_1, m_R) = B_0S_B(x_1^*, m_R). \tag{D.9}$$

For the uniform density $\pi_B = \pi_0$, using (D.9), Lemma 3.4 implies that

$$\int_{S_B(x_1, m_R)} \pi_B(b)\, db = \int_{S_B(x_1^*, m_R)} \pi_B(b)\, db. \tag{D.10}$$

Plugging (D.10) into equation (D.8), we have

$$\pi_{X_1 \mid (M_L, M_R)}(x_1 \mid m_L, m_R) = \frac{\pi_{X_1}(x_1)}{\int_{S_{X_1}(m_L, m_R)} \pi_{X_1}(x_1^*)\, dx_1^*},$$

which is exactly $\pi_{X_1 \mid S_{X_1}(m_L, m_R)}(x_1)$, proving (3.6). □

Contributor Information

A. ADAM DING, Department of Mathematics, Northeastern University, Boston, MA.

GUANHONG MIAO, Department of Biostatistics, University of Florida, Gainesville, FL.

SAMUEL S. WU, Department of Biostatistics, University of Florida, Gainesville, FL.

References

  1. Anderson TW, Olkin I, and Underhill LG. Generation of random orthogonal matrices. SIAM Journal on Scientific and Statistical Computing, 8(4):625–629, 1987. doi: 10.1137/0908055.
  2. Blair G, Imai K, and Zhou Y. Design and analysis of the randomized response technique. Journal of the American Statistical Association, 110(511):1304–1319, 2015.
  3. Brand R. Microdata protection through noise addition. In Inference Control in Statistical Databases, pages 97–116. Springer, 2002.
  4. Burridge J. Information preserving statistical obfuscation. Statistics and Computing, 13(4):321–327, 2003.
  5. Cormode G, Jha S, Kulkarni T, Li N, Srivastava D, and Wang T. Privacy at scale: Local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data, pages 1655–1658. ACM, 2018.
  6. Ding B, Kulkarni J, and Yekhanin S. Collecting telemetry data privately. In Advances in Neural Information Processing Systems, pages 3571–3580, 2017.
  7. Drechsler J and Reiter JP. Sampling with synthesis: A new approach for releasing public use census microdata. Journal of the American Statistical Association, 105(492):1347–1357, 2010.
  8. Duchi JC, Jordan MI, and Wainwright MJ. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, (accepted), 2017.
  9. Dwork C. Differential privacy. In Bugliesi M, Preneel B, Sassone V, and Wegener I, editors, ICALP (2), Lecture Notes in Computer Science, 4052:1–12, 2006.
  10. Dwork C and Naor M. On the difficulties of disclosure prevention in statistical databases or the case for differential privacy. Journal of Privacy and Confidentiality, 2(1):8, 2008.
  11. Dwork C, McSherry F, Nissim K, and Smith A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag. doi: 10.1007/11681878_14.
  12. Dwork C, Naor M, Pitassi T, Rothblum GN, and Yekhanin S. Pan-private streaming algorithms. In ICS, pages 66–80, 2010.
  13. Erlingsson Ú, Pihur V, and Korolova A. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067. ACM, 2014.
  14. Evfimievski A, Gehrke J, and Srikant R. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '03, pages 211–222, New York, NY, USA, 2003. ACM. doi: 10.1145/773153.773174.
  15. Fung BCM, Wang K, Chen R, and Yu PS. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42(4):14:1–14:53, June 2010. doi: 10.1145/1749603.1749605.
  16. Heiberger RM. Algorithm AS 127: Generation of random orthogonal matrices. Journal of the Royal Statistical Society, Series C (Applied Statistics), 27(2):199–206, 1978. URL http://www.jstor.org/stable/2346957.
  17. Huffington Post. Citigroup: $2.7 Million Stolen From Customers As Result Of Hacking. http://www.huffingtonpost.com/2011/06/27/citigroup-hack_n_885045.html. 2011.
  18. Kasiviswanathan SP, Lee HK, Nissim K, Raskhodnikova S, and Smith A. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, June 2011. doi: 10.1137/090756090.
  19. Ledoux M, Rider B, et al. Small deviations for beta ensembles. Electronic Journal of Probability, 15:1319–1343, 2010.
  20. Li N, Li T, and Venkatasubramanian S. t-closeness: Privacy beyond k-anonymity and l-diversity. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 106–115. IEEE, 2007.
  21. Liu K, Kargupta H, and Ryan J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering, 18(1):92–106, 2006.
  22. Machanavajjhala A, Kifer D, Gehrke J, and Venkitasubramaniam M. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3, 2007.
  23. Reuters. Target To Pay $10 Million To Settle Lawsuit From Massive Data Breach. http://www.huffingtonpost.com/2015/03/18/target-hack-settlement_n_6899290.html. 2015.
  24. Reuters. Equifax Says Hack Potentially Exposed Details Of 143 Million Consumers. https://www.huffingtonpost.com/entry/equifax-says-hack-potentially-exposed-details-of-143-million-consumers_us_59b1bc2de4b0354e4410b33e. 2017.
  25. Rubin DB. Satisfying confidentiality constraints through the use of synthetic multiply imputed microdata. Journal of Official Statistics, 9:461–468, 1993.
  26. Sweeney L. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
  27. Thakurta AG, Vyrros AH, Vaishampayan US, Kapoor G, Freudiger J, Sridhar VR, and Davidson D. Learning new words. US Patent 9594741B1, March 2017.
  28. Ting D, Fienberg SE, and Trottini M. Random orthogonal matrix masking methodology for microdata release. International Journal of Information and Computer Security, 2(1):86–105, 2008.
  29. Warner SL. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
  30. Wu SS, Chen S, Bhattacharjee A, and He Y. Collusion resistant multi-matrix masking for privacy-preserving data collection. In IEEE 3rd International Conference on Big Data Security on Cloud (BigDataSecurity), pages 1–7. IEEE, 2017a.
  31. Wu SS, Chen S, Burr DL, and Zhang L. A new data collection technique for preserving privacy. Journal of Privacy and Confidentiality, 7(3):99–129, 2017b.
  32. Zhang L. On Security Properties of Random Matrix Masking. PhD thesis, University of Florida, January 2014. URL http://search.proquest.com/docview/1876889174/.
