Abstract
To preserve privacy, the original data points (with exact values) are replaced by boxes containing each (inaccessible) data point. This privacy-motivated uncertainty leads to uncertainty in the statistical characteristics computed based on this data. In a previous paper, we described how to minimize this uncertainty under the assumption that we use the same standard statistical estimates for the desired characteristics. In this paper, we show that we can further decrease the resulting uncertainty if we allow fuzzy-motivated weighted estimates, and we explain how to optimally select the corresponding weights.
I. Formulation of the Problem
Need to preserve privacy
In many practical applications, e.g., in medicine and in education, to better serve customers, it is important to know as much as possible about the potential customers. Customers are often reluctant to share information, since this information can be potentially used against them. For example, age can be used by companies to (unlawfully) discriminate against older job applicants. It is thus important to preserve privacy when storing customer data; see, e.g., [6].
How to preserve privacy: k-anonymity and ℓ-diversity
To maintain privacy, we divide the space of all possible combinations of values x = (x1, …, xn) into boxes
$$B = [\underline{x}_1, \overline{x}_1] \times \cdots \times [\underline{x}_n, \overline{x}_n]. \tag{1}$$
For each record, instead of storing the actual values xi, we only store the label of the box B containing x.
To avoid further loss of privacy, it is important to make sure that location in a box does not identify a person. This is usually achieved by requiring that for some fixed integer k, each box contains at least k records.
It is also not good if all records within a box have the same value of an i-th quantity xi. It is thus required that for some integer ℓ, each box should contain at least ℓ different values of each xi; see, e.g., [1].
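As an illustration of the two requirements above, the following minimal Python sketch checks whether a given assignment of records to boxes satisfies k-anonymity and ℓ-diversity. The function name `check_boxes`, the grid-based box labels, and the sample records are hypothetical; "different values" is taken here to mean exactly distinct values, which is a simplification.

```python
from collections import defaultdict

def check_boxes(records, box_of, k, ell):
    """Return True if every non-empty box holds at least k records and,
    for every attribute, at least ell distinct values (assumed meaning of 'different')."""
    groups = defaultdict(list)
    for rec in records:
        groups[box_of(rec)].append(rec)
    for members in groups.values():
        if len(members) < k:          # k-anonymity violated
            return False
        for i in range(len(members[0])):
            if len({rec[i] for rec in members}) < ell:  # ell-diversity violated
                return False
    return True

# Hypothetical data: two attributes per record.
records = [(21, 50), (23, 55), (27, 58), (41, 90), (44, 93), (48, 97)]
# Hypothetical boxes: grid cells 20 units wide in attribute 0, 50 units wide in attribute 1.
grid = lambda rec: (rec[0] // 20, rec[1] // 50)

print(check_boxes(records, grid, k=3, ell=2))
```

With these sample records, each of the two grid cells holds three records, so the check passes for k = 3 and fails for k = 4.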
Statistical data processing
Based on the available data points $x^{(p)} = (x_1^{(p)}, \ldots, x_n^{(p)})$ (1 ≤ p ≤ N), we need to estimate averages Ei, variances Vi, covariances Cij, correlations ρij, and other statistical characteristics. The means are usually estimated as follows:
$$E_i = \frac{1}{N} \cdot \sum_{p=1}^{N} x_i^{(p)}. \tag{2}$$
The covariance is usually estimated as:
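$$C_{ij} = \frac{1}{N} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right) \cdot \left(x_j^{(p)} - E_j\right). \tag{3}$$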
The variance is usually estimated by a formula
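$$V_i = \frac{1}{N-1} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right)^2 \tag{4}$$
or, sometimes,
$$V_i = \frac{1}{N} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right)^2, \tag{5}$$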
and the correlation is estimated as
$$\rho_{ij} = \frac{C_{ij}}{\sigma_i \cdot \sigma_j}, \quad \text{where } \sigma_i = \sqrt{V_i}. \tag{6}$$
Comment. We are interested in large databases, in which the number N of records is large. For large N, the difference between the usual unbiased estimate for Vi (with N − 1 in the denominator) and the estimate with N is negligible. To simplify computations, in this paper, by Vi and σi, we will mean the versions corresponding to N; our results can easily be reformulated for the unbiased estimates, which in our terms take the form $\frac{N}{N-1} \cdot V_i$ and $\sqrt{\frac{N}{N-1}} \cdot \sigma_i$.
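The estimates discussed above can be sketched in a few lines of Python. This is a minimal illustration that uses N in all denominators, as in the comment above; the function names and the sample data are hypothetical.

```python
import math

def mean(data, i):
    """Average of the i-th quantity over all records."""
    return sum(x[i] for x in data) / len(data)

def covariance(data, i, j):
    """Covariance estimate with N (not N - 1) in the denominator."""
    ei, ej = mean(data, i), mean(data, j)
    return sum((x[i] - ei) * (x[j] - ej) for x in data) / len(data)

def variance(data, i):
    """Variance is the covariance of a quantity with itself."""
    return covariance(data, i, i)

def correlation(data, i, j):
    """Correlation as covariance divided by the product of standard deviations."""
    return covariance(data, i, j) / math.sqrt(variance(data, i) * variance(data, j))

# Hypothetical records with two quantities; the second is twice the first.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
print(mean(data, 0), variance(data, 0), covariance(data, 0, 1), correlation(data, 0, 1))
```

Multiplying `variance` by N / (N − 1) would give the unbiased version mentioned in the comment.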
In statistical data processing, privacy leads to uncertainty
To maintain privacy, we replace each numerical value with the corresponding interval. Different values from these intervals lead, in general, to different values of the resulting statistical characteristics. Hence, for each characteristic, we get a whole interval of possible values.
If this interval is too wide, the resulting range is useless (e.g., for correlation, the interval [−1, 1] is useless). It is therefore desirable to select, among all possible subdivisions into boxes which preserve k-anonymity (and ℓ-diversity), the one which leads to the narrowest intervals for the desired statistical characteristic.
What we do in this paper
In Section 2, following [7], we describe how this problem is solved now. Please note that because our objective is to generalize these formulas to the weighted case, the notations that we use in Section 2 are slightly different from the notations from [7].
Then, in Section 3, we explain how fuzzy-motivated ideas can improve the corresponding estimates.
II. How This Problem Is Solved Now
Estimating accuracy caused by privacy-based subdivision into boxes: case of k-anonymity
To minimize uncertainty, we select the smallest boxes. Hence, each box B should have exactly k records.
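For each combination of values $x_i^{(p)}$ from the corresponding intervals $[\tilde{x}_i^{(p)} - \Delta_i(x^{(p)}), \tilde{x}_i^{(p)} + \Delta_i(x^{(p)})]$, where $\tilde{x}_i^{(p)}$ is the midpoint of the interval and $\Delta_i(x^{(p)})$ is its half-width, we get:
$$C = C\left(\tilde{x}^{(1)} + \Delta x^{(1)}, \ldots, \tilde{x}^{(N)} + \Delta x^{(N)}\right), \tag{7}$$
where each difference $\Delta x_i^{(p)} = x_i^{(p)} - \tilde{x}_i^{(p)}$ satisfies the inequality $|\Delta x_i^{(p)}| \le \Delta_i(x^{(p)})$. When we have many records, boxes are small, so we can use a linear approximation:
$$C = \tilde{C} + \sum_{p=1}^{N} \sum_{i=1}^{n} \frac{\partial C}{\partial x_i^{(p)}} \cdot \Delta x_i^{(p)}, \tag{8}$$
where $\tilde{C} = C(\tilde{x}^{(1)}, \ldots, \tilde{x}^{(N)})$. The range of this linear expression is [C̃ − Δ, C̃ + Δ], where
$$\Delta = \sum_{p=1}^{N} \sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i^{(p)}}\right| \cdot \Delta_i(x^{(p)}). \tag{9}$$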
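The half-width Δ of the linearized range is a sum of absolute partial derivatives times box half-widths. A minimal Python sketch of this computation follows, applied to the mean of a single quantity, for which every partial derivative equals 1/N. The function name and the sample half-widths are hypothetical.

```python
def linearized_half_width(grads, half_widths):
    """Sum of |dC/dx| * (half-width) over all inputs: the half-width of the linearized range."""
    return sum(abs(g) * d for g, d in zip(grads, half_widths))

# Hypothetical example: N = 4 records, one quantity, characteristic C = mean.
N = 4
half_widths = [0.5, 0.5, 1.0, 2.0]   # half-widths of the four privacy intervals
grads = [1.0 / N] * N                # derivative of the mean with respect to each record
delta = linearized_half_width(grads, half_widths)
print(delta)
```

For the mean, which is linear in the data, this value is the exact half-width of the range: the average of the individual half-widths.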
Expressions for the corresponding partial derivatives
The estimate for the accuracy Δ is described in terms of partial derivatives of the statistical characteristic C. For the mean Ei, the derivative is equal to
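$$\frac{\partial E_i}{\partial x_i^{(p)}} = \frac{1}{N}. \tag{10}$$
For the variance Vi, we have
$$\frac{\partial V_i}{\partial x_i^{(p)}} = \frac{2}{N} \cdot \left(x_i^{(p)} - E_i\right). \tag{11}$$
Therefore, for $\sigma_i = \sqrt{V_i}$, we get
$$\frac{\partial \sigma_i}{\partial x_i^{(p)}} = \frac{1}{2\sigma_i} \cdot \frac{\partial V_i}{\partial x_i^{(p)}} = \frac{1}{N \cdot \sigma_i} \cdot \left(x_i^{(p)} - E_i\right). \tag{12}$$
For the covariance Cij, we have
$$\frac{\partial C_{ij}}{\partial x_i^{(p)}} = \frac{1}{N} \cdot \left(x_j^{(p)} - E_j\right). \tag{13}$$
For the correlation ρij, we have:
$$\frac{\partial \rho_{ij}}{\partial x_i^{(p)}} = \frac{1}{N \cdot \sigma_i \cdot \sigma_j} \cdot \left[\left(x_j^{(p)} - E_j\right) - \rho_{ij} \cdot \frac{\sigma_j}{\sigma_i} \cdot \left(x_i^{(p)} - E_i\right)\right]. \tag{14}$$
For all these characteristics C, the derivative takes the form
$$\frac{\partial C}{\partial x_i^{(p)}} = \frac{1}{N} \cdot b_i\left(x^{(p)}\right) \tag{15}$$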
for some expression bi(x).
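Derivative expressions of this kind are easy to get wrong, so a numerical check can be useful. The sketch below compares analytic derivatives of the variance and of the correlation (with N in the denominators, as assumed throughout) against central finite differences. All function names and the sample data are hypothetical.

```python
import math

def variance(xs):
    n = len(xs)
    e = sum(xs) / n
    return sum((x - e) ** 2 for x in xs) / n

def rho(xs, ys):
    n = len(xs)
    ex, ey = sum(xs) / n, sum(ys) / n
    cxy = sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / n
    return cxy / math.sqrt(variance(xs) * variance(ys))

def d_var_dx(xs, p):
    """Analytic derivative of the variance with respect to xs[p]."""
    return 2.0 / len(xs) * (xs[p] - sum(xs) / len(xs))

def d_rho_dx(xs, ys, p):
    """Analytic derivative of the correlation with respect to xs[p]."""
    n = len(xs)
    ex, ey = sum(xs) / n, sum(ys) / n
    sx, sy = math.sqrt(variance(xs)), math.sqrt(variance(ys))
    r = rho(xs, ys)
    return ((ys[p] - ey) - r * (sy / sx) * (xs[p] - ex)) / (n * sx * sy)

def finite_difference(f, xs, p, h=1e-6):
    """Central finite-difference approximation of df/dxs[p]."""
    up = xs[:p] + [xs[p] + h] + xs[p + 1:]
    down = xs[:p] + [xs[p] - h] + xs[p + 1:]
    return (f(up) - f(down)) / (2 * h)

# Hypothetical sample data for two quantities.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 6.0]
for p in range(len(xs)):
    print(p, d_rho_dx(xs, ys, p), finite_difference(lambda v: rho(v, ys), xs, p))
```

The two printed columns should agree up to the finite-difference error.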
Towards optimal subdivision into boxes
The overall expression for Δ is a sum of terms corresponding to different points. So, to minimize Δ, we must, for each point, minimize the corresponding term
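$$\sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i}\right| \cdot \Delta_i(x). \tag{16}$$
Because of the relation between the partial derivatives and bi(x), this minimization is equivalent to minimizing the term $\sum_{i=1}^{n} a_i(x) \cdot \Delta_i(x)$, where we denoted $a_i(x) = |b_i(x)|$.
The only constraint on the values Δi(x) is that the corresponding box should contain exactly k different points. The number of points can be obtained by multiplying the data density ρ(x) by the box volume $\prod_{i=1}^{n} (2\Delta_i(x))$. The data density can be estimated based on the data. So, we minimize the expression
$$\sum_{i=1}^{n} a_i(x) \cdot \Delta_i(x) \tag{17}$$
under the constraint
$$\rho(x) \cdot \prod_{i=1}^{n} (2\Delta_i(x)) = k. \tag{18}$$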
(Asymptotically) optimal subdivision into boxes (case of k-anonymity)
The Lagrange multiplier technique leads to
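$$\Delta_i(x) = \frac{c(x)}{a_i(x)} \tag{19}$$
for some c(x). From the constraint (18), we get
$$c(x) = \frac{1}{2} \cdot \left(\frac{k \cdot \prod_{j=1}^{n} a_j(x)}{\rho(x)}\right)^{1/n}. \tag{20}$$
This means that around each point x, we need to select the box with half-widths
$$\Delta_i(x) = \frac{1}{2 \cdot a_i(x)} \cdot \left(\frac{k \cdot \prod_{j=1}^{n} a_j(x)}{\rho(x)}\right)^{1/n}. \tag{21}$$
The resulting accuracy is equal to
$$\Delta = \frac{n}{N} \cdot \sum_{x} c(x), \tag{22}$$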
where the sum is taken over all N data points x.
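The box selection described above can be sketched as follows. The sketch assumes that the Lagrange condition makes every product a_i(x) · Δ_i(x) equal to the same value c(x), and that the box (of widths 2Δ_i) must contain exactly k points at density ρ(x). The function name, argument names, and numbers are hypothetical.

```python
import math

def optimal_half_widths(a, density, k):
    """Half-widths minimizing sum(a_i * D_i) subject to density * prod(2 * D_i) = k.

    a: positive coefficients a_i(x); density: estimated rho(x); k: records per box.
    Returns (c, half_widths), where a_i * D_i = c for every i.
    """
    n = len(a)
    c = 0.5 * (k * math.prod(a) / density) ** (1.0 / n)
    return c, [c / ai for ai in a]

# Hypothetical point: two quantities, the second one four times as influential.
c, half = optimal_half_widths([1.0, 4.0], density=25.0, k=4)
print(c, half)
```

The more influential a quantity is (larger a_i), the narrower the box in that direction; each point then contributes n · c(x) to the sum being minimized.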
We need to dismiss rare points
In many practical situations, we have rare points, for which the smallest box containing k of them is huge. Such a big-size box will contribute a large amount of uncertainty to Δ; so we should dismiss such rare points.
If we select a subset S ⊂ {1, 2, …, N} of the set of N original points, then the privacy-related uncertainty reduces to
$$\Delta = \frac{n}{\#(S)} \cdot \sum_{x \in S} c(x), \tag{23}$$
where #(S) denotes the number of points in the set S. The statistical accuracy reduces to
| (24) |
(see, e.g., [5]). Minimizing the sum
| (25) |
leads to selecting all x with c(x) ≤ c0, where c0 minimizes the sum
| (26) |
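The threshold selection can be sketched as a scan over candidate values c0. The sketch below assumes that the privacy-related part equals n times the average of c(x) over the kept points, and that the statistical part has the form s / √#(S) with a constant `s` that is a placeholder for whatever coefficient the statistical accuracy carries. Both the function name and this form of the objective are assumptions for illustration.

```python
import math

def best_threshold(c_values, n, s):
    """Scan thresholds c0 (taken from the data) and return (total, c0, kept_count)
    minimizing n * mean(c over kept points) + s / sqrt(kept_count).

    s is an assumed constant describing the statistical part of the inaccuracy.
    Ties between equal c values are not treated specially in this sketch.
    """
    ordered = sorted(c_values)
    best = None
    running = 0.0
    for m, c0 in enumerate(ordered, start=1):
        running += c0                      # sum of the m smallest c values
        total = n * running / m + s / math.sqrt(m)
        if best is None or total < best[0]:
            best = (total, c0, m)
    return best

# Hypothetical values of c(x): three typical points and one rare point.
best = best_threshold([0.1, 0.2, 0.3, 5.0], n=2, s=1.0)
print(best)
```

In this example the rare point with c(x) = 5.0 is dismissed, since keeping it would increase the privacy-related part far more than it would reduce the statistical part.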
Examples
For estimating the mean Ei, we have ai(x) = const and thus,
$$c(x) = \text{const} \cdot (\rho(x))^{-1/n}. \tag{27}$$
In this case, c(x) is a decreasing function of density. So, dismissing points with c(x) > c0 is equivalent to dismissing all the points with ρ(x) < ρ0 (for some ρ0).
For computing covariance Cij, the derivative is proportional to xi − Ei. Thus, the values ai(x) are proportional to |xi − Ei|. So, the upper threshold c0 on c(x) is equivalent to the lower threshold on the ratio
$$\frac{\rho(x)}{|x_i - E_i| \cdot |x_j - E_j|}. \tag{28}$$
Hence, we can also use points x with small ρ(x), provided that xi or xj is close to the corresponding mean. Using extra points x improves accuracy.
How to also take into account ℓ-diversity
Up to now, we only took into account the k-anonymity requirement. We also need to take into account that within each box, for each variable xi, there are ≥ ℓ different values of xi. To formalize this requirement, we first need to describe what “different” means.
Usually, for each variable i, different means that
$$|x_i - x_i'| \ge \varepsilon_i \tag{29}$$
for some threshold εi. Thus, ℓ different values means that 2Δi(x) ≥ ℓ · εi. So, the problem is to find Δi(x) such that
| (30) |
under the constraints
| (31) |
and
| (32) |
for all i.
According to [7], the solution to this optimization problem is as follows: around each point x, we first compute the values
| (33) |
If 2Δi(x) ≥ ℓ · εi for all i, we select Δi(x). Otherwise, we sort the quantities by ai(x) · εi:
| (34) |
Then, for each t from 1 to n, we compute
| (35) |
For each t, if , we compute
| (36) |
We select t for which Δ(t) is the smallest, and take:
for i ≤ t, and
for i > t.
Comment. The computation time of this algorithm is quadratic in n. This is OK, since the number n of different characteristics is usually reasonably small. What is important is that the algorithm is still linear-time in terms of the number of records N.
III. Fuzzy-Motivated Idea
Main idea
In [7], to improve the accuracy of the resulting estimate, we propose to ignore some data points while keeping other data points. In other words, we propose a crisp separation between data points that we keep and data points that we ignore. Fuzzy logic has taught us that in many cases, it is beneficial to replace such a crisp separation with a “fuzzy” one in which, instead of ignoring or keeping a data point, we take a data point with a certain degree; see, e.g., [2], [4], [8].
Implementing the idea
Specifically, instead of using the above formula for computing the statistical characteristics, in which all data points are treated equally, we assign a weight w(x) ≥ 0 to each data point so that ∑w(x) = 1, and use the weighted estimates for all the statistical characteristics:
$$E_i = \sum_{x} w(x) \cdot x_i, \tag{37}$$
$$C_{ij} = \sum_{x} w(x) \cdot (x_i - E_i) \cdot (x_j - E_j). \tag{38}$$
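A minimal Python sketch of such weighted estimates follows; the weights are assumed to be non-negative and to add up to 1. The function names and sample data are hypothetical.

```python
def weighted_mean(data, w, i):
    """Weighted estimate of the mean of the i-th quantity; weights sum to 1."""
    return sum(wp * x[i] for wp, x in zip(w, data))

def weighted_cov(data, w, i, j):
    """Weighted estimate of the covariance of the i-th and j-th quantities."""
    ei, ej = weighted_mean(data, w, i), weighted_mean(data, w, j)
    return sum(wp * (x[i] - ei) * (x[j] - ej) for wp, x in zip(w, data))

# Hypothetical records; the last one is a rare point that receives weight 0.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (100.0, -50.0)]
w = [1 / 3, 1 / 3, 1 / 3, 0.0]
print(weighted_mean(data, w, 0), weighted_cov(data, w, 0, 1))
```

With equal weights 1/N, these functions reduce to the usual estimates with N in the denominator; with a weight of 0, a point is ignored entirely, which recovers the crisp "keep or dismiss" selection as a special case.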
Optimization problem
Our objective is to find the weights w(x) for which the resulting uncertainty is the smallest possible. Similarly to the crisp case, this uncertainty consists of two parts: the part coming from the privacy-motivated uncertainty and the part coming from the fact that the sample size is finite.
One can check that for privacy-motivated uncertainty, the corresponding derivatives are proportional to the weight w(x). For each box, we thus face the exact same optimization problem for finding the best sizes Δi(x) of the corresponding privacy-related box. As a result, for the overall privacy-motivated uncertainty, we get the expression $n \cdot \sum_{x} w(x) \cdot c(x)$.
For the statistical part: if we simply estimate the variance of the estimate for the mean Ei = ∑w(x) · xi, then, due to the fact that the variance of the sum of independent variables is equal to the sum of the variances, we conclude that the variance of this estimate is proportional to ∑ w²(x); see, e.g., [5]. Thus, the standard deviation of this estimate is proportional to
$$\sqrt{\sum_{x} w^2(x)}. \tag{39}$$
For the traditional equal-weight estimate, when
$$w(x) = \frac{1}{N} \tag{40}$$
for all x, the proportionality coefficient becomes equal to the expression
$$\frac{1}{\sqrt{N}} \tag{41}$$
that we used in Section 2.
One can check that, similarly, estimates for the accuracy of other statistical characteristics can be obtained from the estimates provided in Section 2 by replacing $1/\sqrt{N}$ with $\sqrt{\sum_{x} w^2(x)}$, i.e., this part is equal to
| (42) |
Thus, to minimize the overall inaccuracy, we need to minimize the following sum:
| (43) |
under the constraints $\sum_{x} w(x) = 1$ and w(x) ≥ 0.
Solving the resulting optimization problem: general idea
By applying the Lagrange multiplier method to the above constraint optimization problem, we can reduce this problem to the following unconstrained optimization problem:
| (44) |
for an appropriate Lagrange multiplier λ. Differentiating this objective function with respect to w(x) and equating the derivative to 0, we conclude that
| (45) |
i.e., that
| (46) |
To be more precise, since we require that w(x) ≥ 0, this formula only holds when n · c(x) ≤ λ; when n · c(x) > λ, we should get w(x) = 0.
Towards computing the auxiliary parameter λ
How can we find λ? By squaring both sides of this formula, we get
| (47) |
By adding left- and right-hand sides corresponding to different x, we get
| (48) |
Dividing both sides of this equality by $\sum_{x} w^2(x)$, we conclude that
| (49) |
i.e., that
| (50) |
This is a quadratic equation in terms of λ, namely:
| (51) |
where Ñ is the total number of points that we did not dismiss, i.e., for which n · c(x) < λ, and the sums are taken over all such points.
From this quadratic equation, we can find λ. Thus, we naturally arrive at the following iterative algorithm for computing λ.
Iterative algorithm for computing the auxiliary parameter λ
The goal of this algorithm is to find the threshold value λ, so that points x for which n · c(x) ≥ λ will be dismissed from our estimates (i.e., we would have w(x) = 0 for such points).
In the beginning, we do not have any reason to dismiss any values, so we start with the first approximation λ0.
On each iteration k, we start with the value λk−1 obtained on the previous iteration, and compute the next approximation λk as follows.
First, we compute the total number Ñ of points x for which n · c(x) < λk−1.
Then, we compute the sums $\sum c(x)$ and $\sum c^2(x)$ over all such points.
Based on these values, we solve the quadratic equation (51) and find the next approximation λk.
We stop iterations when the process converges, i.e., when
| (52) |
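The iteration can be sketched in Python as follows. Several details are assumptions made for illustration only: the statistical part of the objective is taken to be s · √(∑ w²(x)) with a placeholder constant `s`, so that the condition on λ reads ∑ (λ − n · c(x))² = s² over the retained points; the starting value is taken to be infinite (no point dismissed); the larger root of the quadratic is used, since the weights can only be normalized to add up to 1 when λ exceeds the average of n · c(x) over the retained points; and the process is stopped when λ no longer changes.

```python
import math

def find_lambda(c_values, n, s, tol=1e-12, max_iter=100):
    """Iteratively solve sum over kept points of (lam - n*c)^2 = s^2,
    where a point is kept when n*c < lam. s is an assumed constant."""
    lam = math.inf                       # assumed start: no point is dismissed
    for _ in range(max_iter):
        kept = [c for c in c_values if n * c < lam]
        m = len(kept)
        if m == 0:
            raise ValueError("no points retained")
        s1 = sum(kept)
        s2 = sum(c * c for c in kept)
        # Quadratic: m*lam^2 - 2*n*s1*lam + (n^2*s2 - s^2) = 0 (quarter-discriminant below)
        disc = (n * s1) ** 2 - m * (n * n * s2 - s * s)
        if disc < 0:
            raise ValueError("quadratic has no real root for the current set of points")
        new_lam = (n * s1 + math.sqrt(disc)) / m   # larger root
        if abs(new_lam - lam) <= tol:
            return new_lam
        lam = new_lam
    return lam

# Hypothetical values of c(x) with n = 2 quantities and s = 2.
c_values = [0.1, 0.2, 0.3, 0.9]
lam = find_lambda(c_values, n=2, s=2.0)
print(lam)
```

In this sketch the quadratic may have no real root when the retained values c(x) are widely spread compared with s; the code then raises an error instead of guessing how to proceed.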
Towards computing w(x)
We know, from the formula (46), that for those points for which n · c(x) < λ, we have
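$$w(x) = K \cdot (\lambda - n \cdot c(x)) \tag{53}$$
for some constant K. To find K, we can use the fact that $\sum_{x} w(x) = 1$. Substituting the expression (53) into this constraint, we conclude that
$$K \cdot \left(\tilde{N} \cdot \lambda - n \cdot \sum_{x} c(x)\right) = 1. \tag{54}$$
Since we have already computed the values Ñ, λ, and $\sum c(x)$ when we computed λ, we can thus compute K.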
So, we arrive at the following formula for computing the desired weights.
Formula for computing the optimal weights w(x)
By running the above iterative algorithm, we have computed the auxiliary value λ. In the process of computing λ, we have computed the values Ñ and $\sum c(x)$, where the sum is taken over all the points x for which n · c(x) < λ.
Now, we compute
$$K = \frac{1}{\tilde{N} \cdot \lambda - n \cdot \sum_{x} c(x)}. \tag{55}$$
The optimal weights can now be computed as follows:
- when n · c(x) ≥ λ, the optimal weight is w(x) = 0;
- when n · c(x) < λ, the optimal weight is equal to
$$w(x) = K \cdot (\lambda - n \cdot c(x)). \tag{56}$$
Comment. As expected, the larger the uncertainty contribution c(x) from a point, the smaller the weight with which we take this point. When this contribution is large enough (i.e., larger than the threshold determined by the auxiliary parameter λ), we completely ignore such points.
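Once λ is known, the weights follow directly. A minimal Python sketch, assuming that the weight of a retained point is proportional to λ − n · c(x) and that the weights are normalized to add up to 1; the function name and the numbers are hypothetical.

```python
def optimal_weights(c_values, n, lam):
    """Weights proportional to (lam - n*c) for points with n*c < lam, and 0 otherwise."""
    raw = [max(lam - n * c, 0.0) for c in c_values]
    total = sum(raw)        # equals N~ * lam - n * (sum of c over retained points)
    if total <= 0:
        raise ValueError("no point has n*c below lam")
    K = 1.0 / total         # normalization constant
    return [K * r for r in raw]

# Hypothetical values: n = 2 quantities, threshold lam = 1.5.
weights = optimal_weights([0.1, 0.2, 0.3, 0.9], n=2, lam=1.5)
print(weights)
```

Here the last point has n · c(x) = 1.8 ≥ λ and therefore receives weight 0, while the remaining weights decrease as c(x) grows.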
IV. Boxes Appropriate for Several Different Characteristics
What we provided before
In the previous sections, we described how, for each statistical characteristic C, we can find the boxes (i.e., the data anonymization) that lead to the most accurate estimate of this selected characteristic.
Remaining problem
In practice, we may need to compute the values of different statistical characteristics. The problem is that optimal boxes corresponding to different characteristics C are, in general, different.
For example, boxes that lead to the most accurate estimates Ẽ of the mean E may lead to very inaccurate estimates C̃ij of the covariance Cij, and vice versa.
Towards a possible solution to this problem
Based on the previous experience, we know how many times users were looking for values of different statistical characteristics; in other words, we know the probabilities of looking for different characteristics C.
We also know what accuracy is desirable for estimating each characteristic C. For example, we may fix the same relative error for all estimates, and take, e.g., if this relative error is 10%. Then, for each characteristic C, the accuracy of estimating this characteristic is better gauged not by the absolute accuracy ΔC but rather by the ratio
| (57) |
describing how close we are to the desired accuracy.
In this situation, a reasonable idea is to minimize average quality
| (58) |
Towards an algorithm
How can we solve the corresponding optimization problem? The objective function q has the form
| (59) |
i.e., the form
| (60) |
By changing the order of summation, we get an equivalent formula
| (61) |
This optimization problem is similar to the optimization problem corresponding to the case of a single statistical characteristic C, with the only difference that instead of the original partial derivatives , we use a weighted combination
| (62) |
of these derivatives.
In terms of the coefficients ai(x) introduced in Section 2, this means that instead of using the values ai(x) corresponding to an individual characteristic C, we must use a linear combination of these values:
| (63) |
Resulting algorithm
Use the same algorithm(s) as in Sections 2 and 3, except that instead of the values ai(x) corresponding to an individual statistical characteristic C, we should use the values (63).
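The combination step can be sketched as follows. The sketch assumes that each characteristic C contributes its coefficients ai(x) multiplied by the probability of C being requested and divided by the desired accuracy for C, which is what changing the order of summation in the average of the ratios suggests. The names `prob`, `target`, and `combined_coefficients`, as well as all the numbers, are hypothetical.

```python
def combined_coefficients(per_char, prob, target):
    """Combine the coefficients a_i(x) of several characteristics at one point x.

    per_char[C]: list of a_i(x) for characteristic C
    prob[C]:     assumed probability that users ask for C
    target[C]:   assumed desired accuracy for C
    """
    n = len(next(iter(per_char.values())))
    return [sum(prob[C] / target[C] * per_char[C][i] for C in per_char)
            for i in range(n)]

# Hypothetical example with two characteristics and two quantities.
per_char = {"mean": [1.0, 1.0], "cov": [0.5, 2.0]}
prob = {"mean": 0.75, "cov": 0.25}
target = {"mean": 0.5, "cov": 0.25}
a = combined_coefficients(per_char, prob, target)
print(a)
```

The resulting list can then be passed to the box-selection step in place of the coefficients of a single characteristic.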
Acknowledgment
Support for this project was provided by the National Institutes of Health (NIH), through a Small Business Innovation Research grant (award number 1R43TR000173-01) to Applied Biomathematics, but the views and opinions expressed herein should not be construed to be those of the National Institutes of Health.
The authors are thankful to the anonymous referees for valuable suggestions.
Contributor Information
G. Xiang, Email: gxiang@sigmaxi.net, Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA.
S. Ferson, Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA
L. Ginzburg, Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA
L. Longpré, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
E. Mayorga, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA
O. Kosheleva, Email: olgak@utep.edu, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA.
References
1. Ghinita G, Karras P, Kalnis P, Mamoulis N. A framework for efficient data anonymization under privacy and accuracy constraints. ACM Transactions on Database Systems. 2009;34(2):Article 9.
2. Klir GJ, Yuan B. Fuzzy Sets and Fuzzy Logic. Upper Saddle River, New Jersey: Prentice Hall; 1995.
3. Nguyen HT, Kreinovich V, Wu B, Xiang G. Computing Statistics under Interval and Fuzzy Uncertainty. Springer Verlag; 2012.
4. Nguyen HT, Walker EA. First Course in Fuzzy Logic. Boca Raton, Florida: CRC Press; 2006.
5. Sheskin DJ. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Florida: Chapman & Hall/CRC; 2007.
6. Sweeney L. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 2002;10(5):557–570.
7. Xiang G, Kreinovich V. Data anonymization that leads to the most accurate estimates of statistical characteristics. Proceedings of the IEEE Series of Symposia on Computational Intelligence SSCI’2013; April 16–19, 2013; Singapore. To appear.
8. Zadeh LA. Fuzzy sets. Information and Control. 1965;8:338–353.
