Author manuscript; available in PMC: 2014 Sep 1.
Published in final edited form as: Proc IFSA World Congr. 2013:611–616. doi: 10.1109/IFSA-NAFIPS.2013.6608471

Data Anonymization that Leads to the Most Accurate Estimates of Statistical Characteristics: Fuzzy-Motivated Approach

G Xiang 1, S Ferson 2, L Ginzburg 3, L Longpré 4, E Mayorga 5, O Kosheleva 6
PMCID: PMC4150686  NIHMSID: NIHMS479056  PMID: 25187183

Abstract

To preserve privacy, the original data points (with exact values) are replaced by boxes containing each (inaccessible) data point. This privacy-motivated uncertainty leads to uncertainty in the statistical characteristics computed based on this data. In a previous paper, we described how to minimize this uncertainty under the assumption that we use the same standard statistical estimates for the desired characteristics. In this paper, we show that we can further decrease the resulting uncertainty if we allow fuzzy-motivated weighted estimates, and we explain how to optimally select the corresponding weights.

I. Formulation of the Problem

Need to preserve privacy

In many practical applications, e.g., in medicine and in education, to better serve customers, it is important to know as much as possible about the potential customers. Customers are often reluctant to share information, since this information can be potentially used against them. For example, age can be used by companies to (unlawfully) discriminate against older job applicants. It is thus important to preserve privacy when storing customer data; see, e.g., [6].

How to preserve privacy: k-anonymity and ℓ-diversity

To maintain privacy, we divide the space of all possible combinations of values x = (x1, …, xn) into boxes

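$$B = [\tilde x_1 - \Delta_1(x),\ \tilde x_1 + \Delta_1(x)] \times \cdots \times [\tilde x_n - \Delta_n(x),\ \tilde x_n + \Delta_n(x)]. \tag{1}$$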

For each record, instead of storing the actual values xi, we only store the label of the box B containing x.

To avoid further loss of privacy, it is important to make sure that location in a box does not identify a person. This is usually achieved by requiring that for some fixed integer k, each box contains at least k records.

It is also not good if all records within a box have the same value of an i-th quantity xi. It is thus required that for some integer ℓ, each box should contain at least ℓ different values of each xi; see, e.g., [1].

Statistical data processing

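Based on the available data points $x^{(p)} = (x_1^{(p)}, \ldots, x_n^{(p)})$ ($1 \le p \le N$), we need to estimate averages $E_i$, variances $V_i = \sigma_i^2$, covariances $C_{ij}$, correlations $\rho_{ij}$, and other statistical characteristics. The means are usually estimated as follows:

$$E_i = \frac{1}{N} \cdot \sum_{p=1}^{N} x_i^{(p)}, \qquad E_j = \frac{1}{N} \cdot \sum_{p=1}^{N} x_j^{(p)}. \tag{2}$$

The covariance is usually estimated as:

$$C_{ij} = \frac{1}{N} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right) \cdot \left(x_j^{(p)} - E_j\right). \tag{3}$$

The variance is usually estimated by a formula

$$V_i = \frac{1}{N-1} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right)^2 \tag{4}$$

or, sometimes,

$$V_i = \frac{1}{N} \cdot \sum_{p=1}^{N} \left(x_i^{(p)} - E_i\right)^2, \tag{5}$$

and the correlation is estimated as

$$\rho_{ij} = \frac{C_{ij}}{\sigma_i \cdot \sigma_j}. \tag{6}$$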

Comment. We are interested in large databases, in which the number N of records is large. For large N, the difference between the usual un-biased estimate for $V_i$ (with N − 1 in the denominator) and the estimate with N is negligible. To simplify computations, in this paper, by $V_i$ and $\sigma_i$, we will mean the versions corresponding to N; our results can easily be reformulated for the un-biased estimates, which in our terms take the form $V_i \cdot \frac{N}{N-1}$ and $\sigma_i \cdot \sqrt{\frac{N}{N-1}}$.
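As a minimal sketch of formulas (2), (3), (5), and (6), the following Python code computes the estimates with N in the denominator, as assumed in the rest of the paper. The function names and the sample records are illustrative assumptions, not part of the original method.

```python
import math

def mean(values):
    """Formula (2): arithmetic average of the N values."""
    return sum(values) / len(values)

def covariance(xs, ys):
    """Formula (3): covariance with N in the denominator."""
    ex, ey = mean(xs), mean(ys)
    return sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / len(xs)

def variance(xs):
    """Formula (5): the covariance of a variable with itself."""
    return covariance(xs, xs)

def correlation(xs, ys):
    """Formula (6): covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

# Hypothetical records: ages and incomes (in thousands) of four people.
ages = [30.0, 40.0, 50.0, 60.0]
incomes = [35.0, 45.0, 50.0, 70.0]
print(mean(ages), variance(ages), covariance(ages, incomes))
print(correlation(ages, incomes))
```

Defining the variance as the covariance of a variable with itself keeps the two estimates consistent, so the correlation of a variable with itself is 1 up to rounding.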

In statistical data processing, privacy leads to uncertainty

To maintain privacy, we replace each numerical value xi(p) with the corresponding interval. Different values from these intervals lead, in general, to different values of the resulting statistical characteristics. Hence, for each characteristic, we get a whole interval of possible values.

If this interval is too wide, the resulting range is useless (e.g., for correlation, the interval [−1, 1] is useless). It is therefore desirable to select, among all possible subdivisions into boxes which preserve k-anonymity (and ℓ-diversity), the one which leads to the narrowest intervals for the desired statistical characteristic.

What we do in this paper

In Section 2, following [7], we describe how this problem is solved now. Please note that because our objective is to generalize these formulas to the weighted case, the notations that we use in Section 2 are slightly different from the notations from [7].

Then, in Section 3, we explain how fuzzy-motivated ideas can improve the corresponding estimates.

II. How This Problem Is Solved Now

Estimating accuracy caused by privacy-based subdivision into boxes: case of k-anonymity

To minimize uncertainty, we select the smallest boxes. Hence, each box B should have exactly k records.

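For each combination of values $x_i^{(p)}$ from the corresponding intervals $[\tilde x_i^{(p)} - \Delta_i^{(p)},\ \tilde x_i^{(p)} + \Delta_i^{(p)}]$, we get:

$$C\left(x_1^{(1)}, x_2^{(1)}, \ldots, x_n^{(N)}\right) = C\left(\tilde x_1^{(1)} + \Delta x_1^{(1)},\ \tilde x_2^{(1)} + \Delta x_2^{(1)},\ \ldots,\ \tilde x_n^{(N)} + \Delta x_n^{(N)}\right), \tag{7}$$

where each difference $\Delta x_i^{(p)} \stackrel{\text{def}}{=} x_i^{(p)} - \tilde x_i^{(p)}$ satisfies the inequality $|\Delta x_i^{(p)}| \le \Delta_i^{(p)}$. When we have many records, boxes are small, so we can use a linear approximation:

$$C = \tilde C + \sum_{p=1}^{N} \sum_{i=1}^{n} \frac{\partial C}{\partial x_i} \cdot \Delta x_i^{(p)}, \tag{8}$$

where $\tilde C \stackrel{\text{def}}{=} C\left(\tilde x_1^{(1)}, \tilde x_2^{(1)}, \ldots, \tilde x_n^{(N)}\right)$. The range of this linear expression is $[\tilde C - \Delta,\ \tilde C + \Delta]$, where

$$\Delta \stackrel{\text{def}}{=} \sum_{p=1}^{N} \sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i}\right| \cdot \Delta_i^{(p)} = k \cdot \sum_{B} \sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i}\right| \cdot \Delta_i(x) \quad (x \in B). \tag{9}$$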
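To illustrate formula (9), here is a small sketch for the simplest case, the mean of a single variable, where each partial derivative equals 1/N and the linear approximation is exact. The helper name `linearized_half_width` and the numbers are hypothetical.

```python
def linearized_half_width(gradient, half_widths):
    """Formula (9): sum of |dC/dx| * Delta over all stored values."""
    return sum(abs(g) * d for g, d in zip(gradient, half_widths))

# Hypothetical data: four ages stored as intervals [midpoint - 2.5, midpoint + 2.5].
midpoints = [30.0, 40.0, 50.0, 60.0]
half_widths = [2.5, 2.5, 2.5, 2.5]
n_records = len(midpoints)

# For the mean, the derivative with respect to every value is 1/N.
gradient = [1.0 / n_records] * n_records

center = sum(midpoints) / n_records
delta = linearized_half_width(gradient, half_widths)
print(center - delta, center + delta)  # range of possible values of the mean
```

For nonlinear characteristics such as the variance or the correlation, the same helper gives only an approximate half-width, which is accurate when the boxes are small.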

Expressions for the corresponding partial derivatives

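The estimate for the accuracy Δ is described in terms of partial derivatives $\frac{\partial C}{\partial x_i}$ of the statistical characteristic C. For the mean $E_i$, the derivative is equal to

$$\frac{\partial E_i}{\partial x_i} = \frac{1}{N}. \tag{10}$$

For the variance $V_i$, we have

$$\frac{\partial V_i}{\partial x_i} = \frac{2 \cdot (x_i - E_i)}{N}. \tag{11}$$

Therefore, for $\sigma_i = \sqrt{V_i}$, we get

$$\frac{\partial \sigma_i}{\partial x_i} = \frac{x_i - E_i}{N \cdot \sigma_i}. \tag{12}$$

For the covariance $C_{ij}$, we have

$$\frac{\partial C_{ij}}{\partial x_i} = \frac{x_j - E_j}{N}. \tag{13}$$

For the correlation $\rho_{ij}$, we have:

$$\frac{\partial \rho_{ij}}{\partial x_i} = \frac{1}{N} \cdot \frac{(x_j - E_j) - \frac{C_{ij}}{\sigma_i^2} \cdot (x_i - E_i)}{\sigma_i \cdot \sigma_j}. \tag{14}$$

For all these characteristics C, the derivative takes the form

$$\frac{\partial C}{\partial x_i} = \frac{1}{N} \cdot b_i(x) \tag{15}$$

for some expression $b_i(x)$.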
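As a sanity check of formula (14), the sketch below compares the analytical derivative of the correlation with a central finite difference. The data values, the function names, and the step size are assumptions made only for this illustration.

```python
import math

def moments(xi, xj):
    """Means, variances (with N in the denominator), and covariance."""
    n = len(xi)
    ei, ej = sum(xi) / n, sum(xj) / n
    vi = sum((a - ei) ** 2 for a in xi) / n
    vj = sum((b - ej) ** 2 for b in xj) / n
    cij = sum((a - ei) * (b - ej) for a, b in zip(xi, xj)) / n
    return ei, ej, vi, vj, cij

def correlation(xi, xj):
    _, _, vi, vj, cij = moments(xi, xj)
    return cij / math.sqrt(vi * vj)

def correlation_derivative(xi, xj, p):
    """Formula (14): derivative of rho_ij with respect to the p-th value of x_i."""
    n = len(xi)
    ei, ej, vi, vj, cij = moments(xi, xj)
    return ((xj[p] - ej) - (cij / vi) * (xi[p] - ei)) / (n * math.sqrt(vi * vj))

def numerical_derivative(xi, xj, p, step=1e-6):
    """Central finite difference, used only as an independent check."""
    up = xi[:p] + [xi[p] + step] + xi[p + 1:]
    down = xi[:p] + [xi[p] - step] + xi[p + 1:]
    return (correlation(up, xj) - correlation(down, xj)) / (2 * step)

xi = [1.0, 2.0, 4.0, 7.0]
xj = [2.0, 1.0, 5.0, 6.0]
for p in range(len(xi)):
    print(p, correlation_derivative(xi, xj, p), numerical_derivative(xi, xj, p))
```

In the notation of (15), multiplying `correlation_derivative` by N gives the expression $b_i(x)$ for the correlation.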

Towards optimal subdivision into boxes

The overall expression for Δ is a sum of terms corresponding to different points. So, to minimize Δ, we must, for each point, minimize the corresponding term

$$\sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i}\right| \cdot \Delta_i(x). \tag{16}$$

Because of the relation between the partial derivatives and $b_i(x)$, this minimization is equivalent to minimizing the term $\sum_{i=1}^{n} a_i(x) \cdot \Delta_i(x)$, where we denoted $a_i(x) \stackrel{\text{def}}{=} |b_i(x)|$.

The only constraint on the values $\Delta_i(x)$ is that the corresponding box should contain exactly k different points. The number of points can be obtained by multiplying the data density ρ(x) by the box volume $\prod_{i=1}^{n} (2\Delta_i(x))$. The data density can be estimated based on the data. So, we minimize the expression

$$\sum_{i=1}^{n} a_i(x) \cdot \Delta_i(x) \tag{17}$$

under the constraint

$$\rho(x) \cdot 2^n \cdot \prod_{i=1}^{n} \Delta_i(x) = k. \tag{18}$$

(Asymptotically) optimal subdivision into boxes (case of k-anonymity)

The Lagrange multiplier technique leads to

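$$\Delta_i(x) = \frac{c(x)}{a_i(x)}, \tag{19}$$

for some c(x). From the constraint (18), we get

$$c(x) = \frac{1}{2} \cdot \sqrt[n]{\frac{k}{\rho(x)} \cdot \prod_{j=1}^{n} a_j(x)}. \tag{20}$$

This means that around each point x, we need to select the box with half-widths

$$\Delta_i(x) = \frac{1}{2} \cdot \sqrt[n]{\frac{k}{\rho(x)}} \cdot \frac{\sqrt[n]{\prod_{j=1}^{n} a_j(x)}}{a_i(x)}. \tag{21}$$

The resulting accuracy is equal to

$$\Delta = \frac{n}{N} \cdot \sum_{x} c(x), \tag{22}$$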

where the sum is taken over all N data points x.
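Formulas (19)-(21) can be sketched in a few lines of Python. The function name, the sample coefficients, and the density value below are hypothetical; the sketch also assumes that all coefficients $a_i(x)$ are strictly positive.

```python
import math

def optimal_half_widths(a, density, k):
    """Return c(x) from (20) and the half-widths Delta_i(x) = c(x) / a_i(x) from (19)."""
    n = len(a)
    c = 0.5 * (k * math.prod(a) / density) ** (1.0 / n)
    return c, [c / a_i for a_i in a]

# Hypothetical point: n = 2 variables, local density 25 records per unit area, k = 4.
a = [1.0, 4.0]
density, k = 25.0, 4
c, deltas = optimal_half_widths(a, density, k)

# Constraint (18): density times box volume should give back k records.
records_in_box = density * math.prod(2 * d for d in deltas)
print(c, deltas, records_in_box)
```

The box is narrower along the variable with the larger coefficient, since uncertainty in that variable affects the estimated characteristic more strongly.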

We need to dismiss rare points

In many practical situations, we have rare points, for which the smallest box containing k of them is huge. Such a big-size box will contribute a large amount of uncertainty to Δ; so we should dismiss such rare points.

If we select a subset S ⊂ {1, 2, …, N} of the set of N original points, then the privacy-related uncertainty reduces to

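$$\frac{n}{\#(S)} \cdot \sum_{x \in S} c(x), \tag{23}$$

where #(S) denotes the number of points in the set S. The statistical accuracy reduces to

$$\frac{A}{\sqrt{\#(S)}} \tag{24}$$

(see, e.g., [5]). Minimizing the sum

$$\frac{n}{\#(S)} \cdot \sum_{x \in S} c(x) + \frac{A}{\sqrt{\#(S)}} \tag{25}$$

leads to selecting all x with $c(x) \le c_0$, where $c_0$ minimizes the sum

$$\frac{n}{\#\{x : c(x) \le c_0\}} \cdot \sum_{x : c(x) \le c_0} c(x) + \frac{A}{\sqrt{\#\{x : c(x) \le c_0\}}}. \tag{26}$$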
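One straightforward way to find the threshold $c_0$ in (26) is to sort the values c(x) and evaluate the sum for every candidate threshold. The sketch below does this with a running sum; the function name and the sample values of c(x), n, and A are assumptions for illustration.

```python
import math

def best_threshold(c_values, n, A):
    """Return (c0, value of expression (26)) for the best threshold c0."""
    ordered = sorted(c_values)
    best = None
    running_sum = 0.0
    for count, c in enumerate(ordered, start=1):
        running_sum += c
        # Points with equal c(x) are kept or dismissed together.
        if count < len(ordered) and ordered[count] == c:
            continue
        total = n * running_sum / count + A / math.sqrt(count)
        if best is None or total < best[1]:
            best = (c, total)
    return best

# Hypothetical values of c(x): three ordinary points and one rare point.
c_values = [0.1, 5.0, 0.2, 0.1]
c0, total = best_threshold(c_values, n=2, A=1.0)
kept = [c for c in c_values if c <= c0]
print(c0, total, kept)
```

In this example the rare point with c(x) = 5.0 is dismissed, because keeping it would increase the privacy-related term far more than it would decrease the statistical term.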

Examples

For estimating the mean Ei, we have ai(x) = const and thus,

$$c(x) = \text{const} \cdot \frac{1}{\sqrt[n]{\rho(x)}}. \tag{27}$$

In this case, c(x) is a decreasing function of density. So, dismissing points with $c(x) > c_0$ is equivalent to dismissing all the points with $\rho(x) < \rho_0$ (for some $\rho_0$).

For computing covariance $C_{ij}$, by (13), the derivatives with respect to $x_i$ and $x_j$ are proportional to $x_j - E_j$ and $x_i - E_i$, respectively. Thus, the product $a_i(x) \cdot a_j(x)$ is proportional to $|x_i - E_i| \cdot |x_j - E_j|$. So, the upper threshold $c_0$ on c(x) is equivalent to the lower threshold on the ratio

$$\frac{\rho(x)}{|x_i - E_i| \cdot |x_j - E_j|}. \tag{28}$$

Hence, we can also use points x with small ρ(x), provided that $x_i$ or $x_j$ is close to the corresponding mean. Using extra points x improves accuracy.

How to also take into account ℓ-diversity

Up to now, we only took into account the k-anonymity requirement. We also need to take into account that within each box, for each variable xi, there are ≥ ℓ different values of xi. To formalize this requirement, we first need to describe what “different” means.

Usually, for each variable i, different means that

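$$|x_i - x_i'| \ge \varepsilon_i \tag{29}$$

for some threshold $\varepsilon_i$. Thus, ℓ different values means that $2\Delta_i(x) \ge \ell \cdot \varepsilon_i$. So, the problem is to find $\Delta_i(x)$ such that

$$\sum_{i=1}^{n} a_i(x) \cdot \Delta_i(x) \to \min \tag{30}$$

under the constraints

$$\prod_{i=1}^{n} \Delta_i(x) \ge \frac{k}{2^n \cdot \rho(x)} \tag{31}$$

and

$$2\Delta_i(x) \ge \ell \cdot \varepsilon_i \tag{32}$$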

for all i.

According to [7], the solution to this optimization problem is as follows: around each point x, we first compute the values

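$$\Delta_i(x) = \frac{1}{2} \cdot \sqrt[n]{\frac{k}{\rho(x)}} \cdot \frac{\sqrt[n]{\prod_{j=1}^{n} a_j(x)}}{a_i(x)}. \tag{33}$$

If $2\Delta_i(x) \ge \ell \cdot \varepsilon_i$ for all i, we select $\Delta_i(x)$. Otherwise, we sort the variables by $a_i(x) \cdot \varepsilon_i$:

$$a_1(x) \cdot \varepsilon_1 \ge a_2(x) \cdot \varepsilon_2 \ge \cdots \ge a_n(x) \cdot \varepsilon_n. \tag{34}$$

Then, for each t from 1 to n, we compute

$$c_t = \frac{1}{2} \cdot \left(\frac{k \cdot \prod_{i=t+1}^{n} a_i(x)}{\rho(x) \cdot \ell^t \cdot \prod_{i=1}^{t} \varepsilon_i}\right)^{1/(n-t)}. \tag{35}$$

For each t, if $2c_t \ge \ell \cdot a_{t+1}(x) \cdot \varepsilon_{t+1}$, we compute

$$\Delta^{(t)} \stackrel{\text{def}}{=} \frac{1}{2} \cdot \ell \cdot \sum_{i=1}^{t} a_i(x) \cdot \varepsilon_i + (n - t) \cdot c_t. \tag{36}$$

We select t for which $\Delta^{(t)}$ is the smallest, and take:

  • $\Delta_i(x) = \frac{1}{2} \cdot \ell \cdot \varepsilon_i$ for $i \le t$, and

  • $\Delta_i(x) = \frac{c_t}{a_i(x)}$ for $i > t$.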

Comment. The computation time of this algorithm is quadratic in n. This is OK, since the number n of different characteristics is usually reasonably small. What is important is that the algorithm is still linear-time in terms of the number of records N.
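The steps (33)-(36) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation from [7]: the function name and sample numbers are hypothetical, all $a_i(x)$ are assumed positive, the loop covers t = 1, …, n − 1 because the exponent in (35) is undefined for t = n, and when no such t is feasible the sketch assumes that every half-width is set to its ℓ-diversity minimum.

```python
import math

def half_widths_with_diversity(a, eps, density, k, ell):
    """Half-widths Delta_i(x) under k-anonymity and l-diversity, following (33)-(36)."""
    n = len(a)
    # Unconstrained solution (33).
    c = 0.5 * (k * math.prod(a) / density) ** (1.0 / n)
    deltas = [c / a_i for a_i in a]
    if all(2 * d >= ell * e for d, e in zip(deltas, eps)):
        return deltas

    # Indices sorted so that a_i * eps_i is non-increasing, as in (34).
    order = sorted(range(n), key=lambda i: a[i] * eps[i], reverse=True)
    best = None  # (value of (36), t, c_t)
    for t in range(1, n):  # t = n is covered by the fallback below
        fixed, free = order[:t], order[t:]
        ratio = k * math.prod(a[i] for i in free) / (
            density * ell ** t * math.prod(eps[i] for i in fixed))
        c_t = 0.5 * ratio ** (1.0 / (n - t))  # formula (35)
        nxt = free[0]
        if 2 * c_t >= ell * a[nxt] * eps[nxt]:
            value = 0.5 * ell * sum(a[i] * eps[i] for i in fixed) + (n - t) * c_t
            if best is None or value < best[0]:
                best = (value, t, c_t)

    # Assumed fallback: every half-width at its l-diversity minimum.
    result = [0.5 * ell * e for e in eps]
    if best is not None:
        _, t, c_t = best
        for i in order[t:]:
            result[i] = c_t / a[i]
    return result

# Hypothetical point: the second variable needs a wider box to hold 2 distinct values.
print(half_widths_with_diversity([1.0, 4.0], [0.01, 0.5], 25.0, 4, 2))
```

In the printed example, the second half-width is raised to its ℓ-diversity minimum, and the first one shrinks so that the box still contains k records.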

III. Fuzzy-Motivated Idea

Main idea

In [7], to improve the accuracy of the resulting estimate, we propose to ignore some data points while keeping other data points. In other words, we propose a crisp separation between data points that we keep and data points that we ignore. Fuzzy logic has taught us that in many cases, it is beneficial to replace such a crisp separation with a “fuzzy” one in which, instead of ignoring or keeping a data point, we take a data point with a certain degree; see, e.g., [2], [4], [8].

Implementing the idea

Specifically, instead of using the above formula for computing the statistical characteristics, in which all data points are treated equally, we assign a weight w(x) ≥ 0 to each data point so that ∑w(x) = 1, and use the weighted estimates for all the statistical characteristics:

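$$E_i = \sum_{x} w(x) \cdot x_i, \qquad \sigma_i^2 = \sum_{x} w(x) \cdot (x_i - E_i)^2, \tag{37}$$
$$C_{ij} = \sum_{x} w(x) \cdot (x_i - E_i) \cdot (x_j - E_j), \qquad \rho_{ij} = \frac{C_{ij}}{\sigma_i \cdot \sigma_j}. \tag{38}$$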
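The weighted estimates (37)-(38) can be sketched in Python as follows. The function name, the dictionary keys, and the sample records are assumptions; the weights are expected to be non-negative and to sum to 1.

```python
import math

def weighted_estimates(xi, xj, w):
    """Weighted means, variances, covariance, and correlation from (37)-(38)."""
    ei = sum(wp * a for wp, a in zip(w, xi))
    ej = sum(wp * b for wp, b in zip(w, xj))
    vi = sum(wp * (a - ei) ** 2 for wp, a in zip(w, xi))
    vj = sum(wp * (b - ej) ** 2 for wp, b in zip(w, xj))
    cij = sum(wp * (a - ei) * (b - ej) for wp, a, b in zip(w, xi, xj))
    return {"E_i": ei, "E_j": ej, "V_i": vi, "V_j": vj,
            "C_ij": cij, "rho_ij": cij / math.sqrt(vi * vj)}

# Hypothetical records; the last one is down-weighted in the second call.
ages = [30.0, 40.0, 50.0, 60.0]
incomes = [35.0, 45.0, 50.0, 70.0]
print(weighted_estimates(ages, incomes, [0.25, 0.25, 0.25, 0.25]))
print(weighted_estimates(ages, incomes, [0.3, 0.3, 0.3, 0.1]))
```

With equal weights 1/N, these expressions coincide with the standard estimates that use N in the denominator.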

Optimization problem

Our objective is to find the weights w(x) for which the resulting uncertainty is the smallest possible. Similarly to the crisp case, this uncertainty consists of two parts: the part coming from the privacy-motivated uncertainty and the part coming from the fact that the sample size is finite.

One can check that for privacy-motivated uncertainty, the corresponding derivatives $\frac{\partial C}{\partial x_i}$ are proportional to the weight w(x). For each box, we thus face the exact same optimization problem for finding the best sizes $\Delta_i(x)$ of the corresponding privacy-related box. As a result, for the overall privacy-motivated uncertainty, we get the expression $n \cdot \sum_{x} w(x) \cdot c(x)$.

For the statistical part: if we simply estimate the variance of the estimate for the mean $E_i = \sum_{x} w(x) \cdot x_i$, then, due to the fact that the variance of the sum of independent variables is equal to the sum of the variances, we conclude that the variance of this estimate is proportional to $\sum_{x} w^2(x)$; see, e.g., [5]. Thus, the standard deviation of this estimate is proportional to

$$\sqrt{\sum_{x} w^2(x)}. \tag{39}$$

For the traditional equal-weight estimate, when

$$w(x) = \frac{1}{\#(S)} \tag{40}$$

for all x, the proportionality coefficient becomes equal to the expression

$$\frac{1}{\sqrt{\#(S)}} \tag{41}$$

that we used in Section 2.

One can check that, similarly, estimates for the accuracy of other statistical characteristics can be obtained from the estimates provided in Section 2 by replacing $\frac{1}{\sqrt{\#(S)}}$ with $\sqrt{\sum_{x} w^2(x)}$, i.e., this part is equal to

$$A \cdot \sqrt{\sum_{x} w^2(x)}. \tag{42}$$

Thus, to minimize the overall inaccuracy, we need to minimize the following sum:

$$n \cdot \sum_{x} w(x) \cdot c(x) + A \cdot \sqrt{\sum_{x} w^2(x)} \tag{43}$$

under the constraints $\sum_{x} w(x) = 1$ and $w(x) \ge 0$.
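A short sketch of the objective (43) makes the trade-off concrete. With equal weights it reproduces the crisp expression (25), and shifting weight away from a point with large c(x) can reduce the total. The function name and all numbers are hypothetical.

```python
import math

def overall_inaccuracy(w, c, n, A):
    """Expression (43): privacy-related part plus statistical part."""
    privacy_part = n * sum(wx * cx for wx, cx in zip(w, c))
    statistical_part = A * math.sqrt(sum(wx * wx for wx in w))
    return privacy_part + statistical_part

# Hypothetical values of c(x) for four points; the last one is a rare point.
c = [0.1, 0.1, 0.2, 5.0]
n, A = 2, 1.0

equal = [0.25, 0.25, 0.25, 0.25]   # traditional equal-weight estimate (40)
skewed = [0.35, 0.35, 0.30, 0.0]   # the rare point gets zero weight
print(overall_inaccuracy(equal, c, n, A), overall_inaccuracy(skewed, c, n, A))
```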

Solving the resulting optimization problem: general idea

By applying the Lagrange multiplier method to the above constrained optimization problem, we can reduce this problem to the following unconstrained optimization problem:

$$n \cdot \sum_{x} w(x) \cdot c(x) + A \cdot \sqrt{\sum_{x} w^2(x)} - \lambda \cdot \left(\sum_{x} w(x) - 1\right) \to \min, \tag{44}$$

for an appropriate Lagrange multiplier λ. Differentiating this objective function with respect to w(x) and equating the derivative to 0, we conclude that

$$n \cdot c(x) + A \cdot \frac{w(x)}{\sqrt{\sum_{y} w^2(y)}} - \lambda = 0, \tag{45}$$

i.e., that

$$w(x) = \frac{1}{A} \cdot (\lambda - n \cdot c(x)) \cdot \sqrt{\sum_{y} w^2(y)}. \tag{46}$$

To be more precise, since we require that w(x) ≥ 0, this formula only holds when n · c(x) ≤ λ; when n · c(x) > λ, we should get w(x) = 0.

Towards computing the auxiliary parameter λ

How can we find λ? By squaring both sides of this formula, we get

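$$w^2(x) = \frac{1}{A^2} \cdot (\lambda - n \cdot c(x))^2 \cdot \sum_{y} w^2(y). \tag{47}$$

By adding left- and right-hand sides corresponding to different x, we get

$$\sum_{x} w^2(x) = \frac{1}{A^2} \cdot \left(\sum_{x} (\lambda - n \cdot c(x))^2\right) \cdot \sum_{y} w^2(y). \tag{48}$$

Dividing both sides of this equality by $\sum_{x} w^2(x) = \sum_{y} w^2(y)$, we conclude that

$$1 = \frac{1}{A^2} \cdot \sum_{x} (\lambda - n \cdot c(x))^2, \tag{49}$$

i.e., that

$$\sum_{x} (\lambda - n \cdot c(x))^2 - A^2 = 0. \tag{50}$$

This is a quadratic equation in terms of λ, namely:

$$\tilde N \cdot \lambda^2 - 2\lambda \cdot n \cdot \sum_{x} c(x) + n^2 \cdot \sum_{x} c^2(x) - A^2 = 0, \tag{51}$$

where $\tilde N$ is the total number of points that we did not dismiss, i.e., for which n · c(x) < λ, and the sums are taken over all such points.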

From this quadratic equation, we can find λ. Thus, we naturally arrive at the following iterative algorithm for computing λ.

Iterative algorithm for computing the auxiliary parameter λ

The goal of this algorithm is to find the threshold value λ, so that points x for which n · c(x) ≥ λ will be dismissed from our estimates (i.e., we would have w(x) = 0 for such points).

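In the beginning, we do not have any reason to dismiss any values, so we start with a first approximation $\lambda_0$ that is large enough for no point to be dismissed, i.e., such that $n \cdot c(x) < \lambda_0$ for all x.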

On each iteration k, we start with the value $\lambda_{k-1}$ obtained on the previous iteration, and compute the next approximation $\lambda_k$ as follows.

  • First, we compute the total number $\tilde N$ of points x for which $n \cdot c(x) < \lambda_{k-1}$.

  • Then, we compute the sums $\sum_{x} c(x)$ and $\sum_{x} c^2(x)$ over all such points.

  • Based on these values, we solve the quadratic equation (51) and find the next approximation $\lambda_k$.

We stop iterations when the process converges, i.e., when

$$\lambda_k = \lambda_{k-1}. \tag{52}$$
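The iteration described above can be sketched as follows. Several choices here are assumptions that the text does not fix: the starting value is taken to be infinity, the larger root of (51) is used (so that λ exceeds the average of n · c(x) over the kept points), and the sketch raises an error instead of continuing when (51) has no real root or when the iteration does not settle. The function name and the sample values are hypothetical.

```python
import math

def find_lambda(c_values, n, A, max_iterations=100):
    """Iterate on the quadratic equation (51) until lambda stops changing."""
    lam = math.inf  # assumed starting value: no point is dismissed
    for _ in range(max_iterations):
        kept = [c for c in c_values if n * c < lam]
        if not kept:
            raise ValueError("all points were dismissed")
        count = len(kept)
        s1 = sum(kept)
        s2 = sum(c * c for c in kept)
        # Quadratic (51): count*lam^2 - 2*lam*n*s1 + n^2*s2 - A^2 = 0.
        discriminant = (n * s1) ** 2 - count * (n * n * s2 - A * A)
        if discriminant < 0:
            raise ValueError("equation (51) has no real root for the kept points")
        new_lam = (n * s1 + math.sqrt(discriminant)) / count  # larger root (assumption)
        if new_lam == lam:  # stopping rule (52)
            return lam
        lam = new_lam
    raise RuntimeError("no convergence within max_iterations")

# Hypothetical values of c(x) for four points, with n = 1 and A = 2.
c_values = [0.1, 0.2, 0.3, 2.0]
lam = find_lambda(c_values, n=1, A=2.0)
print(lam, [c for c in c_values if c < lam])
```

In this example the first pass keeps all four points, the second pass dismisses the point with c(x) = 2.0, and the third pass reproduces the same λ, so the iteration stops. When the values c(x) are widely spread relative to A, the first quadratic may have no real root; the sketch only reports this case.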

Towards computing w(x)

We know, from the formula (46), that for those points for which n · c(x) < λ, we have

$$w(x) = K \cdot (\lambda - n \cdot c(x)), \tag{53}$$

for some constant K. To find K, we can use the fact that $\sum_{x} w(x) = 1$. Substituting the expression (53) into this constraint, we conclude that

$$1 = K \cdot \left(\tilde N \cdot \lambda - n \cdot \sum_{x} c(x)\right). \tag{54}$$

Since we have already computed the values $\tilde N$, λ, and $\sum_{x} c(x)$ when we computed λ, we can thus compute K.

So, we arrive at the following formula for computing the desired weights.

Formula for computing the optimal weights w(x)

By running the above iterative algorithm, we have computed the auxiliary value λ. In the process of computing λ, we have computed the values $\tilde N$ and $\sum_{x} c(x)$, where the sum is taken over all the points x for which n · c(x) < λ.

Now, we compute

$$K = \frac{1}{\tilde N \cdot \lambda - n \cdot \sum_{x} c(x)}. \tag{55}$$

The optimal weights can now be computed as follows:

  • when n · c(x) ≥ λ, the optimal weight is w(x) = 0;

  • when n · c(x) < λ, the optimal weight is equal to
    $$w(x) = K \cdot (\lambda - n \cdot c(x)). \tag{56}$$

Comment. As expected, the larger the uncertainty contribution c(x) from a point, the smaller the weight with which we take this point. When this contribution is large enough (i.e., larger than the threshold determined by the auxiliary parameter λ), we completely ignore such points.
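Given a threshold λ, the weights can be computed in a few lines. The sketch follows (46), where the weight of a kept point is proportional to λ − n · c(x), and normalizes the weights so that they sum to 1. The function name and the value of λ used in the example are hypothetical, and the sketch assumes that at least one point satisfies n · c(x) < λ.

```python
def optimal_weights(c_values, n, lam):
    """Weights proportional to lam - n*c(x) for kept points, zero for dismissed ones."""
    kept = [c for c in c_values if n * c < lam]
    # Normalization constant K, chosen so that the weights sum to 1.
    K = 1.0 / (len(kept) * lam - n * sum(kept))
    return [K * (lam - n * c) if n * c < lam else 0.0 for c in c_values]

# Hypothetical values of c(x); lam = 1.0 is a made-up threshold for illustration.
c_values = [0.1, 0.2, 0.3, 2.0]
weights = optimal_weights(c_values, n=1, lam=1.0)
print(weights, sum(weights))
```

Unlike the crisp selection, the kept points do not receive equal weights: the point with the smallest c(x) gets the largest weight.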

IV. Boxes Appropriate for Several Different Characteristics

What we provided before

In the previous sections, we described how, for each statistical characteristic C, we can find the boxes (i.e., the data anonymization) that lead to the most accurate estimate of this selected characteristic.

Remaining problem

In practice, we may need to compute the values of different statistical characteristics. The problem is that optimal boxes corresponding to different characteristics C are, in general, different.

For example, boxes that lead to the most accurate estimates $\tilde E$ of the mean E may lead to very inaccurate estimates $\tilde C_{ij}$ of the covariance $C_{ij}$, and vice versa.

Towards a possible solution to this problem

Based on the previous experience, we know how many times users were looking for values of different statistical characteristics; in other words, we know the probabilities $p_C \ge 0$ ($\sum_{C} p_C = 1$) of looking for different characteristics C.

We also know what accuracy $\Delta_0^C$ is desirable for estimating each characteristic C. For example, we may fix the same relative error for all estimates, and take, e.g., $\Delta_0^C = 0.1 \cdot \tilde C$ if this relative error is 10%. Then, for each characteristic C, the accuracy of estimating this characteristic is better gauged not by the absolute accuracy $\Delta^C$ but rather by the ratio

$$q_C \stackrel{\text{def}}{=} \frac{\Delta^C}{\Delta_0^C} \tag{57}$$

describing how close we are to the desired accuracy.

In this situation, a reasonable idea is to minimize average quality

$$q \stackrel{\text{def}}{=} \sum_{C} p_C \cdot q_C. \tag{58}$$

Towards an algorithm

How can we solve the corresponding optimization problem? The objective function q has the form

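$$q = \sum_{C} \frac{p_C}{\Delta_0^C} \cdot \Delta^C, \tag{59}$$

i.e., the form

$$q = \sum_{C} \frac{p_C}{\Delta_0^C} \cdot \sum_{p=1}^{N} \sum_{i=1}^{n} \left|\frac{\partial C}{\partial x_i}\right| \cdot \Delta_i^{(p)}. \tag{60}$$

By changing the order of summation, we get an equivalent formula

$$q = \sum_{p=1}^{N} \sum_{i=1}^{n} \left(\sum_{C} \frac{p_C}{\Delta_0^C} \cdot \left|\frac{\partial C}{\partial x_i}\right|\right) \cdot \Delta_i^{(p)}. \tag{61}$$

This optimization problem is similar to the optimization problem corresponding to the case of a single statistical characteristic C, with the only difference that instead of the original partial derivatives $\frac{\partial C}{\partial x_i}$, we use a weighted combination

$$\sum_{C} \frac{p_C}{\Delta_0^C} \cdot \left|\frac{\partial C}{\partial x_i}\right| \tag{62}$$

of these derivatives.

In terms of the coefficients $a_i(x)$ introduced in Section 2, this means that instead of using the values $a_i^C(x)$ corresponding to an individual characteristic C, we must use a linear combination of these values:

$$a_i(x) = \sum_{C} \frac{p_C}{\Delta_0^C} \cdot a_i^C(x). \tag{63}$$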

Resulting algorithm

Use the same algorithm(s) as in Sections 2 and 3, except that instead of the values $a_i^C$ corresponding to an individual statistical characteristic C, we should use the values (63).
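The combination (63) is a weighted sum, which can be sketched as below. The dictionary layout, the characteristic labels, and all numbers are assumptions made for this illustration.

```python
def combined_coefficients(a_by_characteristic, probabilities, target_accuracies):
    """Formula (63): a_i(x) as the sum over C of (p_C / Delta_0^C) * a_i^C(x)."""
    names = list(a_by_characteristic)
    n = len(a_by_characteristic[names[0]])
    return [
        sum(probabilities[C] / target_accuracies[C] * a_by_characteristic[C][i]
            for C in names)
        for i in range(n)
    ]

# Hypothetical setting: two characteristics, n = 2 variables, one point x.
a_by_characteristic = {"mean": [1.0, 1.0], "covariance": [2.0, 3.0]}
probabilities = {"mean": 0.75, "covariance": 0.25}       # how often each is requested
target_accuracies = {"mean": 0.5, "covariance": 0.25}    # desired accuracies

print(combined_coefficients(a_by_characteristic, probabilities, target_accuracies))
```

The resulting list can then be used wherever the single-characteristic coefficients $a_i(x)$ appear in the box-size formulas.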

Acknowledgment

Support for this project was provided by the National Institutes of Health (NIH), through a Small Business Innovation Research grant (award number 1R43TR000173-01) to Applied Biomathematics, but the views and opinions expressed herein should not be construed to be those of the National Institutes of Health.

The authors are thankful to the anonymous referees for valuable suggestions.

Contributor Information

G. Xiang, Email: gxiang@sigmaxi.net, Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA.

S. Ferson, Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA

L. Ginzburg, Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA

L. Longpré, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

E. Mayorga, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

O. Kosheleva, Email: olgak@utep.edu, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA.

References

  • 1. Ghinita G, Karras P, Kalnis P, Mamoulis N. A framework for efficient data anonymization under privacy and accuracy constraints. ACM Transactions on Database Systems. 2009;Vol. 34(No. 2) Article 9.
  • 2. Klir GJ, Yuan B. Fuzzy Sets and Fuzzy Logic. Upper Saddle River, New Jersey: Prentice Hall; 1995.
  • 3. Nguyen HT, Kreinovich V, Wu B, Xiang G. Computing Statistics under Interval and Fuzzy Uncertainty. Springer Verlag; 2012.
  • 4. Nguyen HT, Walker EA. First Course In Fuzzy Logic. Boca Raton, Florida: CRC Press; 2006.
  • 5. Sheskin DJ. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Florida: Chapman & Hall/CRC; 2007.
  • 6. Sweeney L. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based System. 2002;Vol. 10(No. 5):557–570.
  • 7. Xiang G, Kreinovich V. Data anonymization that leads to the most accurate estimates of statistical characteristics; Proceedings of the IEEE Series of Symposia on Computational Intelligence SSCI’2013; April 16–19, 2013; Singapore. to appear.
  • 8. Zadeh LA. Fuzzy sets. Information and Control. 1965;Vol. 8:338–353.
