Data Anonymization that Leads to the Most Accurate Estimates of Statistical Characteristics: Fuzzy-Motivated Approach

G Xiang; S Ferson; L Ginzburg; L Longpré; E Mayorga; O Kosheleva

doi:10.1109/IFSA-NAFIPS.2013.6608471

Author manuscript; available in PMC: 2014 Sep 1.

Published in final edited form as: Proc IFSA World Congr. 2013:611–616. doi: 10.1109/IFSA-NAFIPS.2013.6608471


PMCID: PMC4150686 NIHMSID: NIHMS479056 PMID: 25187183

Abstract

To preserve privacy, the original data points (with exact values) are replaced by boxes containing each (inaccessible) data point. This privacy-motivated uncertainty leads to uncertainty in the statistical characteristics computed based on this data. In a previous paper, we described how to minimize this uncertainty under the assumption that we use the same standard statistical estimates for the desired characteristics. In this paper, we show that we can further decrease the resulting uncertainty if we allow fuzzy-motivated weighted estimates, and we explain how to optimally select the corresponding weights.

I. Formulation of the Problem

Need to preserve privacy

In many practical applications, e.g., in medicine and in education, to better serve customers, it is important to know as much as possible about the potential customers. Customers are often reluctant to share information, since this information can be potentially used against them. For example, age can be used by companies to (unlawfully) discriminate against older job applicants. It is thus important to preserve privacy when storing customer data; see, e.g., [6].

How to preserve privacy: k-anonymity and ℓ-diversity

To maintain privacy, we divide the space of all possible combinations of values x = (x₁, …, x_n) into boxes

B = [{x̃}_{1} - Δ_{1} (x), {x̃}_{1} + Δ_{1} (x)] \times \dots \times [{x̃}_{n} - Δ_{n} (x), {x̃}_{n} + Δ_{n} (x)] .

(1)

For each record, instead of storing the actual values x_i, we only store the label of the box B containing x.

To avoid further loss of privacy, it is important to make sure that location in a box does not identify a person. This is usually achieved by requiring that for some fixed integer k, each box contains at least k records.

It is also not good if all records within a box have the same value of an i-th quantity x_i. It is thus required that for some integer ℓ, each box should contain at least ℓ different values of each x_i; see, e.g., [1].

Statistical data processing

Based on the available data points $x^{(p)} = (x_{1}^{(p)}, \dots, x_{n}^{(p)})$ (1 ≤ p ≤ N), we need to estimate averages E_i, variances $V_{i} = σ_{i}^{2}$ , covariances C_ij, correlations ρ_ij, and other statistical characteristics. The means are usually estimated as follows:

E_{i} = \frac{1}{N} \cdot \sum_{p = 1}^{N} x_{i}^{(p)}, E_{j} = \frac{1}{N} \cdot \sum_{p = 1}^{N} x_{j}^{(p)} .

(2)

The covariance is usually estimated as:

C_{i j} = \frac{1}{N} \cdot \sum_{p = 1}^{N} (x_{i}^{(p)} - E_{i}) \cdot (x_{j}^{(p)} - E_{j}) .

(3)

The variance is usually estimated by a formula

V_{i} = \frac{1}{N - 1} \cdot \sum_{p = 1}^{N} {(x_{i}^{(p)} - E_{i})}^{2}

(4)

or, sometimes,

V_{i} = \frac{1}{N} \cdot \sum_{p = 1}^{N} {(x_{i}^{(p)} - E_{i})}^{2},

(5)

and the correlation is estimated as

ρ_{i j} = \frac{C_{i j}}{σ_{i} \cdot σ_{j}} .

(6)

Comment. We are interested in large databases, in which the number N of records is large. For large N, the difference between the usual unbiased estimate for V_i (with N − 1 in the denominator) and the estimate with N is negligible. To simplify computations, in this paper, by V_i and σ_i, we will mean the versions corresponding to N; our results can easily be reformulated for the unbiased estimates, which in our terms take the form $V_{i} \cdot \frac{N}{N - 1}$ and $σ_{i} \cdot \frac{\sqrt{N}}{\sqrt{N - 1}}$ .
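As an illustration, the estimates (2)-(6) can be sketched in a few lines of Python. This is a minimal sketch and not code from the paper; the function names and the toy records are hypothetical, and the correlation uses the N-normalised variances, following the convention adopted in the Comment.

```python
import math

def mean(values):
    """Estimate (2): arithmetic mean."""
    return sum(values) / len(values)

def covariance(xs, ys):
    """Estimate (3), normalised by N."""
    ex, ey = mean(xs), mean(ys)
    return sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / len(xs)

def variance(xs, unbiased=False):
    """Estimate (4) when unbiased=True (N - 1), otherwise estimate (5) (N)."""
    ex = mean(xs)
    total = sum((x - ex) ** 2 for x in xs)
    return total / (len(xs) - 1 if unbiased else len(xs))

def correlation(xs, ys):
    """Estimate (6), using the N-normalised variances."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

# Hypothetical toy records (age, income in thousands); not data from the paper.
ages = [30.0, 40.0, 50.0, 60.0]
incomes = [35.0, 45.0, 50.0, 70.0]
print(mean(ages), variance(ages), variance(ages, unbiased=True))
print(covariance(ages, incomes), correlation(ages, incomes))
```

With only four records the two variance estimates differ noticeably; the gap shrinks as N grows, which is the situation the paper is concerned with.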

In statistical data processing, privacy leads to uncertainty

To maintain privacy, we replace each numerical value $x_{i}^{(p)}$ with the corresponding interval. Different values from these intervals lead, in general, to different values of the resulting statistical characteristics. Hence, for each characteristic, we get a whole interval of possible values.

If this interval is too wide, the resulting range is useless (e.g., for correlation, the interval [−1, 1] is useless). It is therefore desirable to select, among all possible subdivisions into boxes which preserve k-anonymity (and ℓ-diversity), the one which leads to the narrowest intervals for the desired statistical characteristic.
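For the simplest characteristic, the mean, the interval of possible values can be computed exactly, because the mean is increasing in every record. The sketch below illustrates this under that assumption; the function name and the numbers are hypothetical.

```python
def mean_interval(centers, half_widths):
    """Range of the mean when record p lies in [center - hw, center + hw]."""
    n = len(centers)
    lower = sum(c - d for c, d in zip(centers, half_widths)) / n
    upper = sum(c + d for c, d in zip(centers, half_widths)) / n
    return lower, upper

# Hypothetical boxes: three ages known only up to +/- 5 years.
print(mean_interval([30.0, 40.0, 50.0], [5.0, 5.0, 5.0]))
```

The width of the resulting interval is twice the average half-width, so narrower boxes translate directly into a narrower interval for the mean.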

What we do in this paper

In Section 2, following [7], we describe how this problem is solved now. Please note that because our objective is to generalize these formulas to the weighted case, the notation that we use in Section 2 is slightly different from the notation in [7].

Then, in Section 3, we explain how fuzzy-motivated ideas can improve the corresponding estimates.

II. How This Problem Is Solved Now

Estimating accuracy caused by privacy-based subdivision into boxes: case of k-anonymity

To minimize uncertainty, we select the smallest boxes. Hence, each box B should have exactly k records.

For each combination of values $x_{i}^{(p)}$ from the corresponding intervals $[{x̃}_{i}^{(p)} - Δ_{i}^{(p)}, {x̃}_{i}^{(p)} + Δ_{i}^{(p)}]$ , we get:

C (x_{1}^{(1)}, x_{2}^{(1)}, \dots, x_{n}^{(N)}) = C ({x̃}_{1}^{(1)} + Δ x_{1}^{(1)}, {x̃}_{2}^{(1)} + Δ x_{2}^{(1)}, \dots, {x̃}_{n}^{(N)} + Δ x_{n}^{(N)}),

(7)

where each difference $Δ x_{i}^{(p)} \overset{def}{=} x_{i}^{(p)} - {x̃}_{i}^{(p)}$ satisfies the inequality $| Δ x_{i}^{(p)} | \leq Δ_{i}^{(p)}$ . When we have many records, boxes are small, so we can use a linear approximation:

C = C̃ + \sum_{p = 1}^{N} \sum_{i = 1}^{n} \frac{\partial C}{\partial x_{i}} \cdot Δ x_{i}^{(p)},

(8)

where $C̃ \overset{def}{=} C ({x̃}_{1}^{(1)}, {x̃}_{2}^{(1)}, \dots, {x̃}_{n}^{(N)})$ . The range of this linear expression is [C̃ − Δ, C̃ + Δ], where

Δ \overset{def}{=} \sum_{p = 1}^{N} \sum_{i = 1}^{n} | \frac{\partial C}{\partial x_{i}} | \cdot Δ_{i}^{(p)} = k \cdot \sum_{B} \sum_{x \in B} \sum_{i = 1}^{n} | \frac{\partial C}{\partial x_{i}} | \cdot Δ_{i} (x) .

(9)

Expressions for the corresponding partial derivatives

The estimate for the accuracy Δ is described in terms of partial derivatives $\frac{\partial C}{\partial x_{i}}$ of the statistical characteristic C. For the mean E_i, the derivative is equal to

\frac{\partial E_{i}}{\partial x_{i}} = \frac{1}{N} .

(10)

For the variance V_i, we have

\frac{\partial V_{i}}{\partial x_{i}} = \frac{2 \cdot (x_{i} - E_{i})}{N} .

(11)

Therefore, for $σ_{i} = \sqrt{V_{i}}$ , we get

\frac{\partial σ_{i}}{\partial x_{i}} = \frac{x_{i} - E_{i}}{N \cdot σ_{i}} .

(12)

For the covariance C_ij, we have

\frac{\partial C_{i j}}{\partial x_{i}} = \frac{x_{j} - E_{j}}{N} .

(13)

For the correlation ρ_ij, we have:

\frac{\partial ρ_{i j}}{\partial x_{i}} = \frac{1}{N} \cdot \frac{(x_{j} - E_{j}) - \frac{C_{i j}}{σ_{i}^{2}} \cdot (x_{i} - E_{i})}{σ_{i} \cdot σ_{j}} .

(14)

For all these characteristics C, the derivative takes the form

\frac{\partial C}{\partial x_{i}} = \frac{1}{N} \cdot b_{i} (x)

(15)

for some expression b_i(x).
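The expressions (10)-(14) can be checked numerically. The following sketch, written for two variables x and y (standing in for x_i and x_j), computes the coefficients b_i(x) from (15) for one record and compares each of them with a central finite difference. The helper names, the step h, and the toy data are assumptions made only for this illustration.

```python
import math

def characteristics(xs, ys):
    """N-normalised estimates for the pair of variables (x, y)."""
    n = len(xs)
    ex, ey = sum(xs) / n, sum(ys) / n
    vx = sum((x - ex) ** 2 for x in xs) / n
    vy = sum((y - ey) ** 2 for y in ys) / n
    cxy = sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / n
    return {"mean": ex, "variance": vx, "sigma": math.sqrt(vx),
            "covariance": cxy, "correlation": cxy / math.sqrt(vx * vy)}

def b_coefficients(xs, ys, p):
    """b_i(x) from (15) for record p, differentiating with respect to xs[p]."""
    n = len(xs)
    c = characteristics(xs, ys)
    ey = sum(ys) / n
    sx = c["sigma"]
    sy = math.sqrt(sum((y - ey) ** 2 for y in ys) / n)
    dx, dy = xs[p] - c["mean"], ys[p] - ey
    return {"mean": 1.0,                                   # from (10)
            "variance": 2.0 * dx,                          # from (11)
            "sigma": dx / sx,                              # from (12)
            "covariance": dy,                              # from (13)
            "correlation": (dy - c["covariance"] / sx ** 2 * dx) / (sx * sy)}  # (14)

def numeric_derivative(name, xs, ys, p, h=1e-6):
    """Central finite difference of one characteristic with respect to xs[p]."""
    up, down = list(xs), list(xs)
    up[p] += h
    down[p] -= h
    return (characteristics(up, ys)[name] - characteristics(down, ys)[name]) / (2 * h)

# Hypothetical toy data.
xs = [30.0, 40.0, 50.0, 60.0]
ys = [35.0, 45.0, 50.0, 70.0]
for name, b in b_coefficients(xs, ys, p=0).items():
    print(name, b / len(xs), numeric_derivative(name, xs, ys, p=0))
```

Each printed pair should agree to several digits, which is a quick sanity check that the derivative indeed has the form (1/N) · b_i(x).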

Towards optimal subdivision into boxes

The overall expression for Δ is a sum of terms corresponding to different points. So, to minimize Δ, we must, for each point, minimize the corresponding term

\sum_{i = 1}^{n} | \frac{\partial C}{\partial x_{i}} | \cdot Δ_{i} (x) .

(16)

Because of the relation between the partial derivatives and b_i(x), this minimization is equivalent to minimizing the term $\sum_{i = 1}^{n} a_{i} (x) \cdot Δ_{i} (x)$ , where we denoted $a_{i} (x) \overset{def}{=} | b_{i} (x) |$ .

The only constraint on the values Δ_i(x) is that the corresponding box should contain exactly k different points. The number of points can be obtained by multiplying the data density ρ(x) by the box volume $\prod_{i = 1}^{n} (2 Δ_{i} (x))$ . The data density can be estimated based on the data. So, we minimize the expression

\sum_{i = 1}^{n} a_{i} (x) \cdot Δ_{i} (x)

(17)

under the constraint

ρ (x) \cdot 2^{n} \cdot \prod_{i = 1}^{n} Δ_{i} (x) = k .

(18)

(Asymptotically) optimal subdivision into boxes (case of k-anonymity)

The Lagrange multiplier technique leads to

Δ_{i} (x) = \frac{c (x)}{a_{i} (x)},

(19)

for some c(x). From the constraint (18), we get

c (x) = \frac{1}{2} \cdot \sqrt[n]{\frac{k}{ρ (x)} \cdot \prod_{j = 1}^{n} a_{j} (x)} .

(20)

This means that around each point x, we need to select the box with half-widths

Δ_{i} (x) = \frac{1}{2} \cdot \sqrt[n]{\frac{k}{ρ (x)}} \cdot \frac{\sqrt[n]{\prod_{j = 1}^{n} a_{j} (x)}}{a_{i} (x)} .

(21)

The resulting accuracy is equal to

Δ = \frac{n}{N} \cdot \sum_{x} c (x),

(22)

where the sum is taken over all N data points x.
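Formulas (20) and (21) translate directly into code. The sketch below assumes that the coefficients a_i(x) are all strictly positive and that the local density ρ(x) has already been estimated; the function name and the numerical values are hypothetical.

```python
import math

def optimal_half_widths(a, density, k):
    """Return c(x) from (20) and the half-widths Delta_i(x) from (21).

    a       -- list of coefficients a_i(x) > 0 at the point x
    density -- estimated data density rho(x) at the point x
    k       -- anonymity parameter (records per box)
    """
    n = len(a)
    c = 0.5 * (k / density * math.prod(a)) ** (1.0 / n)
    return c, [c / a_i for a_i in a]

# Hypothetical values: two variables, density 25 records per unit area, k = 4.
c, delta = optimal_half_widths([1.0, 4.0], density=25.0, k=4)
print(c, delta)
# Constraint (18): rho * 2^n * prod(Delta_i) should reproduce k.
print(25.0 * 2 ** 2 * math.prod(delta))
```

Note that the variable with the larger coefficient a_i(x) receives the narrower side of the box, and that the per-point contribution ∑ a_i(x) · Δ_i(x) equals n · c(x).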

We need to dismiss rare points

In many practical situations, we have rare points, for which the smallest box containing k of them is huge. Such a big-size box will contribute a large amount of uncertainty to Δ; so we should dismiss such rare points.

If we select a subset S ⊂ {1, 2, …, N} of the set of N original points, then the privacy-related uncertainty reduces to

\frac{n}{# (S)} \cdot \sum_{x \in S} c (x),

(23)

where #(S) denotes the number of points in the set S. The statistical accuracy reduces to

\frac{A}{\sqrt{# (S)}}

(24)

(see, e.g., [5]). Minimizing the sum

\frac{n}{# (S)} \cdot \sum_{x \in S} c (x) + \frac{A}{\sqrt{# (S)}}

(25)

leads to selecting all x with c(x) ≤ c₀, where c₀ minimizes the sum

\frac{n}{# \{x : c (x) \leq c_{0}\}} \cdot \sum_{x : c (x) \leq c_{0}} c (x) + \frac{A}{\sqrt{# \{x : c (x) \leq c_{0}\}}} .

(26)
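Since only the set {x : c(x) ≤ c₀} matters, the threshold c₀ can be found by sorting the values c(x) and evaluating the sum (26) for each candidate threshold. The sketch below is one possible way to do this, not necessarily the procedure used in [7]; the constant A is assumed to be known, and the function name and data are hypothetical.

```python
import math

def best_threshold(c_values, n, A):
    """Return (objective, c0, kept) minimising expression (26).

    Candidates for c0 are the distinct values of c(x); 'kept' is the number
    of points with c(x) <= c0.
    """
    ordered = sorted(c_values)
    best = None
    running_sum = 0.0
    for m, c in enumerate(ordered, start=1):
        running_sum += c
        # Points with equal c(x) are kept or dismissed together.
        if m < len(ordered) and ordered[m] == c:
            continue
        objective = n * running_sum / m + A / math.sqrt(m)
        if best is None or objective < best[0]:
            best = (objective, c, m)
    return best

# Hypothetical values: three ordinary points and one rare point.
print(best_threshold([1.0, 1.0, 1.0, 10.0], n=1, A=2.0))
```

In this toy example the rare point with c(x) = 10 is dismissed: keeping it would raise the privacy-related term by more than it would lower the statistical term.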

Examples

For estimating the mean E_i, we have a_i(x) = const and thus,

c (x) = \text{const} \cdot \frac{1}{\sqrt[n]{ρ (x)}} .

(27)

In this case, c(x) is a decreasing function of density. So, dismissing points with c(x) > c₀ is equivalent to dismissing all the points with ρ(x) < ρ₀ (for some ρ₀).

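For computing covariance C_ij, the derivative with respect to x_i is proportional to x_j − E_j, and the derivative with respect to x_j is proportional to x_i − E_i. Thus, the product a_i(x) · a_j(x) is proportional to |x_i − E_i| · |x_j − E_j|. So, the upper threshold c₀ on c(x) is equivalent to the lower threshold on the ratio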

\frac{ρ (x)}{| x_{i} - E_{i} | \cdot | x_{j} - E_{j} |} .

(28)

Hence, we can also use points x with small ρ(x), provided that x_i or x_j is close to the corresponding mean. Using extra points x improves accuracy.

How to also take into account ℓ-diversity

Up to now, we only took into account the k-anonymity requirement. We also need to take into account that within each box, for each variable x_i, there are ≥ ℓ different values of x_i. To formalize this requirement, we first need to describe what “different” means.

Usually, for each variable i, different means that

| x_{i} - x_{i}^{'} | \geq ε_{i}

(29)

for some threshold ε_i. Thus, ℓ different values means that 2Δ_i(x) ≥ ℓ · ε_i. So, the problem is to find Δ_i(x) such that

\sum_{i = 1}^{n} a_{i} (x) \cdot Δ_{i} (x) \to min

(30)

under the constraints

\prod_{i = 1}^{n} Δ_{i} (x) \geq \frac{k}{2^{n} \cdot ρ (x)}

(31)

and

2 Δ_{i} (x) \geq ℓ \cdot ε_{i}

(32)

for all i.

According to [7], the solution to this optimization problem is as follows: around each point x, we first compute the values

Δ_{i} (x) = \frac{1}{2} \cdot \sqrt[n]{\frac{k}{ρ (x)}} \cdot \frac{\sqrt[n]{\prod_{j = 1}^{n} a_{j} (x)}}{a_{i} (x)} .

(33)

If 2Δ_i(x) ≥ ℓ · ε_i for all i, we select Δ_i(x). Otherwise, we sort the quantities by a_i(x) · ε_i:

a_{1} (x) \cdot ε_{1} \geq a_{2} (x) \cdot ε_{2} \geq \dots \geq a_{n} (x) \cdot ε_{n} .

(34)

Then, for each t from 1 to n, we compute

c_{t} = \frac{1}{2} \cdot {(\frac{k \cdot \prod_{i = t + 1}^{n} a_{i} (x)}{ρ (x) \cdot ℓ^{t} \cdot \prod_{i = 1}^{t} ε_{i}})}^{1 / (n - t)} .

(35)

For each t, if $\frac{2 c_{t}}{ℓ} \geq a_{t + 1} (x) \cdot ε_{t + 1}$ , we compute

Δ (t) \overset{def}{=} \frac{1}{2} \cdot ℓ \cdot \sum_{i = 1}^{t} a_{i} (x) \cdot ε_{i} + (n - t) \cdot c_{t} .

(36)

We select t for which Δ(t) is the smallest, and take:

$Δ_{i} (x) = \frac{1}{2} \cdot ℓ \cdot ε_{i}$ for i ≤ t, and
$Δ_{i} (x) = \frac{c_{t}}{a_{i} (x)}$ for i > t.

Comment. The computation time of this algorithm is quadratic in n. This is OK, since the number n of different characteristics is usually reasonably small. What is important is that the algorithm is still linear-time in terms of the number of records N.
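The procedure (33)-(36) can be sketched as follows. This is an illustrative reading of the steps above rather than the implementation from [7]. Two points are assumptions of the sketch: formula (35) is undefined for t = n, so the case in which every side sits at its minimum ℓ · ε_i / 2 is handled separately by checking the volume constraint (31) directly; and an error is raised if no admissible t is found. All names and numbers are hypothetical.

```python
import math

def half_widths_with_diversity(a, eps, density, k, ell):
    """Half-widths Delta_i(x) satisfying k-anonymity and ell-diversity."""
    n = len(a)
    # Unconstrained solution (33).
    c = 0.5 * (k / density * math.prod(a)) ** (1.0 / n)
    delta = [c / a_i for a_i in a]
    if all(2.0 * d >= ell * e for d, e in zip(delta, eps)):
        return delta
    # Ordering (34): indices by decreasing a_i(x) * eps_i.
    order = sorted(range(n), key=lambda i: a[i] * eps[i], reverse=True)
    best = None  # (cost, t, c_t)
    for t in range(1, n):
        fixed, free = order[:t], order[t:]
        # Formula (35).
        c_t = 0.5 * (k * math.prod(a[i] for i in free)
                     / (density * ell ** t * math.prod(eps[i] for i in fixed))
                     ) ** (1.0 / (n - t))
        if 2.0 * c_t / ell >= a[free[0]] * eps[free[0]]:
            # Formula (36).
            cost = 0.5 * ell * sum(a[i] * eps[i] for i in fixed) + (n - t) * c_t
            if best is None or cost < best[0]:
                best = (cost, t, c_t)
    # Assumption of this sketch: the case t = n, all sides at their minimum.
    all_min = [0.5 * ell * e for e in eps]
    if density * 2 ** n * math.prod(all_min) >= k:
        cost = sum(a_i * d for a_i, d in zip(a, all_min))
        if best is None or cost < best[0]:
            return all_min
    if best is None:
        raise ValueError("no admissible t found")
    _, t, c_t = best
    result = [0.0] * n
    for rank, i in enumerate(order):
        result[i] = 0.5 * ell * eps[i] if rank < t else c_t / a[i]
    return result

# Hypothetical values: the first variable would get a box that is too narrow.
print(half_widths_with_diversity([400.0, 1.0], [1.0, 1.0],
                                 density=0.01, k=4, ell=2))
```

In the example, the first half-width is raised to ℓ · ε₁ / 2 = 1 and the second one shrinks accordingly, so that the box still contains k records.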

III. Fuzzy-Motivated Idea

Main idea

In [7], to improve the accuracy of the resulting estimate, we propose to ignore some data points while keeping other data points. In other words, we propose a crisp separation between data points that we keep and data points that we ignore. Fuzzy logic has taught us that in many cases, it is beneficial to replace such a crisp separation with a “fuzzy” one in which, instead of ignoring or keeping a data point, we take a data point with a certain degree; see, e.g., [2], [4], [8].

Implementing the idea

Specifically, instead of using the above formula for computing the statistical characteristics, in which all data points are treated equally, we assign a weight w(x) ≥ 0 to each data point so that ∑w(x) = 1, and use the weighted estimates for all the statistical characteristics:

E_{i} = \sum_{x} w (x) \cdot x_{i}, σ_{i}^{2} = \sum_{x} w (x) \cdot {(x_{i} - E_{i})}^{2},

(37)

C_{i j} = \sum_{x} w (x) \cdot (x_{i} - E_{i}) \cdot (x_{j} - E_{j}), ρ_{i j} = \frac{C_{i j}}{σ_{i} \cdot σ_{j}} .

(38)
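The weighted estimates (37)-(38) can be sketched as below. The function name and the toy data are hypothetical; the sketch normalises the weights itself so that ∑w(x) = 1 holds even if unnormalised degrees are passed in.

```python
import math

def weighted_statistics(xs, ys, weights):
    """Weighted estimates (37)-(38) for a pair of variables (x, y)."""
    total = sum(weights)
    w = [w_p / total for w_p in weights]          # enforce sum of weights = 1
    ex = sum(w_p * x for w_p, x in zip(w, xs))
    ey = sum(w_p * y for w_p, y in zip(w, ys))
    vx = sum(w_p * (x - ex) ** 2 for w_p, x in zip(w, xs))
    vy = sum(w_p * (y - ey) ** 2 for w_p, y in zip(w, ys))
    cxy = sum(w_p * (x - ex) * (y - ey) for w_p, x, y in zip(w, xs, ys))
    return {"E_x": ex, "E_y": ey, "V_x": vx, "V_y": vy,
            "C_xy": cxy, "rho_xy": cxy / math.sqrt(vx * vy)}

# Hypothetical toy data.
xs = [30.0, 40.0, 50.0, 60.0]
ys = [35.0, 45.0, 50.0, 70.0]
print(weighted_statistics(xs, ys, [1.0, 1.0, 1.0, 1.0]))   # equal weights
print(weighted_statistics(xs, ys, [1.0, 1.0, 1.0, 0.25]))  # last point down-weighted
```

Equal weights reproduce the usual N-normalised estimates, and a weight of 0 is the same as dismissing the point, so the crisp selection of Section 2 is a special case.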

Optimization problem

Our objective is to find the weights w(x) for which the resulting uncertainty is the smallest possible. Similarly to the crisp case, this uncertainty consists of two parts: the part coming from the privacy-motivated uncertainty and the part coming from the fact that the size is finite.

One can check that for privacy-motivated uncertainty, the corresponding derivatives $\frac{\partial C}{\partial x_{i}}$ are proportional to the weight w(x). For each box, we thus face the exact same optimization problem for finding the best sizes Δ_i(x) of the corresponding privacy-related box. As a result, for the overall privacy-motivated uncertainty, we get the expression $n \cdot \sum_{x} w (x) \cdot c (x)$ .

For the statistical part: if we simply estimate the variance of the estimate for the mean E_i = ∑w(x) · x_i, then, due to the fact that the variance of the sum of independent variables is equal to the sum of the variances, we conclude that the variance of this estimate is proportional to ∑ w²(x); see, e.g., [5]. Thus, the standard deviation of this estimate is proportional to

\sqrt{\sum_{x} w^{2} (x)} .

(39)

For the traditional equal-weight estimate, when

w (x) = \frac{1}{# (S)}

(40)

for all x, the proportionality coefficient becomes equal to the expression

\frac{1}{\sqrt{# (S)}}

(41)

that we used in Section 2.

One can check that, similarly, estimates for the accuracy of other statistical characteristics can be obtained from the estimates provided in Section 2 by replacing $\frac{1}{\sqrt{# (S)}}$ with $\sqrt{\sum_{x} w^{2} (x)}$ , i.e., this part is equal to

A \cdot \sqrt{\sum_{x} w^{2} (x)} .

(42)

Thus, to minimize the overall inaccuracy, we need to minimize the following sum:

n \cdot \sum_{x} w (x) \cdot c (x) + A \cdot \sqrt{\sum_{x} w^{2} (x)}

(43)

under the constraints $\sum_{x} w (x) = 1$ and w(x) ≥ 0.

Solving the resulting optimization problem: general idea

By applying the Lagrange multiplier method to the above constraint optimization problem, we can reduce this problem to the following unconstrained optimization problem:

n \cdot \sum_{x} w (x) \cdot c (x) + A \cdot \sqrt{\sum_{x} w^{2} (x)} - λ \cdot (\sum_{x} w (x) - 1) \to min,

(44)

for an appropriate Lagrange multiplier λ. Differentiating this objective function with respect to w(x) and equating the derivative to 0, we conclude that

n \cdot c (x) + \frac{A \cdot w (x)}{\sqrt{\sum_{y} w^{2} (y)}} - λ = 0,

(45)

i.e., that

w (x) = \frac{1}{A} \cdot (λ - n \cdot c (x)) \cdot \sqrt{\sum_{y} w^{2} (y)} .

(46)

To be more precise, since we require that w(x) ≥ 0, this formula only holds when n · c(x) ≤ λ; when n · c(x) > λ, we should get w(x) = 0.

Towards computing the auxiliary parameter λ

How can we find λ? By squaring both sides of this formula, we get

w^{2} (x) = \frac{1}{A^{2}} \cdot {(λ - n \cdot c (x))}^{2} \cdot \sum_{y} w^{2} (y) .

(47)

By adding left- and right-hand sides corresponding to different x, we get

\sum_{x} w^{2} (x) = \frac{1}{A^{2}} \cdot (\sum_{x} {(λ - n \cdot c (x))}^{2}) \cdot \sum_{y} w^{2} (y) .

(48)

Dividing both sides of this equality by $\sum_{x} w^{2} (x) = \sum_{y} w^{2} (y)$ , we conclude that

1 = \frac{1}{A^{2}} \cdot \sum_{x} {(λ - n \cdot c (x))}^{2},

(49)

i.e., that

\sum_{x} {(λ - n \cdot c (x))}^{2} - A^{2} = 0 .

(50)

This is a quadratic equation in terms of λ, namely:

Ñ \cdot λ^{2} - 2 λ \cdot n \cdot \sum_{x} c (x) + n^{2} \cdot \sum_{x} c^{2} (x) - A^{2} = 0,

(51)

where Ñ is the total number of points that we did not dismiss, i.e., for which n · c(x) < λ, and the sums are taken over all such points.

From this quadratic equation, we can find λ. Thus, we naturally arrive at the following iterative algorithm for computing λ.

Iterative algorithm for computing the auxiliary parameter λ

The goal of this algorithm is to find the threshold value λ, so that points x for which n · c(x) ≥ λ will be dismissed from our estimates (i.e., we would have w(x) = 0 for such points).

In the beginning, we do not have any reason to dismiss any values, so we start with a first approximation λ₀ that is large enough for no point to be dismissed (e.g., λ₀ = +∞).

On each iteration k, we start with the value λ_{k−1} obtained on the previous iteration, and compute the next approximation λ_k as follows.

First, we compute the total number Ñ of points x for which n · c(x) < λ_{k−1}.
Then, we compute the sums $\sum_{x} c (x)$ and $\sum_{x} c^{2} (x)$ over all such points.
Based on these values, we solve the quadratic equation (51) and find the next approximation λ_k.

We stop iterations when the process converges, i.e., when

λ_{k} = λ_{k - 1} .

(52)
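The iteration can be sketched as follows. Several choices here are assumptions of the sketch rather than statements from the text: the starting value is taken as +∞ so that no point is dismissed at first; of the two roots of the quadratic equation (51), the larger one is used, since we need λ ≥ n · c(x) for the points that are kept; and an error is raised when the discriminant is negative or when the iteration does not settle, two situations that the description above does not address. The function name and the numbers are hypothetical.

```python
import math

def solve_lambda(c_values, n, A, max_iter=1000):
    """Iterate on equation (51); return (lambda, kept_count, sum of kept c)."""
    lam = math.inf                      # assumption: nothing dismissed at first
    for _ in range(max_iter):
        kept = [c for c in c_values if n * c < lam]
        if not kept:
            raise ValueError("all points were dismissed")
        m = len(kept)                   # the count denoted N-tilde in (51)
        s1 = sum(kept)
        s2 = sum(c * c for c in kept)
        disc = (n * s1) ** 2 - m * (n * n * s2 - A * A)
        if disc < 0:
            raise ValueError("quadratic (51) has no real root for this set")
        new_lam = (n * s1 + math.sqrt(disc)) / m   # larger root (assumption)
        if new_lam == lam:              # stopping criterion (52)
            return lam, m, s1
        lam = new_lam
    raise RuntimeError("iteration did not converge")

# Hypothetical values of c(x) for three points, with n = 1 and A = 2.
print(solve_lambda([1.0, 2.0, 3.0], n=1, A=2.0))
```

In this toy run the third point is dismissed after the first step, and the returned λ satisfies (50) for the two remaining points: (λ − 1)² + (λ − 2)² = 4.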

Towards computing w(x)

We know, from the formula (46), that for those points for which n · c(x) < λ, we have

w (x) = K \cdot (λ - n \cdot c (x)),

(53)

for some constant K. To find K, we can use the fact that $\sum_{x} w (x) = 1$ . Substituting the expression (53) into this constraint, we conclude that

1 = K \cdot (Ñ \cdot λ - n \cdot \sum_{x} c (x)) .

(54)

Since we have already computed the values Ñ, λ, and $\sum_{x} c (x)$ when we computed λ, we can thus compute K.

So, we arrive at the following formula for computing the desired weights.

Formula for computing the optimal weights w(x)

By running the above iterative algorithm, we have computed the auxiliary value λ. In the process of computing λ, we have computed the values Ñ and $\sum_{x} c (x)$ , where the sum is taken over all the points x for which n · c(x) < λ.

Now, we compute

K = \frac{1}{Ñ \cdot λ - n \cdot \sum_{x} c (x)} .

(55)

The optimal weights can now be computed as follows:

when n · c(x) ≥ λ, the optimal weight is w(x) = 0;
when n · c(x) < λ, the optimal weight is equal to
$w (x) = K \cdot (λ - n \cdot c (x)) .$ (56)

Comment. As expected, the larger the uncertainty contribution c(x) from a point, the smaller the weight with which we take this point. When this contribution is large enough (i.e., larger than the threshold determined by the auxiliary parameter λ), we completely ignore such points.
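Once λ is known, the weights follow in one pass over the data. The sketch below takes the weight of each kept point proportional to λ − n · c(x), as in formula (46), and normalises so that the weights add up to 1; the function name and the example values are hypothetical, and λ is assumed to have been computed beforehand.

```python
def optimal_weights(c_values, n, lam):
    """Weights w(x): proportional to lam - n*c(x) for kept points, else 0."""
    kept = [c for c in c_values if n * c < lam]
    if not kept:
        raise ValueError("threshold dismisses every point")
    # Normalising constant K, chosen so that the weights sum to 1.
    K = 1.0 / (len(kept) * lam - n * sum(kept))
    return [K * (lam - n * c) if n * c < lam else 0.0 for c in c_values]

# Hypothetical values: c(x) for three points, n = 1, and a threshold lam
# that solves (50) for the two kept points when A = 2.
lam = (3.0 + 7.0 ** 0.5) / 2.0
weights = optimal_weights([1.0, 2.0, 3.0], n=1, lam=lam)
print(weights, sum(weights))
```

The point with the smallest c(x) gets the largest weight, the second point a smaller one, and the third point, which lies above the threshold, gets weight 0.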

IV. Boxes Appropriate for Several Different Characteristics

What we provided before

In the previous sections, we described how, for each statistical characteristic C, we can find the boxes (i.e., data anonymization) that lead to the most accurate estimate of this selected characteristic.

Remaining problem

In practice, we may need to compute the values of different statistical characteristics. The problem is that optimal boxes corresponding to different characteristics C are, in general, different.

For example, boxes that lead to the most accurate estimates Ẽ of the mean E may lead to very inaccurate estimates C̃_ij of the covariance C_ij, and vice versa.

Towards a possible solution to this problem

Based on the previous experience, we know how many times users were looking for values of different statistical characteristics; in other words, we know the probabilities $p_{C} \geq 0 (\sum_{C} p_{C} = 1)$ of looking for different characteristics C.

We also know what accuracy $Δ_{0}^{C}$ is desirable for estimating each characteristic C. For example, we may fix the same relative error for all estimates, and take, e.g., $Δ_{0}^{C} = 0.1 \cdot C̃$ if this relative error is 10%. Then, for each characteristic C, the accuracy of estimating this characteristic is better gauged not by the absolute accuracy Δ^C but rather by the ratio

q_{C} \overset{def}{=} \frac{Δ^{C}}{Δ_{0}^{C}}

(57)

describing how close we are to the desired accuracy.

In this situation, a reasonable idea is to minimize average quality

q \overset{def}{=} \sum_{C} p_{C} \cdot q_{C} .

(58)

Towards an algorithm

How can we solve the corresponding optimization problem? The objective function q has the form

q = \sum_{C} \frac{p_{C}}{Δ_{0}^{C}} \cdot Δ^{C},

(59)

i.e., the form

q = \sum_{C} \frac{p_{C}}{Δ_{0}^{C}} \cdot \sum_{p = 1}^{N} \sum_{i = 1}^{n} | \frac{\partial C}{\partial x_{i}} | \cdot Δ_{i}^{(p)} .

(60)

By changing the order of summation, we get an equivalent formula

q = \sum_{p = 1}^{N} \sum_{i = 1}^{n} (\sum_{C} \frac{p_{C}}{Δ_{0}^{C}} \cdot | \frac{\partial C}{\partial x_{i}} |) \cdot Δ_{i}^{(p)} .

(61)

This optimization problem is similar to the optimization problem corresponding to the case of a single statistical characteristic C, with the only difference that instead of the original partial derivatives $\frac{\partial C}{\partial x_{i}}$ , we use a weighted combination

\sum_{C} \frac{p_{C}}{Δ_{0}^{C}} \cdot | \frac{\partial C}{\partial x_{i}} |

(62)

of these derivatives.

In terms of the coefficients a_i(x) introduced in Section 2, this means that instead of using the values $a_{i}^{C} (x)$ corresponding to an individual characteristic C, we must use a linear combination of these values:

a_{i} (x) = \sum_{C} \frac{p_{C}}{Δ_{0}^{C}} \cdot a_{i}^{C} (x) .

(63)

Resulting algorithm

Use the same algorithm(s) as in Sections 2 and 3, except that instead of the values $a_{i}^{C}$ corresponding to an individual statistical characteristic C, we should use the values (63).
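The combination (63) is a weighted sum of the per-characteristic coefficients. A minimal sketch, with hypothetical names, probabilities, and target accuracies:

```python
def combined_coefficients(a_by_characteristic, probability, target_accuracy):
    """Coefficients a_i(x) from (63) at one point x.

    a_by_characteristic -- dict: characteristic name -> list of a_i^C(x)
    probability         -- dict: characteristic name -> p_C (summing to 1)
    target_accuracy     -- dict: characteristic name -> Delta_0^C
    """
    names = list(a_by_characteristic)
    n = len(a_by_characteristic[names[0]])
    return [sum(probability[C] / target_accuracy[C] * a_by_characteristic[C][i]
                for C in names)
            for i in range(n)]

# Hypothetical values for two characteristics and two variables.
a = {"mean": [1.0, 1.0], "covariance": [2.0, 0.5]}
p = {"mean": 0.75, "covariance": 0.25}
d0 = {"mean": 0.5, "covariance": 0.25}
print(combined_coefficients(a, p, d0))
```

The resulting list can then be used wherever the single-characteristic coefficients a_i(x) were used before, for instance when computing c(x) and the half-widths Δ_i(x).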

Acknowledgment

Support for this project was provided by the National Institutes of Health (NIH), through a Small Business Innovation Research grant (award number 1R43TR000173-01) to Applied Biomathematics, but the views and opinions expressed herein should not be construed to be those of the National Institutes of Health.

The authors are thankful to the anonymous referees for valuable suggestions.

Contributor Information

G. Xiang, Email: gxiang@sigmaxi.net, Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA.

S. Ferson, Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA

L. Ginzburg, Applied Biomathematics, 100 North Country Rd., Setauket, NY 11733, USA

L. Longpré, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

E. Mayorga, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA

O. Kosheleva, Email: olgak@utep.edu, University of Texas at El Paso, 500 W. University, El Paso, TX 79968, USA.

References

1. Ghinita G, Karras P, Kalnis P, Mamoulis N. A framework for efficient data anonymization under privacy and accuracy constraints. ACM Transactions on Database Systems. 2009;Vol. 34(No. 2) Article 9.
2. Klir GJ, Yuan B. Fuzzy Sets and Fuzzy Logic. Upper Saddle River, New Jersey: Prentice Hall; 1995.
3. Nguyen HT, Kreinovich V, Wu B, Xiang G. Computing Statistics under Interval and Fuzzy Uncertainty. Springer Verlag; 2012.
4. Nguyen HT, Walker EA. First Course In Fuzzy Logic. Boca Raton, Florida: CRC Press; 2006.
5. Sheskin DJ. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Florida: Chapman & Hall/CRC; 2007.
6. Sweeney L. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems. 2002;Vol. 10(No. 5):557–570.
7. Xiang G, Kreinovich V. Data anonymization that leads to the most accurate estimates of statistical characteristics; Proceedings of the IEEE Series of Symposia on Computational Intelligence SSCI’2013; April 16–19, 2013; Singapore. to appear.
8. Zadeh LA. Fuzzy sets. Information and Control. 1965;Vol. 8:338–353.
