Proc Natl Acad Sci U S A. 2020 Mar 31;117(15):8344–8352. doi: 10.1073/pnas.1914598117

Towards formalizing the GDPR’s notion of singling out

Aloni Cohen a,b,c,1,2, Kobbi Nissim d,1,2

Significance

This article addresses a gap between legal and technical conceptions of data privacy and demonstrates how it can be minimized. The article focuses on “singling out,” which is a concept appearing in the GDPR. Our analysis draws on the legislation, regulatory guidance, and mathematical reasoning to derive a technical concept—“predicate singling out”—aimed at capturing a core part of GDPR’s intent. Examination of predicate singling out sheds light on the concept of singling out and the question of whether existing technologies protect against such a threat. Conceptually, this work demonstrates the role that principled analysis supported by mathematical argument can and should play in articulating and informing public policy at the interface between law and technology.

Keywords: privacy, GDPR, singling out, differential privacy, k-anonymity

Abstract

There is a significant conceptual gap between legal and mathematical thinking around data privacy. The effect is uncertainty as to which technical offerings meet legal standards. This uncertainty is exacerbated by a litany of successful privacy attacks demonstrating that traditional statistical disclosure limitation techniques often fall short of the privacy envisioned by regulators. We define “predicate singling out,” a type of privacy attack intended to capture the concept of singling out appearing in the General Data Protection Regulation (GDPR). An adversary predicate singles out a dataset x using the output of a data-release mechanism M(x) if it finds a predicate p matching exactly one row in x with probability much better than a statistical baseline. A data-release mechanism that precludes such attacks is “secure against predicate singling out” (PSO secure). We argue that PSO security is a mathematical concept with legal consequences. Any data-release mechanism that purports to “render anonymous” personal data under the GDPR must prevent singling out and, hence, must be PSO secure. We analyze the properties of PSO security, showing that it fails to compose. Namely, a combination of more than logarithmically many exact counts, each individually PSO secure, facilitates predicate singling out. Finally, we ask whether differential privacy and k-anonymity are PSO secure. Leveraging a connection to statistical generalization, we show that differential privacy implies PSO security. However, and in contrast with current legal guidance, k-anonymity does not: There exists a simple predicate singling out attack under mild assumptions on the k-anonymizer and the data distribution.


Data-privacy laws—like the Health Insurance Portability and Accountability Act, the Family Educational Rights and Privacy Act (FERPA), and Title 13 in the United States; and the European Union’s (EU’s) General Data Protection Regulation (GDPR)—govern the use of sensitive personal information.* These laws delineate the boundaries of appropriate use of personal information and impose steep penalties upon rule breakers. To adhere to these laws, practitioners need to apply suitable controls and statistical disclosure-limitation techniques. Many commonly used techniques, including pseudonymization, k-anonymity, bucketing, rounding, and swapping, offer privacy protections that are seemingly intuitive, but only poorly understood. And while there is a vast literature of best practices, a litany of successful privacy attacks demonstrates that these techniques often fall short of the sort of privacy envisioned by legal standards (e.g., ref. 1).

A more disciplined approach is needed. However, the significant conceptual gap between legal and mathematical thinking around data privacy poses a challenge. Privacy regulations are grounded in legal concepts, such as personally identifiable information (PII), linkage, distinguishability, anonymization, risk, and inference. In contrast, much of the recent progress in data-privacy technology is rooted in formal mathematical privacy models, such as differential privacy (2), that offer a foundational treatment of privacy, with provable privacy guarantees, meaningful semantics, and composability. And while such models are being actively developed and implemented in the academy, industry, and government, there is a lack of discourse between the legal and mathematical conceptions. The effect is uncertainty as to which technical offerings adequately match expectations expressed in legal standards (3).

Bridging between Legal and Technical Concepts of Privacy

We aim to address this uncertainty by translating between the legal and the technical. To do so, we begin with a concept appearing in the law, then model some aspect of it mathematically. With the mathematical formalism in hand, we can better understand the requirements of the law, their implications, and the techniques that might satisfy them.

In particular, we study the concept of “singling out” from the GDPR (4). We examine what it means for a data-anonymization mechanism to ensure “security against singling out” in a data release. Preventing singling out attacks in a dataset is a necessary (but maybe not sufficient) precondition for a dataset to be considered effectively anonymized and thereby free from regulatory restrictions under the GDPR. Ultimately, our goal is to better understand a concept foundational to the GDPR, by enabling a rigorous mathematical examination of whether certain classes of techniques (i.e., k-anonymity and differential privacy) provide security against singling out.

We are not the first to study this issue. The Article 29 Data Protection Working Party, an EU advisory body, provided guidance about the use of various privacy technologies—including k-anonymity and differential privacy—as anonymization techniques (5). Its analysis is centered on asking whether each technology effectively mitigates three risks: “singling out, linkability, and inference.” For instance, their “Opinion on Anonymisation Techniques” concludes that with k-anonymity, singling out is no longer a risk, whereas with differential privacy, it “may not” be a risk (5). Though similar in purpose to our work, its technical analyses are informal and coarse. Revisiting these questions with mathematical rigor, we recommend reconsidering the Working Party’s conclusions.

This work is part of a larger effort to bridge between legal and technical conceptions of privacy. An earlier work analyzed the privacy requirements of FERPA and modeled them in a game-based definition, as is common in cryptography (6). The definition was used to argue that the use of differentially private analyses suffices for satisfying a wide range of interpretations of FERPA. An important feature of FERPA that enabled that analysis is that FERPA and its accompanying documents contain a rather detailed description of a privacy attacker and the attacker’s goals.

1. Singling Out in the GDPR

We begin with the text of the GDPR (4). It consists of articles detailing the obligations placed on processors of personal data, as well as recitals containing explanatory remarks. Article 1 of the regulation delineates its scope:

This regulation lays down rules relating to the protection of natural persons with regard to the processing of personal data and rules relating to the free movement of personal data.

On the other hand, the GDPR places no restrictions on the processing of nonpersonal data, which includes personal data that have been “rendered anonymous.”

The term “personal data” is defined in Article 4 of the GDPR to mean “any information relating to an identified or identifiable natural person; an identifiable natural person is one who can be identified, directly or indirectly.” What it means for a person to be “identified, directly or indirectly” is further elaborated in Recital 26:

To determine whether a natural person is identifiable, account should be taken of all of the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.

Thus, singling out is one way to identify a person in data, and only data that do not allow singling out may be excepted from the regulation.

Singling out is the only criterion for identifiability explicitly mentioned in the GDPR, the only occurrence of the term being the passage quoted above. For insight as to the meaning of singling out, we refer to two documents prepared by the Article 29 Data Protection Working Party. “Opinion on the Concept of Personal Data” (7) elaborates on the meaning of “identifiable, directly or indirectly.” A person is identified “within a group of persons [when] he or she is distinguished from all other members of the group.” One way of distinguishing a person from a group is by specifying “criteria which allows him to be recognized by narrowing down the group to which he belongs.” If the group is narrowed down to an individual, that individual has been singled out.§ Similarly, the Working Party’s “Opinion on Anonymisation Techniques” describes singling out as “isolat[ing] some or all records which identify an individual in [a] dataset.” Looking ahead, we will call this “isolating” an individual in the dataset and argue that not every instance of isolation should be considered a singling-out attack.

We highlight three additional insights from ref. 7 that inform our work. First, identification does not require a name or any other traditional identifier. For instance, singling out can be done with a “small or large” collection of seemingly innocuous traits (e.g., “the man wearing a black suit”) (7). Indeed, this is what is meant by indirectly identifiable. An example of singling out in practice, cited in the Working Party’s “Opinion on Anonymisation Techniques” (5), showed that four locations sufficed to uniquely identify 95% of people in a pseudonymized dataset of time-stamped locations. This is considered singling out, even without a method of linking the location traces to individuals’ names.

Second, identifiable data may come in many forms, including microdata, aggregate statistics, news articles, encrypted data, video footage, and server logs. What’s important is not the form of the data—it is whether the data permit an individual to be singled out. We apply this same principle to the manner in which an individual is singled out within a dataset. Most examples focus on specifying a collection of attributes (e.g., four time-stamped locations) that match a single person in the data. A collection of attributes can be viewed as corresponding to a “predicate”: a function that assigns to each person in the dataset a value 0 or 1 (interpreted as false or true, respectively). We interpret the regulation as considering data to be personal data if an individual can be isolated within a dataset using any predicate, not only those that correspond to collections of attributes.

Third, whether or not a collection of attributes identifies a person is context-dependent. “A very common family name will not be sufficient to identify someone—i.e., to single someone out—from the whole of a country’s population, while it is likely to achieve identification of a pupil in a classroom” (7). Both the prevalence of the name and the size of the group are important in the example and will be important in our formalization.

2. Main Findings

A. Defining Security Against Predicate Singling Out.

We formalize and analyze “predicate singling out,” a notion which is intended to model the GDPR’s notion of singling out. Following the discussion above, we begin with the idea that singling out an individual from a group involves specifying a predicate that uniquely distinguishes the individual, which we call “isolation.” Using this terminology, an intuitive interpretation of the GDPR is that rendering data anonymous requires preventing isolation. Trying to make this idea formal, we will see that it requires some refinement.

We restrict our attention to datasets $x = (x_1, \ldots, x_n)$ of size $n$, where each row $x_i$ is sampled independently from some underlying probability distribution $D$ over a universe $X$. The dataset $x$ is assumed to contain personal data corresponding to individuals, with at most one row per individual. For example, $x$ might consist of home listings, hospital records, internet browsing history, or any other personal information. A mechanism $M$ takes $x$ as input and outputs some data release $y = M(x)$, be it a map of approximate addresses, aggregate statistics about disease, the result of applying machine learning techniques, or pseudonymized internet histories. We call $M$ an “anonymization mechanism” because it purportedly anonymizes the personal data $x$.

An adversary $A$ attempts to output a predicate $p : X \to \{0,1\}$ that isolates a row in $x$, i.e., there exists $i \in [n]$ such that $p(x_i) = 1$ and $p(x_j) = 0$ for all $j \neq i$. We emphasize that it is rows in the original dataset $x$ on which the predicate acts, not the output $y$. In part, this is a by-product of our desire to make no assumptions on the form of $M$'s output. While it might make sense to apply a predicate to isolate a row in pseudonymized microdata, it is far from clear what it would mean for a synthetic dataset or for aggregate statistics. Observe that this choice also rules out predicates $p$ that “isolate” rows by referring to their position in $x$ (i.e., “the seventh row”).

$M$ prevents isolation if there doesn't exist an adversary $A$ that isolates a row in $x$, except with very small probability over the randomness of sampling $x \sim D^n$, the randomness of the mechanism $y \leftarrow M(x)$, and the randomness of the adversary $A(y)$. Unfortunately, this is impossible to achieve by any mechanism $M$. To wit, there are trivial adversaries—those that do not look at the outcome $y$—that isolate a row with probability approximately 37%. The trivial adversaries simply output any predicate $p$ that matches a $1/n$ fraction of the distribution $D$. Isolation is, hence, not necessarily indicative of a failure to protect against singling out, as a trivial adversary would succeed with 37% probability (for any $n$), even if $M$ does not output anything at all.

Example: Consider a dataset of size $n = 365$, containing information about people selected at random from the US population. To isolate a person, a trivial adversary may output $p =$ (born on March 15th). This predicate will isolate a row with probability

$$365 \cdot \frac{1}{365} \cdot \left(1 - \frac{1}{365}\right)^{364} \approx 37\%.$$

Furthermore, a trivial adversary need not know the distribution $D$ to isolate with probability $\approx 37\%$, as long as $D$ has sufficient min-entropy (Fact 3.1).
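
To make the baseline concrete, here is a minimal simulation (ours, not from the paper) of the trivial adversary in the birthday example, assuming birthdays are uniformly distributed:

import random

def trivial_isolation_rate(n=365, trials=100_000, seed=0):
    # Estimate how often the fixed predicate "born on day 0" (weight 1/n)
    # isolates exactly one row in a dataset of n uniformly random birthdays.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(n) for _ in range(n)]  # x ~ D^n, D uniform
        hits += birthdays.count(0) == 1                   # predicate matches exactly one row
    return hits / trials

print(trivial_isolation_rate())  # ~0.37, i.e., n * (1/n) * (1 - 1/n)^(n - 1) ~ 1/e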

A trivial adversary can give us a baseline against which to measure isolation success. But the baseline should not simply be a 37% chance of success. Consider the earlier example of a dataset of 365 random Americans. What if an adversary output predicates like $p =$ (born on March 15th $\wedge$ vegan $\wedge$ speaks Dutch $\wedge$ concert pianist), and managed to isolate 10% of the time? Though 10% is much less than 37%, the predicate is extremely specific and unlikely to isolate a person by chance.

We formalize this intuition by considering the baseline risk of isolation as a function of the weight of the predicate p: the chance that p matches a random row sampled from the distribution D. The baseline for predicates of weight 1/n is 37%, but the baseline for an extremely specific predicate may be much lower. The more specific the predicate, the closer the baseline gets to zero. Our primary focus in this paper is on the regime of predicate weights where the baseline is negligible, corresponding to predicates with negligible weight.

Definition 4.5 (PSO Security, Informal): An adversary predicate singles out a row in x if it outputs a predicate that isolates a row with probability significantly higher than the baseline risk. A mechanism M is “secure against predicate singling out” (PSO secure) if no adversary can use its output y=M(x) to predicate single out.

B. Analyzing Security Against Predicate Singling Out.

Having formulated PSO security, our next goal is to understand the guarantee it offers, what mechanisms satisfy it, and how this concept relates to existing privacy concepts, including differential privacy and k-anonymity.

Two desirable properties of a privacy concept are robustness to “postprocessing” and to “composition.” The former requires that if a mechanism M is deemed secure, then anything that can be computed using the outcome of M should also be deemed secure. Hence, the outcome may be reused without creating additional privacy risk. For instance, if a PSO-secure mechanism M outputs microdata, then any statistics that can be computed from those microdata should also be PSO secure. PSO security is robust to postprocessing. The analysis is simple and shown in ref. 9.

Incomposability of PSO Security.

We would like that the privacy risk of multiple data releases is not significantly greater than the accumulated risks of the individual releases. In this case, we say that the privacy concept composes. We prove that PSO security does not compose and give two examples of this failure. First, we show that releasing a single aggregate statistic is PSO secure, but superlogarithmically many statistics may allow an adversary to predicate single out. Concretely, a mechanism that outputs a single count is PSO secure. Yet a mechanism that outputs ω(log(n)) counts may allow an adversary to isolate a row with probability arbitrarily close to one by using a predicate with negligible weight (and negligible baseline). Second, we construct a less natural pair of mechanisms that individually are PSO secure but together allow an adversary to recover a row in the dataset. Consequently, the adversary can predicate single out by isolating this row using a predicate with negligible weight.

Do Differential Privacy and k-Anonymity Guarantee PSO Security?

Differential privacy is a requirement on data-analysis mechanisms that limits the dependency of a mechanism’s output distribution on any single individual’s contribution (10). $k$-anonymity is a requirement on data releases that the (quasi) identifying data of every person in the release should be identical to that of at least $k - 1$ other individuals in the release (11, 12) (see Definitions 6.1 and 7.1 for formal definitions).

Differential privacy is not necessary for PSO security, as an exact count is PSO secure, but is not differentially private. However, differential privacy does provide PSO security. The proof relies on the connection between differential privacy and statistical generalization guarantees (13, 14). We show that predicate singling out implies a form of overfitting to the underlying dataset. If a mechanism is differentially private, it prevents this form of overfitting and, hence, protects against predicate singling out.

On the other hand, we show that k-anonymity does not prevent predicate singling out attacks. Instead, it enables an adversary to predicate single out with probability approximately 37%, even using extremely low-weight predicates for which the baseline risk is negligible. Briefly, the attack begins by observing that typical k-anonymous algorithms “almost” predicate single out. They reveal simple predicates—usually, conjunctions of attributes—that are satisfied by groups of k rows in the dataset. In an effort to make the k-anonymized data as useful as possible, these predicates are as descriptive and specific as possible. To predicate single out a row from the k-anonymous dataset, it roughly suffices to isolate a row from within one of these groups.

C. Implication for the GDPR.

Precisely formalizing predicate singling out attacks allows us to examine with mathematical rigor the extent to which specific algorithms and paradigms protect against them. In particular, we show that k-anonymity fails to prevent predicate singling out, but that differential privacy prevents predicate singling out. Our conclusions contrast with those of the Article 29 Working Party: They conclude that k-anonymity eliminates the risk of singling out, while differential privacy “may not” (5). This disagreement may raise doubts about whether our modeling indeed matches the regulators’ intent, which we now address.

Our goal in interpreting the text of the GDPR and related documents, and in defining predicate singling out, is to provide a precise mathematical formalism to capture some aspect of the concept of personal data (as elucidated in the regulation and in ref. 7) and the associated concept of anonymization. We want to render mathematically falsifiable a claim that a given algorithmic technique anonymizes personal data under the GDPR by providing a necessary condition for such anonymizers.

We argue that predicate singling out succeeds. A number of modeling choices limit the scope of our definition, but limiting the scope poses no issue. Specifically, 1) we only consider randomly sampled datasets; 2) we only consider an attacker who has no additional knowledge of the dataset besides the output of a mechanism; and 3) we do not require that isolation be impossible, instead comparing against a baseline risk of isolation. A technique that purports to anonymize all personal data against all attackers must at least do so against randomly sampled data and against limited attackers. And unless the idea of anonymization mechanisms is completely vacuous, one must compare against a baseline risk.

We don’t mean to claim that our modeling is the only one possible. The starting point for our analysis is not a mathematical formalism but, rather, a (somewhat incomplete) description in natural language. Alternative mathematical formalizations of singling out could probably be extracted from the very same text, and we look forward to seeing them emerge.

Policy Implications.

Assuming our formalization is in agreement with the GDPR’s notion of singling out, the most significant consequence of our analysis is that $k$-anonymity does not prevent singling out (and likewise $\ell$-diversity and $t$-closeness). Thus, it is insufficient for rendering personal data anonymous and exempting them from GDPR regulation. If PSO security is a necessary condition for GDPR anonymization, then something more is required. On the other hand, differential privacy might provide the necessary level of protection. At least it is not ruled out by our analysis.

More abstractly, we believe that self-composition is an essential property of any reasonable privacy concept. Our finding that PSO security does not self-compose is evidence that self-composition should not be taken for granted, but be a criterion considered when developing privacy concepts.

One may still claim that the assessments made in ref. 5 should be taken as ground truth and that the Article 29 Working Party meant for any interpretation of singling out to be consistent with these assessments. According to this view, the protection provided by k-anonymity implicitly defines the meaning of singling out (partially or in full). Such a position would be hard to justify. To the best of our knowledge, the assessments made by the Article 29 WP were not substantiated by a mathematical analysis. Defining privacy implicitly as the guarantee provided by particular techniques is an approach proven to fail (1).

Is PSO Security a Good Privacy Concept?

A predicate singling out attack can be a stepping stone toward a greater harm, even in settings where isolation alone may not be. It may enable linking a person’s record in the dataset to some external source of information (15), or targeting individuals for differential treatment. As such, it is meaningful as a mode of privacy failure, both in the GDPR context and otherwise.

While we believe that PSO security is relevant for the GDPR as a necessary property of techniques that anonymize personal data, we do not consider it sufficiently protective by itself. First, singling out is only one mode of privacy failure. Many other failure modes have been considered, including reconstruction attacks, membership attacks, inference, and linkage. Second, our definition considers a setting where the underlying data are drawn from some (unknown) underlying distribution, an assumption that is not true in many real-life contexts. In such contexts, PSO security may not prevent singling out under the GDPR. Lastly, the incomposability of PSO security renders it inadequate in isolation and suggests that it should be complemented by other privacy requirements.

3. Notation

A dataset $x = (x_1, \ldots, x_n)$ consists of $n$ elements taken from the data universe $X = \{0,1\}^d$. We consider datasets where each entry $x_i$ is independently sampled from a fixed probability distribution $D$ over $X$. We denote by $U_d$ a random variable sampled uniformly at random from $\{0,1\}^d$.

A mechanism $M$ is a Turing machine that takes as input a dataset $x \in X^n$. Mechanisms may be randomized and interactive.

A predicate is a binary-valued function $p : X \to \{0,1\}$. We define the weight of a predicate $p$ to be $\mathrm{wt}_D(p) \triangleq \Pr_{x\sim D}[p(x) = 1]$. For a dataset $x \in X^n$, let $p(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} p(x_i)$. We occasionally use indicator notation $\mathbb{I}(\cdot)$ to define a predicate; for example, $p(x) = \mathbb{I}(x \in A)$ equals 1 if $x \in A$ and 0 otherwise.
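
The following small Python helpers (ours, not the paper’s) make the notation concrete: the dataset-level value $p(x)$, the isolation event $\mathrm{iso}(p,x)$, and a Monte Carlo estimate of $\mathrm{wt}_D(p)$:

from typing import Callable, Sequence, TypeVar

Row = TypeVar("Row")
Predicate = Callable[[Row], int]  # returns 0 or 1

def p_of_dataset(p: Predicate, x: Sequence[Row]) -> float:
    # p(x) = (1/n) * sum_i p(x_i): the fraction of rows matched by p.
    return sum(p(xi) for xi in x) / len(x)

def iso(p: Predicate, x: Sequence[Row]) -> bool:
    # iso(p, x): p matches exactly one row, i.e., p(x) = 1/n.
    return sum(p(xi) for xi in x) == 1

def weight_estimate(p: Predicate, samples: Sequence[Row]) -> float:
    # Monte Carlo estimate of wt_D(p) = Pr_{x~D}[p(x) = 1] from i.i.d. samples of D.
    return sum(p(s) for s in samples) / len(samples)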

For the purposes of asymptotic analyses, we use the number of rows $n$ in a dataset as the complexity parameter. We omit the dependence of $d = d(n)$ on $n$. A function $f(n)$ is negligible, denoted $f(n) = \mathrm{negl}(n)$, if for all $c > 0$, there exists $n_c > 0$ such that $f(n) \le n^{-c}$ for all $n > n_c$. Informally, this means that $f(n)$ approaches 0 faster than any inverse polynomial function for large enough $n$.

Many of our results apply to all data distributions with sufficient “min-entropy,” a measure of randomness useful for cryptography. The min-entropy of a probability distribution $D$ over a universe $X$ is $H_\infty(D) = -\log \max_{y\in X} \Pr_{x\sim D}[x = y]$. The relevant results apply to all distributions with even a moderate amount of min-entropy: $H_\infty(D) \ge \omega(\log(n)) + \log(1/\alpha)$ for some $\alpha = \mathrm{negl}(n)$ (e.g., $H_\infty(D) \ge \log^{1+c}(n)$ for $c > 0$). We will call such distributions “min-entropic.”

Fact 3.1. For any min-entropic $D$ and any weight $w \in [0,1]$, there exist predicates $p^-$, $p^+$ and a negligible function $\mathrm{negl}(n)$ such that $\mathrm{wt}_D(p^-) \in [w - \mathrm{negl}(n), w]$ and $\mathrm{wt}_D(p^+) \in [w, w + \mathrm{negl}(n)]$ [using the Leftover Hash Lemma (16)].
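
As an aside, a simple heuristic way to realize a predicate of roughly a target weight in code is to hash each universe element with a public hash function and accept when the hash lands in a $w$-fraction of its range; in this sketch (ours), SHA-256 stands in for the universal hash family used in the actual Leftover Hash Lemma argument:

import hashlib

def predicate_of_weight(w: float, salt: bytes = b"public-salt"):
    # Returns p(x) = I(H(salt || x) < w * 2^32). Under any distribution with
    # enough min-entropy, wt_D(p) is close to w (heuristic stand-in for Fact 3.1).
    cutoff = int(w * 2**32)
    def p(x: bytes) -> int:
        h = int.from_bytes(hashlib.sha256(salt + x).digest()[:4], "big")
        return int(h < cutoff)
    return p

p = predicate_of_weight(1 / 365)  # matches roughly a 1/365 fraction of high-entropy data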

4. Security Against Predicate Singling Out

Consider a data controller who has in its possession a dataset $x = (x_1, \ldots, x_n)$ consisting of $n$ rows sampled independently from a probability distribution $D$. We think of the dataset as containing the personal data of $n$ individuals, one per row. The data controller publishes the output of an anonymization mechanism $M$ applied to the dataset $x$. A predicate singling out adversary $A$ is a nonuniform probabilistic Turing machine with access to the published output $M(x)$; it produces a predicate $p : X \to \{0,1\}$. We abuse notation and write $A(M(x))$, regardless of whether $M$ is an interactive or noninteractive mechanism. For now, we assume that all adversaries have complete knowledge of $D$ and are computationally unbounded.**

The Article 29 Working Party describes singling out as “isolat[ing] some or all records which identify an individual in [a] dataset” (5). This is done by “narrowing down [to a singleton] the group to which [the individual] belongs” by specifying “criteria which allows him to be recognized” (7). We call this row isolation.

Definition 4.1 (Row Isolation): A predicate $p$ isolates a row in a dataset $x$ if there exists a unique row $x_i \in x$ such that $p(x_i) = 1$; that is, if $p(x) = 1/n$. We denote this event $\mathrm{iso}(p, x)$.

It is tempting to require that an anonymization mechanism $M$ only allows an adversary to isolate a row with negligible probability, but this intuition is problematic. An adversary that does not have access to $M$—a trivial adversary—can output a predicate $p$ with $\mathrm{wt}_D(p) \approx 1/n$ (by Fact 3.1) and thereby isolate a row in $x$ with probability

$$n \cdot \mathrm{wt}_D(p)\,\big(1 - \mathrm{wt}_D(p)\big)^{n-1} \approx \left(1 - \frac{1}{n}\right)^{n-1} \approx \frac{1}{e} \approx 37\%.$$

In Bounding the Baseline, we will see that, in many cases, the trivial adversary need not know the distribution to produce such a predicate.

A. Security Against Predicate Singling Out.

We first define a measure of an adversary A’s success probability in isolating a row given the output of a mechanism M while being restricted to output a predicate p of at most a given weight w.

Definition 4.2 (Adversarial Success Probability): Let $D$ be a distribution over $X$. For a mechanism $M$, an adversary $A$, a dataset size $n \in \mathbb{N}$, and a weight $w \le 1/n$, let

$$\mathrm{Succ}_w^{A,M}(n,D) \triangleq \Pr_{\substack{x\sim D^n\\ p\leftarrow A(M(x))}}\big[\mathrm{iso}(p,x) \wedge \mathrm{wt}_D(p) \le w\big].$$

Instead of considering the absolute probability that an adversary isolates a row, we consider the increase in probability relative to a baseline risk: the probability of isolation by a trivial adversary, T.

Definition 4.3 (Trivial Adversary): A predicate singling out adversary $T$ is trivial if the distribution over outputs of $T$ is independent of $M(x)$. That is, $T(M(x)) = T(\perp)$.

Definition 4.4 (Baseline): For $n \in \mathbb{N}$ and weight $w \le 1/n$,

$$\mathrm{base}_D(n,w) \triangleq \max_{\text{trivial } T} \mathrm{Succ}_w^{T,\perp}(n,D).$$

The baseline lets us refine the Working Party’s conception of singling out as row isolation. We require that no adversary should have significantly higher probability of isolating a row than a trivial adversary, when both output predicates of weight less than w.

Definition 4.5 (Security Against Predicate Singling Out): For $\varepsilon(n) \ge 0$, $\delta(n) \ge 0$, and $w_{\max}(n) \le 1/n$, we say a mechanism $M$ is $(\varepsilon, \delta, w_{\max})$-secure against predicate singling out ($(\varepsilon, \delta, w_{\max})$-PSO secure) if for all $A$, $D$, $n$, and $w \le w_{\max}$:

$$\mathrm{Succ}_w^{A,M}(n,D) \le e^{\varepsilon(n)} \cdot \mathrm{base}_D(n,w) + \delta(n).$$

We often omit explicit reference to the parameter n for ε, δ, wmax, and to the distribution D when it is clear from context.

The definition is strengthened as ε and δ get smaller. As shown next, base(n,w)=negl(n) when w=negl(n). This is the most important regime of Definition 4.5, as such predicates not only isolate a row in the dataset, but likely also isolate an individual in the entire population.

We say a mechanism is PSO secure if for all $w_{\max} = \mathrm{negl}(n)$ there exists $\delta = \mathrm{negl}(n)$ such that $M$ is $(0, \delta, w_{\max})$-PSO secure. Observe that for all $\varepsilon = O(\log n)$, $\delta = \mathrm{negl}(n)$, and $w_{\max} \le 1/n$, $(\varepsilon, \delta, w_{\max})$-PSO security implies PSO security.††

B. Bounding the Baseline.

In this section, we characterize the baseline over intervals in terms of a simple function $B(n,w)$. For $n \ge 2$ and a predicate $p$ of weight $w$, the probability over $x \sim D^n$ that $p$ isolates a row in $x$ is

$$B(n,w) \triangleq n \cdot w \cdot (1-w)^{n-1}.$$

$B(n,w)$ is maximized at $w = 1/n$ and strictly decreases moving away from the maximum. $(1 - 1/n)^{n-1}$ approaches $e^{-1}$ as $n \to \infty$ and does so from above (recall that $(1 - 1/n)^{n} \approx e^{-1}$ even for relatively small values of $n$).
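
For intuition, the baseline bound is easy to evaluate numerically; a short check (ours) of the behavior described above:

import math

def B(n: int, w: float) -> float:
    # Baseline bound B(n, w) = n * w * (1 - w)^(n - 1).
    return n * w * (1.0 - w) ** (n - 1)

n = 1000
print(B(n, 1 / n))               # ~0.368: the maximum, slightly above 1/e
print(B(n, 1 / n) > 1 / math.e)  # True: (1 - 1/n)^(n - 1) approaches 1/e from above
print(B(n, 2.0 ** -40))          # ~9e-10: negligible for very low-weight predicates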

Claim 4.1. For all $n$, $w \le 1/n$, and $D$, $\mathrm{base}_D(n,w) \le B(n,w)$. For min-entropic $D$, $\mathrm{base}_D(n,w) \ge B(n,w) - \mathrm{negl}(n)$.

Proof: Observe that for any randomized trivial adversary $T$, there exists a deterministic trivial adversary $T'$ such that $\mathrm{Succ}_w^{T',\perp}(n) \ge \mathrm{Succ}_w^{T,\perp}(n)$. Therefore, without loss of generality, one can assume that the trivial adversary that achieves the baseline success probability is deterministic.

A deterministic $T$ always outputs the same predicate $p$, of some weight $\mathrm{wt}(p) = w' \le w$, and

$$\Pr_{x\sim D^n}\big[\mathrm{iso}(p,x)\big] = n \cdot w' \cdot (1-w')^{n-1}.$$

Therefore,

$$\mathrm{base}(n,w) = \sup_{\substack{w' \le w\\ w' = \mathrm{wt}(p)}} n \cdot w' \cdot (1-w')^{n-1} \le B(n,w),$$

where the last inequality follows from the monotonicity of $B(n,w')$ in the range $w' \in [0,w]$. By Fact 3.1, there exists $p^-$ such that $\mathrm{wt}(p^-) \ge w - \mathrm{negl}(n)$. Hence, $\mathrm{base}(n,w) \ge B(n,\mathrm{wt}(p^-)) \ge B(n,w) - \mathrm{negl}(n)$.

Examples: For a (possibly randomized) function $f$, consider the mechanism $M_f$ that on input $x = (x_1, \ldots, x_n)$ outputs $M_f(x) = (f(x_1), \ldots, f(x_n))$. Whether $M_f$ is PSO secure depends on the choice of $f$:

  • Identity function: If $f(x) = x$, then $M_f(x) = (x_1, \ldots, x_n)$ and every distinct $x_i \in x$ can be isolated by the predicate $p_i : x \mapsto \mathbb{I}(x = x_i)$. If $\Pr_{x\sim D}[x = x_i] = \mathrm{negl}(n)$ (e.g., when $D$ is uniform over $\{0,1\}^{\omega(\log(n))}$ or $D$ is min-entropic), then $\mathrm{wt}_D(p_i) = \mathrm{negl}(n)$. Hence, $M_f$ is not PSO secure.

  • Pseudonymization: If $f$ is one-to-one and public, it offers no more protection than the identity function (see the sketch following these examples). For unique $x_i \in x$ and $y_i = f(x_i)$, the predicate $p_i : x \mapsto \mathbb{I}(f(x) = y_i)$ isolates $x_i$. If $D$ is min-entropic, $\mathrm{wt}_D(p_i) = \mathrm{negl}(n)$. Observe that $M_f$ is not PSO secure, even if $f^{-1}$ is not efficiently computable. Furthermore, $f$ being many-to-one does not guarantee PSO security. For instance, suppose the data are uniform over $\{0,1\}^n$ and $f : \{0,1\}^n \to \{0,1\}^{n/2}$ outputs the last $n/2$ bits of an input $x$. $M_f$ is not PSO secure. In fact, it is possible to single out every row $x_i \in x$ using the same predicates $p_i$ as above. Together, these observations challenge the use of some forms of pseudonymization.

  • Random function: If $f(x)$ is a secret, random function, then $M_f(x)$ carries no information about $x$ and provides no benefit to the adversary over a trivial adversary. For every $x$, a trivial adversary $T$ can perfectly simulate any adversary $A(M_f(x))$ by executing $A$ on a random input. Hence, $M_f$ is PSO secure.
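
The pseudonymization example above can be made concrete in a few lines of Python (a hypothetical illustration; the records and the choice of SHA-256 as the public one-to-one map $f$ are ours):

import hashlib

def pseudonymize(record: str) -> str:
    # A public, deterministic, (effectively) one-to-one map f.
    return hashlib.sha256(record.encode()).hexdigest()

dataset = ["alice|1972|91015", "bob|1980|91011", "carol|1965|91017"]
release = [pseudonymize(r) for r in dataset]      # M_f(x) = (f(x_1), ..., f(x_n))

# Seeing pseudonym y_1, the adversary outputs the low-weight predicate
# p_1(x) = I(f(x) = y_1), which isolates row 1 of the original dataset.
y1 = release[0]
p1 = lambda record: int(pseudonymize(record) == y1)
print([p1(r) for r in dataset])                   # [1, 0, 0]: p_1 isolates a row in x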

C. Reflections on Modeling.

In many ways, Definition 4.5 demands the sort of high level of protection typical in the foundations of cryptography. It requires a mechanism to provide security for all distributions D and against nonuniform, computationally unbounded adversaries.‡‡ The main weakness in the required protection is that it considers only data that are independent and identically distributed (i.i.d.), whereas real-life data cannot generally be modeled as i.i.d.

Any mechanism that purports to be a universal anonymizer of data under the GDPR—by transforming personal data into nonpersonal data—must prevent singling out. Our definition is intended to capture a necessary condition for a mechanism to be considered as rendering data sufficiently anonymized under the GDPR. Any mechanism that prevents singling out in all cases must prevent it in the special case that the data are i.i.d. from a distribution D and for wmax=negl(n). We view a failure to provide security against predicate singling out (Definition 4.5) as strong evidence that a mechanism does not prevent singling out as conceived of by the GDPR.

On the other hand, satisfying Definition 4.5 is not sufficient for arguing that a mechanism renders data anonymized under the GDPR. Singling out is only one of the many “means reasonably likely to be used” (4) to identify a person in a data release. Furthermore, the definition considers only i.i.d. data; it may not even imply that a mechanism prevents singling out in other relevant circumstances.

It is important that our definitions are parameterized by the weight $w$. An unrestricted trivial adversary can isolate a row with probability of about $1/e \approx 37\%$ by outputting a predicate of weight about $1/n$. If 37% was used as the general baseline (without consideration of the weight of the predicate), then the definition would permit an attacker to learn very specific information about individuals in a dataset, as long as it does so with probability less than 37%. For instance, such a definition would permit a mechanism that published a row from the dataset with a one in three chance.

Remark 4.2. The baseline is also negligible when the trivial adversary is required to output predicates with weight at least $\omega(\log n / n)$. It is not clear to the authors how beneficial finding a predicate in this regime may be to an attacker. This high-weight regime is analyzed analogously in ref. 9.

5. Properties of PSO Security

Two desirable properties of privacy concepts are 1) immunity to postprocessing (i.e., further processing of the outcome of a mechanism, without access to the data, should not increase privacy risks), and 2) closure under composition (i.e., a combination of two or more mechanisms which satisfy the requirements of the privacy concept is a mechanism that also satisfies the requirements, potentially with worse parameters). It is easy to see that PSO security withstands postprocessing. However, it does not withstand composition, and we give two demonstrations for this fact.

First, we consider mechanisms which count the number of dataset rows satisfying a property and show that every such mechanism is PSO secure. However, there exists a collection of ω(log(n)) counts which allows an adversary to isolate a row with probability arbitrarily close to one using a predicate with negligible weight. Second, we construct a (less natural) pair of mechanisms that are individually PSO secure, but together allow the recovery of a row in the dataset. This construction borrows ideas from ref. 17.

Not being closed under composition is a significant weakness of the notion of PSO security. Our constructions rely on very simple mechanisms that we expect would be considered as preventing singling-out attacks (as a legal matter), even under other formulations of the concept. As such, it may well be that nonclosure under composition is an inherent property of the concept of singling out.

As a policy matter, we believe that closure under composition (as well as immunity to postprocessing) should be considered prerequisites for any privacy concept deemed sufficient to protect individuals’ sensitive data. Pragmatically, the fact that PSO security is not closed under composition suggests that this concept is best used for disqualifying privacy technology (i.e., if it is not PSO secure). This concept should not be used alone to certify or approve the use of any technology.

A. A PSO-Secure Counting Mechanism.

For any predicate $q : X \to \{0,1\}$, we define the corresponding Counting Mechanism:

Algorithm 1: Counting Mechanism $M_{\#q}$:
input: $x = (x_1, \ldots, x_n)$
return $|\{1 \le i \le n : q(x_i) = 1\}|$

For example, consider the least-significant-bit predicate lsb that takes as input a string $x \in \{0,1\}^d$ and outputs the first bit $x[1]$. The corresponding Counting Mechanism $M_{\#\mathrm{lsb}}$ returns the sum of the first column of $x$.
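
A direct Python rendering of Algorithm 1 (ours), together with the lsb example:

from typing import Callable, Sequence

def counting_mechanism(q: Callable[[str], int], x: Sequence[str]) -> int:
    # Algorithm 1: M_#q(x) = |{i : q(x_i) = 1}|, a single exact count.
    return sum(q(xi) for xi in x)

lsb = lambda s: int(s[0])              # the predicate returning the first bit x[1]
x = ["0110", "1001", "1111", "0000"]
print(counting_mechanism(lsb, x))      # 2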

M#q is PSO secure for any predicate q. This is a corollary of the following proposition.

Proposition 5.1. For all $A$, $n > 0$, $w \le 1/n$, and $M : X^n \to Y$, $\mathrm{Succ}_w^{A,M}(n) \le |Y| \cdot \mathrm{base}(n,w)$.

Proof: We define a trivial adversary $T$ such that for all $A$, $\mathrm{Succ}_w^{T,\perp}(n) \ge \frac{1}{|Y|}\mathrm{Succ}_w^{A,M}(n)$. The proposition follows by the definition of $\mathrm{base}(n,w)$. $T$ samples a random $y \leftarrow_R Y$ and returns $p \leftarrow A(y)$. For all datasets $x$, there exists $y^* = y^*(x) \in Y$ such that

$$\Pr_{p\leftarrow A(y^*)}\big[\mathrm{iso}(p,x) \wedge \mathrm{wt}(p) \le w\big] \ge \Pr_{p\leftarrow A(M(x))}\big[\mathrm{iso}(p,x) \wedge \mathrm{wt}(p) \le w\big].$$

For all $x$, $\Pr_{y\leftarrow_R Y}[y = y^*] = \frac{1}{|Y|}$. Therefore,

$$\mathrm{Succ}_w^{T,\perp}(n) = \Pr_{\substack{x\sim D^n,\ y\leftarrow_R Y\\ p\leftarrow A(y)}}\big[\mathrm{iso}(p,x) \wedge \mathrm{wt}(p) \le w\big] \ge \frac{\mathrm{Succ}_w^{A,M}(n)}{|Y|}.$$

As exact counts are not differentially private, the PSO security of M#q demonstrates that differential privacy (Definition 6.1) is not necessary for PSO security. The security of a single exact count easily extends to O(1)-many counts (even adaptively chosen), as the size of the codomain grows polynomially.

B. Failure to Compose.

Our next theorem states that a fixed set of $\omega(\log(n))$ counts suffices to predicate single out with probability close to $e^{-1}$. For a collection of predicates $Q = (q_0, \ldots, q_m)$, let $M_{\#Q}(x) \triangleq (M_{\#q_0}(x), \ldots, M_{\#q_m}(x))$. Let $D = U_d$ be the uniform distribution over $X = \{0,1\}^d$ for some $d = \omega(\log(n))$.

Theorem 5.2. For any $m \le d$, there exists a $Q$ and $A$ such that

$$\mathrm{Succ}_{2^{-m}}^{A,M_{\#Q}}(n) \ge B(n, 1/n) - \mathrm{negl}(n).$$

Choosing $m = \omega(\log(n))$ yields $2^{-m} = \mathrm{negl}(n)$.

Proof: Treating $x \in \{0,1\}^d$ as a binary number in $[0, 2^d - 1]$, let $q_0(x) = \mathbb{I}(x < 2^d/n)$. Observe that $\mathrm{wt}(q_0) = 1/n \pm \mathrm{negl}(n)$.

For $i \in \{1, \ldots, m\}$, define the predicate $q_i(x) \triangleq (q_0(x) \wedge x[i])$, and let $y_i = M_{\#q_i}(x)$. Observe that if it happens that $q_0$ isolates row $j^*$ of $x$, then $y_i = x_{j^*}[i]$. Consider the deterministic adversary $A$ that on input $M_{\#Q}(x) = (y_0, \ldots, y_m)$ outputs the predicate

$$p(x) = q_0(x) \wedge \bigwedge_{i=1}^{m}\big(x[i] = y_i\big).$$

Observe that $\mathrm{iso}(q_0,x) \Rightarrow \mathrm{iso}(p,x)$ and that by construction $\mathrm{wt}(p) \le 2^{-m}$. Thus,

$$\mathrm{Succ}_{2^{-m}}^{A,M_{\#Q}}(n) = \Pr_{\substack{x\sim U_d^n\\ p\leftarrow A(M_{\#Q}(x))}}\big[\mathrm{iso}(p,x)\big] \ge \Pr_{x\sim U_d^n}\big[\mathrm{iso}(q_0,x)\big] \ge B(n, 1/n) - \mathrm{negl}(n).$$

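A simulation (ours) of the Theorem 5.2 attack; for concreteness we take $x[i]$ to be the $i$th low-order bit, so that $q_0$ leaves those bits uniform:

import random

def composition_attack(n=100, m=48, d=256, trials=500, seed=1):
    # Given the exact counts #q_0, #q_1, ..., #q_m, where q_0(x) = I(x < 2^d / n)
    # and q_i(x) = q_0(x) AND x[i], reconstruct m bits of the (typically unique)
    # row satisfying q_0 and isolate it with a predicate of weight <= 2^-m.
    rng = random.Random(seed)
    threshold = (1 << d) // n                  # wt(q_0) ~ 1/n
    bit = lambda v, i: (v >> i) & 1
    successes = 0
    for _ in range(trials):
        x = [rng.getrandbits(d) for _ in range(n)]
        counts = [sum((xi < threshold) and bit(xi, i) for xi in x) for i in range(m)]
        y = [min(c, 1) for c in counts]        # equals x_{j*}[i] whenever q_0 isolates j*
        p = lambda xi: xi < threshold and all(bit(xi, i) == y[i] for i in range(m))
        successes += sum(p(xi) for xi in x) == 1
    return successes / trials

print(composition_attack())                    # close to B(n, 1/n) ~ 0.37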

Remark 5.3. When the attack succeeds, all of the predicates $q_i$ match 0 or 1 rows in $x$. It may seem that an easy way to counter the attack is by masking low counts, a common measure taken, e.g., in contingency tables. However, it is easy to modify the attack to only use predicates $Q$ matching $\Theta(n)$ rows using one extra query. This means that restricting the mechanism to suppress low counts cannot prevent this type of attack. Let $q^*$ be a predicate with $\mathrm{wt}_{U_d}(q^*) = 1/2$ (e.g., parity of the bits), and let $q_i^* = q_i \vee q^*$. The attack succeeds whenever $q_0^*(x) = q^*(x) + \frac{1}{n}$. If $q^*(x)$ and $q_0(x)$ are independent, then this occurs with probability at least $\frac{1}{2}B(n,1/n) - \mathrm{negl}(n)$. As before, the probability can be amplified to $1 - \mathrm{negl}(n)$.

While a single count is PSO secure for any data distribution, the above attack against $\omega(\log(n))$ counts applies only to the uniform distribution $U_d$. We can extend the attack to general min-entropic distributions $D$ at the cost of randomizing the attacked mechanism $M$ (i.e., the set of predicates $Q$). Furthermore, for min-entropic $D$, this probability can be amplified to $1 - \mathrm{negl}(n)$ by repetition (9).

C. Failure to Compose Twice.

Borrowing ideas from refs. 17 and 18, we construct two mechanisms $M_{\mathrm{ext}}$ and $M_{\mathrm{enc}}$ which are individually PSO secure (for arbitrary distributions), but which together allow an adversary to single out with high probability when the data are uniformly distributed over the universe $X = \{0,1\}^d$. With more work, this composition attack can be extended to more general universes and to min-entropic distributions.

Theorem 5.4. $M_{\mathrm{ext}}$ and $M_{\mathrm{enc}}$ described below are PSO secure. For $m = \omega(\log(n))$ and $m \le \min(d, (n-1)/4)$, $X = \{0,1\}^d$, and $D = U_d$ the uniform distribution over $X$, there exists an adversary $A$ such that

$$\mathrm{Succ}_{2^{-m}}^{A, M_{\mathrm{ExtEnc}}}(n) \ge 1 - \mathrm{negl}(n),$$

where $M_{\mathrm{ExtEnc}} = (M_{\mathrm{ext}}, M_{\mathrm{enc}})$.

We divide the input dataset into two parts. We treat $x_{\mathrm{ext}} = (x_1, \ldots, x_{n-1})$ as a “source of randomness” and $x_n$ as a “message.” $M_{\mathrm{ext}}(x)$ outputs an encryption secret key $s$ based on the rows in $x_{\mathrm{ext}}$, using the von Neumann extractor as described in Algorithm 2.

[Algorithm 2: Extractor Mechanism $M_{\mathrm{ext}}$ (the von Neumann extractor applied to $x_{\mathrm{ext}}$); figure not reproduced.]

$M_{\mathrm{enc}}(x)$ runs $s \leftarrow M_{\mathrm{ext}}(x)$. If $s \neq \perp$, it outputs $c = s \oplus x_n[1{:}m]$ (using $s$ as a one-time pad to encrypt the first $m$ bits of $x_n$); otherwise, it outputs $\perp$. Alone, neither $s$ nor $c$ allows the adversary to single out, but using both, an adversary can recover $x_n[1{:}m]$ and thereby predicate single out the last row.
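
A sketch (ours) of one plausible instantiation of the two mechanisms; the paper's Algorithm 2, whose figure is not reproduced above, may differ in details such as which bits of $x_{\mathrm{ext}}$ are fed to the extractor:

import secrets

def von_neumann_extract(bits, m):
    # Scan disjoint pairs: (0,1) -> 0, (1,0) -> 1, equal pairs discarded.
    # Return m extracted bits, or None (standing in for "bot") if too few.
    out = []
    for i in range(0, len(bits) - 1, 2):
        if bits[i] != bits[i + 1]:
            out.append(bits[i])
            if len(out) == m:
                return out
    return None

def M_ext(x, m):
    # Extract an m-bit key s from the first bit of each of the rows x_1..x_{n-1}.
    return von_neumann_extract([row[0] for row in x[:-1]], m)

def M_enc(x, m):
    # One-time-pad the first m bits of the last row with s (or output "bot").
    s = M_ext(x, m)
    return None if s is None else [s[i] ^ x[-1][i] for i in range(m)]

# Individually, s is (near) uniform and c is a one-time-pad ciphertext; together
# they reveal x_n[1:m] = c XOR s, which supports a predicate isolating the last row.
n, d, m = 200, 128, 16
x = [[secrets.randbits(1) for _ in range(d)] for _ in range(n)]
s, c = M_ext(x, m), M_enc(x, m)
print(s is not None and [c[i] ^ s[i] for i in range(m)] == x[-1][:m])  # True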

Proof Outline: We must prove that $M_{\mathrm{ext}}$ and $M_{\mathrm{enc}}$ are PSO secure and that $M_{\mathrm{ExtEnc}}$ is not. Note that the security of $M_{\mathrm{ext}}$ and $M_{\mathrm{enc}}$ does not follow merely from the fact that their outputs are nearly uniform.§§

$M_{\mathrm{ext}}$ is $(\ln(2), 0, 1/n)$-PSO secure.

Consider the mechanism $M_{\mathrm{ext}}^{\sigma}(x)$ that samples a random permutation $\sigma : [n] \to [n]$ and outputs $M_{\mathrm{ext}}(\sigma(x))$. For any $x$, $M_{\mathrm{ext}}^{\sigma}(x)$ is uniform conditioned on not outputting $\perp$. Its security is equivalent to that of a mechanism outputting a single bit of information, which itself is $(\ln(2), 0, 1/n)$-PSO secure by Proposition 5.1 (and therefore also PSO secure by the observation at the end of Security Against Predicate Singling Out). To complete the proof, one can show that $\mathrm{Succ}_w^{A,M_{\mathrm{ext}}}(n) = \mathrm{Succ}_w^{A,M_{\mathrm{ext}}^{\sigma}}(n)$.

Menc is PSO-secure.

We separately consider the two possible values of $p(x_n)$, where $p$ is the predicate returned by $A$. Let $\mathrm{Succ}_w^{A,M_{\mathrm{enc}}}(n) = \gamma_0 + \gamma_1$, where

$$\gamma_b \triangleq \Pr\big[\mathrm{iso}(p,x) \wedge \mathrm{wt}_D(p) \le w \wedge p(x_n) = b\big].$$

The output of $M_{\mathrm{enc}}(x)$ is deterministic and information-theoretically independent of $x_n$. Thus, for any $w = \mathrm{negl}(n)$,

$$\gamma_1 \le \Pr_{x,A}\big[p(x_n) = 1 \wedge \mathrm{wt}_D(p) \le w\big] \le w = \mathrm{negl}(n).$$

If $A$ singles out and $p(x_n) = 0$, then it is effectively singling out against the subdataset $x_{-n} = (x_1, \ldots, x_{n-1})$. That is,

$$\gamma_0 = \Pr_{x,A}\big[\mathrm{iso}(p, x_{-n}) \wedge \mathrm{wt}_D(p) \le w \wedge p(x_n) = 0\big].$$

We construct $B$ that tries to single out against mechanism $M_{\mathrm{ext}}$ using $A$. On input $s$, $B$ samples $x_n \sim D$ and runs $p \leftarrow A(s \oplus x_n[1{:}m])$.

$$\mathrm{Succ}_w^{B,M_{\mathrm{ext}}}(n) \ge \Pr\big[\mathrm{iso}(p, x_{-n}) \wedge \mathrm{wt}_D(p) \le w \,\big|\, p(x_n) = 0\big] \cdot \Pr\big[p(x_n) = 0 \wedge \mathrm{wt}_D(p) \le w\big] \ge \gamma_0\,(1 - \mathrm{negl}(n)).$$

By the PSO security of Mext, γ0 is negligible.

Insecurity of $M_{\mathrm{ExtEnc}}$ for $D = U_d$.

The output of $M_{\mathrm{ExtEnc}}(x)$ is a pair $(s, c)$. If $(s, c) = (\perp, \perp)$, $A$ aborts. By a Chernoff bound, for $m \le (n-1)/4$, $\Pr_x[s = \perp] \le e^{-(n-1)/16} = \mathrm{negl}(n)$. If $(s, c) \neq (\perp, \perp)$, $A$ recovers $x_n[1{:}m] = c \oplus s$ and outputs the predicate

$$p(x) = \mathbb{I}\big(x[1{:}m] = x_n[1{:}m]\big).$$

By the choice of $m = \omega(\log(n))$, $\mathrm{wt}_{U_d}(p) = 2^{-m} = \mathrm{negl}(n)$. Moreover, $\Pr[\mathrm{iso}(p,x) \mid s \neq \perp] = 1 - \Pr[\exists j \neq n : x_j[1{:}m] = x_n[1{:}m]] \ge 1 - n \cdot 2^{-m} > 1 - \mathrm{negl}(n)$. The bound on $\mathrm{Succ}_{2^{-m}}^{A, M_{\mathrm{ExtEnc}}}$ follows, completing the proof of the claim and the theorem.

D. Singling Out and Failure to Compose.

The failure to compose demonstrated in Theorem 5.2 capitalizes on the use of multiple counting queries. Such queries underlie a large variety of statistical analyses and machine learning algorithms. We expect that other attempts to formalize security against singling out would also allow counting queries. If so, our negative composition results may extend to other formalizations.

The failure to compose demonstrated in Theorem 5.4 is more contrived. We expect that other attempts to formalize security against singling out would allow mechanisms like Mext—ones where for every input x randomly permuting the input and applying the mechanism results in the uniform distribution over outputs (as in the proof of Theorem 5.4). It is less clear to us whether other possible formalizations of security against singling out would allow a mechanism like Menc. If it is to compose, it likely must not.

6. Differential Privacy Provides PSO Security

In this section, we demonstrate that differential privacy implies PSO security. Because exact counts are not differentially private but are PSO secure (Proposition 5.1), we have already shown that PSO security does not imply differential privacy.

Recall that we model $x \in X^n$ as containing personal data of $n$ distinct individuals. For $x, x' \in X^n$, we write $x \sim x'$ if $x$ and $x'$ differ on the data of exactly one individual.

Definition 6.1 [Differential Privacy (10, 19)]: A randomized mechanism $M : X^n \to T$ is $(\varepsilon, \delta)$-differentially private if for all $x, x' \in X^n$ with $x \sim x'$ and for all events $S \subseteq T$,

$$\Pr[M(x) \in S] \le e^{\varepsilon} \Pr[M(x') \in S] + \delta,$$

where the probability is taken over the randomness of the mechanism M.
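
As a standard illustration of Definition 6.1 (not taken from the paper), a single counting query can be released with differential privacy via the Laplace mechanism; a count has sensitivity 1, so noise of scale $1/\varepsilon$ suffices for $(\varepsilon, 0)$-differential privacy:

import random

def dp_count(x, q, epsilon, rng=random):
    # Noisy count of rows satisfying q: Laplace mechanism with scale 1/epsilon.
    true_count = sum(q(xi) for xi in x)
    # The difference of two Exp(epsilon) variables is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

x = ["0110", "1001", "1111", "0000"]
print(dp_count(x, lambda s: int(s[0]), epsilon=0.5))  # noisy version of the exact count 2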

Our analysis relating PSO security to differential privacy is through a connection of both concepts to statistical generalization. For differential privacy, this connection was established in refs. 13 and 14. We use a variant of the latter from ref. 20:

Lemma 6.1 (Generalization Lemma). For all distributions $D$ and for all $(\varepsilon, \delta)$-differentially private algorithms $A : x \mapsto p$ operating on a dataset $x$ and outputting a predicate $p : X \to \{0,1\}$,

$$\mathop{\mathbb{E}}_{x\sim D^n}\ \mathop{\mathbb{E}}_{p\leftarrow A(x)}\big[p(x)\big] \le e^{\varepsilon} \cdot \mathop{\mathbb{E}}_{x\sim D^n}\ \mathop{\mathbb{E}}_{p\leftarrow A(x)}\big[\mathrm{wt}_D(p)\big] + \delta.$$

Theorem 6.2. For all $\varepsilon = O(1)$, $\delta = \mathrm{negl}(n)$, and $w_{\max} \le 1/n$, if $M$ is $(\varepsilon, \delta)$-differentially private, then $M$ is $(\varepsilon', \delta', w_{\max})$-PSO secure for

$$\varepsilon' = \varepsilon + (n-1)\ln\frac{1}{1 - w_{\max}} \le \varepsilon + 1 \quad\text{and}\quad \delta' = n\delta + \mathrm{negl}(n).$$

In particular, for $w_{\max} = o(1/n)$, $\varepsilon' = \varepsilon + o(1)$.

Proof: For simplicity of exposition, we present the proof for min-entropic distributions D. The proof for general distributions follows from a similar argument.

Given $p \leftarrow A(M(x))$, $w \le w_{\max}$, and $D$, define the predicate $p^*$:

$$p^*(x) \triangleq \begin{cases} p(x) & \text{if } \mathrm{wt}_D(p) \le w, \\ 0 & \text{if } \mathrm{wt}_D(p) > w. \end{cases}$$

Observe that $\mathrm{wt}_D(p^*) \le w$. The predicate $p^*$ can be computed from $p$, $D$, and $w$ without further access to $x$. Because differential privacy is closed under postprocessing, if $M$ is $(\varepsilon, \delta)$-differentially private, then the computation that produces $p^*$ is $(\varepsilon, \delta)$-differentially private as well. Recall $\mathrm{iso}(p,x) \Leftrightarrow p(x) = 1/n$.

$$\begin{aligned}
\mathrm{Succ}_w^{A,M}(n) &= \Pr_{x,p}\big[p(x) = 1/n \wedge \mathrm{wt}_D(p) \le w\big] \\
&\le \Pr_{x,p}\big[p(x) \ge 1/n \wedge \mathrm{wt}_D(p) \le w\big] = \Pr_{x,p}\big[p^*(x) \ge 1/n\big] \\
&\le n \cdot \mathbb{E}_{x,p}\big[p^*(x)\big] \le n\big(e^{\varepsilon} w + \delta\big) \qquad \text{(Lemma 6.1)} \\
&= e^{\varepsilon}\,\frac{B(n,w)}{(1-w)^{n-1}} + n\delta \\
&\le e^{\varepsilon}\,\frac{\mathrm{base}(n,w)}{(1-w)^{n-1}} + n\delta + \mathrm{negl}(n) \qquad \text{(Claim 4.1)} \\
&\le e^{\varepsilon'}\,\mathrm{base}(n,w) + \delta'.
\end{aligned}$$

7. Does k-Anonymity Provide PSO Security?

k-anonymity (11, 12) is a strategy intended to help a data holder “release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful” (12). It is achieved by making each individual in a data release indistinguishable from at least $k - 1$ other individuals along certain attributes. Typically, a $k$-anonymized dataset is produced by subjecting the original dataset to a sequence of generalization and suppression operations.

In this section, we analyze the extent to which $k$-anonymity provides PSO security. We show that a $k$-anonymized dataset typically provides an attacker with information sufficient to predicate single out with constant probability. This result challenges the determination of the Article 29 Working Party.¶¶

A. Preliminaries.

Let $(A_1, \ldots, A_m)$ be attribute domains. A dataset $x = (x_1, \ldots, x_n)$ is a collection of rows $x_i = (a_{i,1}, \ldots, a_{i,m})$, where $a_{i,j} \in A_j$. For subsets $\hat{a}_{i,j} \subseteq A_j$, we view $y_i = (\hat{a}_{i,1}, \ldots, \hat{a}_{i,m})$ as a set in the natural way, writing $x_i \in y_i$ if $\forall j \in [m]$, $a_{i,j} \in \hat{a}_{i,j}$. We say that a dataset $y = (y_1, \ldots, y_n)$ is derived from $x$ by generalization and suppression if $\forall i \in [n]$, $x_i \in y_i$. For example, if $(A_1, A_2, A_3)$ correspond to “5-Digit ZIP Code,” “Gender,” and “Year of Birth,” then it may be that $x_i = (91015, \mathrm{F}, 1972)$ and $y_i = (91010\text{–}91019, \ast, 1970\text{–}1975)$, where $\ast$ denotes a suppressed value.

$k$-anonymity aims to capture a sort of anonymity in a (very small) crowd: A data release $y$ is $k$-anonymous if any individual row in the release is indistinguishable from $k - 1$ other individual rows. Let $\mathrm{count}(y, y') \triangleq |\{i \in [n] : y_i = y'\}|$ be the number of rows in $y$ which agree with $y'$.***

Definition 7.1 ($k$-Anonymity [Rephrased from Ref. 12]): For $k \ge 2$, a dataset $y$ is $k$-anonymous if $\mathrm{count}(y, y_i) \ge k$ for all $i \in [n]$. An algorithm is called a $k$-anonymizer if on an input dataset $x$ its output is a $k$-anonymous $y$ which is derived from $x$ by generalization and suppression.

For a $k$-anonymous dataset $y = (y_1, \ldots, y_n)$, let $\phi_1(x) = \mathbb{I}(x \in y_1)$ be the predicate that returns 1 if $y_1$ could have been derived from $x$ by generalization and suppression. Let $x_{\phi_1} = \{x' \in x : \phi_1(x') = 1\}$. We assume for simplicity that $k$-anonymizers always output $y$ such that $|x_{\phi_1}| = k$, but the theorem generalizes to other settings.

B. k-Anonymity Enables Predicate Singling Out.

Before presenting the main theorem of this section, we provide an example of a very simple $k$-anonymizer that fails to provide security against predicate singling out. Let $D = U_n$ be the uniform distribution over $U = \{0,1\}^n$.

Consider the $k$-anonymizer that processes groups of $k$ rows in index order and suppresses all bit locations where any of the $k$ rows disagree. Namely, for each group $g = 1, \ldots, n/k$ of $k$ rows $(x_{(g-1)k+1}, \ldots, x_{(g-1)k+k})$, it outputs $k$ copies of the string $y_g \in \{0, 1, \ast\}^n$, where $y_g[j] = b \in \{0,1\}$ if $x_{(g-1)k+1}[j] = \cdots = x_{(g-1)k+k}[j] = b$ (i.e., all of the $k$ rows in the group have $b$ as their $j$th bit) and $y_g[j] = \ast$ otherwise.

The predicate $\phi_1(x)$ evaluates to 1 if $y_1[j] \in \{x[j], \ast\}$ for all $j \in [n]$ and evaluates to 0 otherwise. Namely, $\phi_1(x)$ checks whether $x$ agrees with $y_1$ (and, hence, with all of $x_1, \ldots, x_k$) on all nonsuppressed bits.

In expectation, $n/2^{k-1}$ positions of $y_1$ are not suppressed. For large enough $n$, with high probability over the choice of $x$, at least $\frac{n}{2 \cdot 2^{k-1}} = n/2^k$ positions in $y_1$ are not suppressed. In this case, $\mathrm{wt}_D(\phi_1) \le 2^{-n/2^k}$, which is $\mathrm{negl}(n)$ for any constant $k$.

We now show how $\phi_1$ can be used adversarially. In expectation, $n(1 - 2^{-(k-1)}) \ge n/2$ positions of $y_1$ are suppressed. For large enough $n$, with high probability over the choice of $x$, at least $n/4$ of the positions in $y_1$ are suppressed. Denote these positions $i_1, \ldots, i_{n/4}$. Define the predicate $p_k(x)$ that evaluates to 1 if the binary number resulting from concatenating $x[i_1], x[i_2], \ldots, x[i_{n/4}]$ is greater than $(1 - 1/k) \cdot 2^{n/4}$ and 0 otherwise. Note that $\mathrm{wt}_D(p_k) \approx 1/k$ and, hence, $p_k$ isolates within group 1 with probability $\approx 1/e \approx 0.37$, as was the case with the trivial adversary described after Definition 4.1.

An attacker observing $\phi_1$ can now define a predicate $p(x) = \phi_1(x) \wedge p_k(x)$. By the analysis above, $\mathrm{wt}(p)$ is negligible (as it is bounded by $\mathrm{wt}(\phi_1)$) and $p(x)$ isolates a row in $x$ with probability $\approx 0.37$. Hence, the $k$-anonymizer of this example fails to protect against singling out.
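
A simulation (ours) of this example: the toy bit-suppression $k$-anonymizer above and the attack predicate $p = \phi_1 \wedge p_k$ built from its first released record:

import random

def k_anonymize(x, k):
    # Within each group of k consecutive rows, keep a bit only if all k rows
    # agree on it; otherwise suppress it ("*").
    y = []
    for g in range(0, len(x), k):
        group = x[g:g + k]
        pattern = [group[0][j] if all(r[j] == group[0][j] for r in group) else "*"
                   for j in range(len(group[0]))]
        y.extend([pattern] * k)
    return y

def attack_predicate(y1, k):
    # p = phi_1 AND p_k: match the released pattern, then cut the suppressed
    # bits (read as a binary number) down to a ~1/k fraction of their range.
    suppressed = [j for j, c in enumerate(y1) if c == "*"]
    cutoff = (1 - 1 / k) * 2 ** len(suppressed)
    def p(row):
        phi1 = all(c == "*" or row[j] == c for j, c in enumerate(y1))
        return phi1 and int("".join(row[j] for j in suppressed), 2) > cutoff
    return p

n, k, rng = 64, 4, random.Random(0)
trials, hits = 1000, 0
for _ in range(trials):
    x = ["".join(rng.choice("01") for _ in range(n)) for _ in range(n)]
    p = attack_predicate(k_anonymize(x, k)[0], k)
    hits += sum(p(r) for r in x) == 1
print(hits / trials)   # roughly B(k, 1/k) ~ 0.4: far above the negligible baseline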

Theorem 7.1 captures the intuition from this example and generalizes it, demonstrating that k-anonymity does not typically protect against predicate singling out.

Theorem 7.1. For any $k \ge 2$, there exists an algorithm $A$ such that for all min-entropic $D$, all $k$-anonymizers $\mathrm{Anon}$, and all $w \le 1/n$:

$$\mathrm{Succ}_w^{A,\mathrm{Anon}}(n) \ge \eta \cdot B(k, 1/k) - \mathrm{negl}(n) \approx \frac{\eta}{e}, \quad \text{where } \eta \triangleq \Pr_{\substack{x\sim D^n\\ y\leftarrow \mathrm{Anon}(x)}}\big[\mathrm{wt}_D(\phi_1) \le w\big].$$

To predicate single out, $A$ must output a predicate that both isolates a row in $x$ and has low weight. The theorem states that these two requirements essentially decompose: $\eta$ is the probability that the $k$-anonymizer induces a low-weight predicate $\phi_1$, and $B(k, 1/k)$ is the probability that a trivial adversary predicate singles out the subdataset $x_{\phi_1}$ of size $k$. Algorithms for $k$-anonymity generally try to preserve as much information in the dataset as possible. Thus, we expect such algorithms to typically yield low-weight predicates $\phi_1$ and correspondingly high values of $\eta$.

Proof Outline: $A$ will construct some predicate $q$ and output the conjunction $p \triangleq \phi_1 \wedge q$. Noting that $\mathrm{wt}_D(p) \le \mathrm{wt}_D(\phi_1)$, and that $\mathrm{iso}(q, x_{\phi_1}) \Rightarrow \mathrm{iso}(p, x)$,

$$\mathrm{Succ}_w^{A,\mathrm{Anon}}(n) \ge \Pr\big[\mathrm{iso}(q, x_{\phi_1}) \wedge \mathrm{wt}_D(\phi_1) \le w\big] = \eta \cdot \Pr\big[\mathrm{iso}(q, x_{\phi_1}) \,\big|\, \mathrm{wt}_D(\phi_1) \le w\big]. \quad [1]$$

It remains only to show that for min-entropic distributions $D$, $\Pr\big[\mathrm{iso}(q, x_{\phi_1}) \,\big|\, \mathrm{wt}_D(\phi_1) \le w\big] \ge B(k, 1/k) - \mathrm{negl}(n)$. This claim is reminiscent of Fact 3.1, but with an additional challenge. The rows in $x_{\phi_1}$ are not distributed according to $D$; instead, they are a function of $\mathrm{Anon}$ and the whole dataset $x$. They are not independently distributed, and even their marginal distributions may be different from $D$. Nevertheless, for the purposes of the baseline, the rows in $x_{\phi_1}$ have enough conditional min-entropy to behave like random rows.

Data Availability

This paper does not include original data.

Acknowledgments

We thank Uri Stemmer, Adam Sealfon, and anonymous reviewers for their helpful technical suggestions. This material is based upon work supported by the US Census Bureau under Cooperative Agreement CB16ADR0160001 and by NSF Grant CNS-1413920. A.C. was supported by the 2018 Facebook Fellowship and Massachusetts Institute of Technology’s (MIT’s) RSA Professorship and Fintech Initiative. This work was done while A.C. was at MIT, in part while visiting Georgetown University, and while both authors visited the Simons Institute at the University of California, Berkeley. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of funders.

Footnotes

The authors declare no competing interest.

This article is a PNAS Direct Submission.

*Title 13 of the US Code mandates the role of the US Census.

This point is emphasized in Recital 26: “The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”

The Article 29 Data Protection Working Party was established by the EU Data Protection Directive and issued guidance on the meaning of the Directive. The opinions we consider in this article have not been officially updated since GDPR’s passing. Moreover, GDPR tacitly endorses the Working Party’s guidance on singling out: It borrows the language of Recital 26 almost verbatim from the Data Protection Directive, but adds the phrase “such as singling out.”

§The notion of “singling out” is not defined in the Opinion on the Concept of Personal Data (7). It is used in ref. 7 four times, each consistent with the above interpretation. Our interpretation coincides with and was initially inspired by that of ref. 8, defining “singling out as occurring when an analyst correctly makes a statement of the form ‘There is exactly one user that has these attributes.’”

For completeness, one can also consider predicates of weight ω(logn/n), where the baseline is also negligible. See Remark 4.2.

More formally, we can consider an ensemble of data domains $X = \{X_n = \{0,1\}^{d(n)}\}_{n\in\mathbb{N}}$ and an ensemble of distributions $D = \{D_n\}_{n\in\mathbb{N}}$, where each $D_n$ is a distribution over $X_n$.

**As is typical in cryptography, strengthening the adversary to be nonuniform (including possibly having full knowledge of the distribution $D$) yields a stronger security definition. See Reflections on Modeling, where we reexamine these choices.

††For any $w(n) = \mathrm{negl}(n)$, let $\delta'(n) = e^{\epsilon(n)}\,\mathrm{base}(n, w(n)) + \delta(n) = \mathrm{negl}(n)$.

‡‡It is reasonable to limit the adversary in Definition 4.5 to polynomial time. If we restricted our attention to distributions with moderate min-entropy, the results in this paper would remain qualitatively the same: Our trivial adversaries and lower bounds are all based on efficient and uniform algorithms; our upper bounds are against unbounded adversaries. Relatedly, restricting to min-entropy distributions would allow us to switch the order of quantifiers of D and T in the definition of the baseline without affecting our qualitative results.

§§For example, the mechanism that outputs x1 may be uniform, but it trivially allows singling out. Security would follow if the output was nearly uniform conditioned on x, but Mext does not satisfy this extra property.

¶¶Our results hold equally for -diversity (21) and t-closeness (22), which the Article 29 Working Party also concludes prevent singling out.

***Often count is parameterized by a subset $Q$ of the attribute domains called a quasi-identifier. This parameterization does not affect our analysis, and we omit it for simplicity.

References

  • 1.Ohm P., Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Rev. 57, 1701–1777 (2010). [Google Scholar]
  • 2.Dwork C., McSherry F., Nissim K., Smith A., Calibrating noise to sensitivity in private data analysis. J. Priv. Confid. 7, 17–51 (2017). [Google Scholar]
  • 3.Nissim K., Wood A., Is privacy privacy? Philos. Trans. R. Soc. A. 376, 20170358 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.European Parliamentand the Council of the European Union , Regulation (EU) 2016/679, 206. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679. Accessed 1 April 2019.
  • 5.Article 29 Data Protection Working Party, Opinion 05/2014 on anonymisation techniques. https://iapp.org/media/pdf/resource_center/wp216_Anonymisation-Techniques_04-2014.pdf. Accessed 1 April 2019.
  • 6.Nissim K., et al. , Bridging the gap between computer science and legal approaches to privacy. Harv. J. Law Technol. 31, 687–780 (2018). [Google Scholar]
  • 7.Article 29 Data Protection Working Party, Opinion 04/2007 on the concept of personal data. https://iapp.org/media/pdf/resource_center/wp136_concept-of-personal-data_06-2007.pdf (2007). Accessed 1 April 2019.
  • 8.Francis P., et al. , Extended Diffix-Aspen. arXiv:1806.02075 (6 June 2018).
  • 9.Cohen A., Nissim K., Towards formalizing the GDPR’s notion of singling out. arXiv:1904.06009 (12 April 2019).
  • 10.Dwork C., McSherry F., Nissim K., Smith A., “Calibrating noise to sensitivity in private data analysis” in Third Theory of Cryptography Conference, Halevi S., Rabin T., Eds. (Lecture Notes in Computer Science, Springer, Berlin, Germany, 2006), vol. 3876, pp. 265–284. [Google Scholar]
  • 11.Samarati P., Sweeney L., “Generalizing data to provide anonymity when disclosing information” in Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Mendelzon A. O., Paredaens J., Eds. (ACM Press, New York, NY, 1998), p. 188. [Google Scholar]
  • 12.Sweeney L., k-anonymity: A model for protecting privacy. Internat. J. Uncertainty Fuzziness Knowledge-Based Syst. 10, 557–570 (2002). [Google Scholar]
  • 13.Dwork C., et al. , “Preserving statistical validity in adaptive data analysis” in Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Servedio R. A., Rubinfeld R., Eds. (ACM, New York, NY, 2015), pp. 117–126. [Google Scholar]
  • 14.Bassily R., et al. , “Algorithmic stability for adaptive data analysis” in Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Wichs D., Mansour Y., Eds. (ACM, New York, NY, 2016), pp. 1046–1059. [Google Scholar]
  • 15.Narayanan A., Shmatikov V., “Robust de-anonymization of large sparse datasets” in IEEE Symposium on Security and Privacy, 2008, SP 2008 (IEEE, Piscataway, NJ, 2008), pp. 111–125. [Google Scholar]
  • 16.Dodis Y., Ostrovsky R., Reyzin L., Smith A., Fuzzy extractors: How to generate strong keys from biometrics and other noisy data. SIAM J. Comput. 38, 97–139 (2008). [Google Scholar]
  • 17.Nissim K., Smith A. D., Steinke T., Stemmer U., Ullman J., “The limits of post-selection generalization” in Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18), Bengio S., Wallach H. M., Larochelle H., Grauman K., Cesa-Bianchi N., Eds. (Curran Associates Inc, Red Hook, NY, 2018), vol. 31, pp. 6402–6411. [Google Scholar]
  • 18.Dwork C., Naor M., On the difficulties of disclosure prevention in statistical databases or the case for differential privacy. J. Priv. Confid. 2, 93–107 (2010). [Google Scholar]
  • 19.Dwork C., Kenthapadi K., McSherry F., Mironov I., Naor M., “Our data, ourselves: Privacy via distributed noise generation” in Annual International Conference on the Theory and Applications of Cryptographic Techniques (Springer, Cham, Switzerland, 2006), pp. 486–503. [Google Scholar]
  • 20.Nissim K., Stemmer U., On the generalization properties of differential privacy. arXiv:1504.05800 (22 April 2015).
  • 21.Machanavajjhala A., Kifer D., Gehrke J., Venkitasubramaniam M., “L-diversity: Privacy beyond k-anonymity” in 22nd International Conference on Data Engineering (ICDE’06) (IEEE Computer Society, Washington, DC, 2007), p. 1. [Google Scholar]
  • 22.Li N., Li T., Venkatasubramanian S., “t-closeness: Privacy beyond k-anonymity and l-diversity” in Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, Chirkova R., Dogac A., Özsu M. T., Sellis T. K., Eds. (IEEE Computer Society, Washington, DC, 2007), pp. 106–115. [Google Scholar]


