Significance
This article addresses a gap between legal and technical conceptions of data privacy and demonstrates how it can be minimized. The article focuses on “singling out,” which is a concept appearing in the GDPR. Our analysis draws on the legislation, regulatory guidance, and mathematical reasoning to derive a technical concept—“predicate singling out”—aimed at capturing a core part of GDPR’s intent. Examination of predicate singling out sheds light on the concept of singling out and the question of whether existing technologies protect against such a threat. Conceptually, this work demonstrates the role that principled analysis supported by mathematical argument can and should play in articulating and informing public policy at the interface between law and technology.
Keywords: privacy, GDPR, singling out, differential privacy, k-anonymity
Abstract
There is a significant conceptual gap between legal and mathematical thinking around data privacy. The effect is uncertainty as to which technical offerings meet legal standards. This uncertainty is exacerbated by a litany of successful privacy attacks demonstrating that traditional statistical disclosure limitation techniques often fall short of the privacy envisioned by regulators. We define “predicate singling out,” a type of privacy attack intended to capture the concept of singling out appearing in the General Data Protection Regulation (GDPR). An adversary predicate singles out a dataset using the output of a data-release mechanism if it finds a predicate matching exactly one row in the dataset with probability much better than a statistical baseline. A data-release mechanism that precludes such attacks is “secure against predicate singling out” (PSO secure). We argue that PSO security is a mathematical concept with legal consequences. Any data-release mechanism that purports to “render anonymous” personal data under the GDPR must prevent singling out and, hence, must be PSO secure. We analyze the properties of PSO security, showing that it fails to compose. Namely, a combination of more than logarithmically many exact counts, each individually PSO secure, facilitates predicate singling out. Finally, we ask whether differential privacy and k-anonymity are PSO secure. Leveraging a connection to statistical generalization, we show that differential privacy implies PSO security. However, and in contrast with current legal guidance, k-anonymity does not: There exists a simple predicate singling out attack under mild assumptions on the k-anonymizer and the data distribution.
Data-privacy laws—like the Health Insurance Portability and Accountability Act, the Family Educational Rights and Privacy Act (FERPA), and Title 13 in the United States; and the European Union’s (EU’s) General Data Protection Regulation (GDPR)—govern the use of sensitive personal information.* These laws delineate the boundaries of appropriate use of personal information and impose steep penalties upon rule breakers. To adhere to these laws, practitioners need to apply suitable controls and statistical disclosure-limitation techniques. Many commonly used techniques, including pseudonymization, k-anonymity, bucketing, rounding, and swapping, offer privacy protections that are seemingly intuitive, but only poorly understood. And while there is a vast literature of best practices, a litany of successful privacy attacks demonstrates that these techniques often fall short of the sort of privacy envisioned by legal standards (e.g., ref. 1).
A more disciplined approach is needed. However, the significant conceptual gap between legal and mathematical thinking around data privacy poses a challenge. Privacy regulations are grounded in legal concepts, such as personally identifiable information (PII), linkage, distinguishability, anonymization, risk, and inference. In contrast, much of the recent progress in data-privacy technology is rooted in formal mathematical privacy models, such as differential privacy (2), that offer a foundational treatment of privacy, with provable privacy guarantees, meaningful semantics, and composability. And while such models are being actively developed and implemented in the academy, industry, and government, there is a lack of discourse between the legal and mathematical conceptions. The effect is uncertainty as to which technical offerings adequately match expectations expressed in legal standards (3).
Bridging between Legal and Technical Concepts of Privacy
We aim to address this uncertainty by translating between the legal and the technical. To do so, we begin with a concept appearing in the law, then model some aspect of it mathematically. With the mathematical formalism in hand, we can better understand the requirements of the law, their implications, and the techniques that might satisfy them.
In particular, we study the concept of “singling out” from the GDPR (4). We examine what it means for a data-anonymization mechanism to ensure “security against singling out” in a data release. Preventing singling out attacks in a dataset is a necessary (but maybe not sufficient) precondition for a dataset to be considered effectively anonymized and thereby free from regulatory restrictions under the GDPR. Ultimately, our goal is to better understand a concept foundational to the GDPR, by enabling a rigorous mathematical examination of whether certain classes of techniques (i.e., k-anonymity and differential privacy) provide security against singling out.
We are not the first to study this issue. The Article 29 Data Protection Working Party, an EU advisory body, provided guidance about the use of various privacy technologies—including k-anonymity and differential privacy—as anonymization techniques (5). Its analysis is centered on asking whether each technology effectively mitigates three risks: “singling out, linkability, and inference.” For instance, their “Opinion on Anonymisation Techniques” concludes that with k-anonymity, singling out is no longer a risk, whereas with differential privacy, it “may not” be a risk (5). Though similar in purpose to our work, its technical analyses are informal and coarse. Revisiting these questions with mathematical rigor, we recommend reconsidering the Working Party’s conclusions.
This work is part of a larger effort to bridge between legal and technical conceptions of privacy. An earlier work analyzed the privacy requirements of FERPA and modeled them in a game-based definition, as is common in cryptography (6). The definition was used to argue that the use of differentially private analyses suffices for satisfying a wide range of interpretations of FERPA. An important feature of FERPA that enabled that analysis is that FERPA and its accompanying documents contain a rather detailed description of a privacy attacker and the attacker’s goals.
1. Singling Out in the GDPR
We begin with the text of the GDPR (4). It consists of articles detailing the obligations placed on processors of personal data, as well as recitals containing explanatory remarks. Article 1 of the regulation delineates its scope:
This regulation lays down rules relating to the protection of natural persons with regard to the processing of personal data and rules relating to the free movement of personal data.
On the other hand, the GDPR places no restrictions on the processing of nonpersonal data, which includes personal data that have been “rendered anonymous.”†
The term “personal data” is defined in Article 4 of the GDPR to mean “any information relating to an identified or identifiable natural person; an identifiable natural person is one who can be identified, directly or indirectly.” What it means for a person to be “identified, directly or indirectly” is further elaborated in Recital 26:
To determine whether a natural person is identifiable, account should be taken of all of the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.
Thus, singling out is one way to identify a person in data, and only data that do not allow singling out may be excepted from the regulation.
Singling out is the only criterion for identifiability explicitly mentioned in the GDPR, the only occurrence of the term being the passage quoted above. For insight as to the meaning of singling out, we refer to two documents prepared by the Article 29 Data Protection Working Party.‡“Opinion on the Concept of Personal Data” (7) elaborates on the meaning of “identifiable, directly or indirectly.” A person is identified “within a group of persons [when] he or she is distinguished from all other members of the group.” One way of distinguishing a person from a group is by specifying “criteria which allows him to be recognized by narrowing down the group to which he belongs.” If the group is narrowed down to an individual, that individual has been singled out.§ Similarly, the Working Party’s “Opinion on Anonymisation Techniques” describes singling out as “isolat[ing] some or all records which identify an individual in [a] dataset.” Looking ahead, we will call this “isolating” an individual in the dataset and argue that not every instance of isolation should be considered a singling-out attack.
We highlight three additional insights from ref. 7 that inform our work. First, identification does not require a name or any other traditional identifier. For instance, singling out can be done with a “small or large” collection of seemingly innocuous traits (e.g., “the man wearing a black suit”) (7). Indeed, this is what is meant by indirectly identifiable. An example of singling out in practice cited by the Working Party “Opinion on Anonymisation Techniques” (5) showed that four locations sufficed to uniquely identify 95% of people in a pseudonymized dataset of time-stamped locations. This is considered singling out, even without a method of linking the location traces to individuals’ names.
Second, identifiable data may come in many forms, including microdata, aggregate statistics, news articles, encrypted data, video footage, and server logs. What’s important is not the form of the data—it is whether the data permit an individual to be singled out. We apply this same principle to the manner in which an individual is singled out within a dataset. Most examples focus on specifying a collection of attributes (e.g., four time-stamped locations) that match a single person in the data. A collection of attributes can be viewed as corresponding to a “predicate”: a function that assigns to each person in the dataset a value 0 or 1 (interpreted as “does not match” or “matches,” respectively). We interpret the regulation as considering data to be personal data if an individual can be isolated within a dataset using any predicate, not only those that correspond to collections of attributes.
Third, whether or not a collection of attributes identifies a person is context-dependent. “A very common family name will not be sufficient to identify someone—i.e., to single someone out—from the whole of a country’s population, while it is likely to achieve identification of a pupil in a classroom” (7). Both the prevalence of the name and the size of the group are important in the example and will be important in our formalization.
2. Main Findings
A. Defining Security Against Predicate Singling Out.
We formalize and analyze “predicate singling out,” a notion which is intended to model the GDPR’s notion of singling out. Following the discussion above, we begin with the idea that singling out an individual from a group involves specifying a predicate that uniquely distinguishes the individual, which we call “isolation.” Using this terminology, an intuitive interpretation of the GDPR is that rendering data anonymous requires preventing isolation. Trying to make this idea formal, we will see that it requires some refinement.
We restrict our attention to datasets x of size n, where each row is sampled independently from some underlying probability distribution D over a universe X. The dataset is assumed to contain personal data corresponding to individuals, with at most one row per individual. For example, x might consist of home listings, hospital records, internet browsing history, or any other personal information. A mechanism M takes x as input and outputs some data release M(x), be it a map of approximate addresses, aggregate statistics about disease, the result of applying machine learning techniques, or pseudonymized internet histories. We call M an “anonymization mechanism” because it purportedly anonymizes the personal data x.
An adversary attempts to output a predicate p that isolates a row in x, i.e., there exists i such that p(x_i) = 1 and p(x_j) = 0 for all j ≠ i. We emphasize that it is rows in the original dataset x on which the predicate acts, not the output M(x). In part, this is a by-product of our desire to make no assumptions on the form of M’s output. While it might make sense to apply a predicate to isolate a row in pseudonymized microdata, it is far from clear what it would mean for a synthetic dataset or for aggregate statistics. Observe that this choice also rules out predicates that “isolate” rows by referring to their position in x (i.e., “the seventh row”).
A mechanism M prevents isolation if no adversary isolates a row in x, except with very small probability over the randomness of sampling x, the randomness of the mechanism M, and the randomness of the adversary A. Unfortunately, this is impossible to achieve by any mechanism M. To wit, there are trivial adversaries—those that do not look at the outcome M(x)—that isolate a row with probability approximately 1/e ≈ 0.37. The trivial adversaries simply output any predicate that matches a 1/n fraction of the distribution D. Isolation is, hence, not necessarily indicative of a failure to protect against singling out, as a trivial adversary would succeed with probability about 1/e (for any M), even if M does not output anything at all.
Example: Consider a dataset of size n = 365, including information about 365 people selected at random from the US population. To isolate a person, a trivial adversary may output a predicate of weight approximately 1/365, such as “born on January 1.” This predicate will isolate a row with probability 365 · (1/365) · (1 − 1/365)^364 ≈ 0.37.
Furthermore, a trivial adversary need not know the distribution D to isolate with probability about 1/e, as long as D has sufficient min-entropy (Fact 3.1).
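To make the baseline concrete, here is a small simulation (ours, not from the article) of the 365-person example: the fixed predicate “birthday equals day 0” has weight w = 1/n, and the fraction of trials in which it matches exactly one row approaches n · w · (1 − w)^(n−1) ≈ 1/e.

import random

def trivial_isolation_rate(n=365, trials=20_000, seed=0):
    """Empirical probability that the fixed predicate 'birthday == 0'
    matches exactly one row of a dataset of n uniformly random birthdays."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        dataset = [rng.randrange(n) for _ in range(n)]   # n i.i.d. rows
        if dataset.count(0) == 1:                        # does the predicate isolate?
            hits += 1
    return hits / trials

n = 365
w = 1.0 / n
print(round(trivial_isolation_rate(), 3), round(n * w * (1 - w) ** (n - 1), 3))  # both ~0.37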
A trivial adversary can give us a baseline against which to measure isolation success. But the baseline should not simply be a 37% chance of success. Consider the earlier example of a dataset of 365 random Americans. What if an adversary output far more specific predicates, say a conjunction of a birth date and several additional rare attributes, and managed to isolate 10% of the time? Though 10% is much less than 37%, the predicate is extremely specific and unlikely to isolate a person by chance.
We formalize this intuition by considering the baseline risk of isolation as a function of the weight of the predicate p: the chance that p matches a random row sampled from the distribution D. The baseline for predicates of weight 1/n is about 37%, but the baseline for an extremely specific predicate may be much lower. The more specific the predicate, the closer the baseline gets to zero. Our primary focus in this paper is on the regime of predicate weights where the baseline is negligible, corresponding to predicates with negligible weight.¶
Definition 4.5 (PSO Security, Informal): An adversary predicate singles out a row in x if it outputs a predicate that isolates a row with probability significantly higher than the baseline risk. A mechanism is “secure against predicate singling out” (PSO secure) if no adversary can use its output to predicate single out.
B. Analyzing Security Against Predicate Singling Out.
Having formulated PSO security, our next goal is to understand the guarantee it offers, what mechanisms satisfy it, and how this concept relates to existing privacy concepts, including differential privacy and k-anonymity.
Two desirable properties of a privacy concept are robustness to “postprocessing” and to “composition.” The former requires that if a mechanism M is deemed secure, then anything that can be computed using the outcome of M should also be deemed secure. Hence, the outcome may be reused without creating additional privacy risk. For instance, if a PSO-secure mechanism outputs microdata, then any statistics that can be computed from those microdata should also be PSO secure. PSO security is robust to postprocessing. The analysis is simple and shown in ref. 9.
Incomposability of PSO Security.
We would like that the privacy risk of multiple data releases is not significantly greater than the accumulated risks of the individual releases. In this case, we say that the privacy concept composes. We prove that PSO security does not compose and give two examples of this failure. First, we show that releasing a single aggregate statistic is PSO secure, but superlogarithmically many statistics may allow an adversary to predicate single out. Concretely, a mechanism that outputs a single count is PSO secure. Yet a mechanism that outputs superlogarithmically many counts may allow an adversary to isolate a row with probability arbitrarily close to one by using a predicate with negligible weight (and negligible baseline). Second, we construct a less natural pair of mechanisms that individually are PSO secure but together allow an adversary to recover a row in the dataset. Consequently, the adversary can predicate single out by isolating this row using a predicate with negligible weight.
Do Differential Privacy and k-Anonymity Guarantee PSO Security?
Differential privacy is a requirement on data-analysis mechanisms that limits the dependency of a mechanism’s output distribution on any single individual’s contribution (10). k-anonymity is a requirement on data releases that the (quasi) identifying data of every person in the release should be identical to that of at least k − 1 other individuals in the release (11, 12) (see Definitions 6.1 and 7.1 for formal definitions).
Differential privacy is not necessary for PSO security, as an exact count is PSO secure, but is not differentially private. However, differential privacy does provide PSO security. The proof relies on the connection between differential privacy and statistical generalization guarantees (13, 14). We show that predicate singling out implies a form of overfitting to the underlying dataset. If a mechanism is differentially private, it prevents this form of overfitting and, hence, protects against predicate singling out.
On the other hand, we show that k-anonymity does not prevent predicate singling out attacks. Instead, it enables an adversary to predicate single out with probability approximately 37%, even using extremely low-weight predicates for which the baseline risk is negligible. Briefly, the attack begins by observing that typical k-anonymization algorithms “almost” predicate single out. They reveal simple predicates—usually, conjunctions of attributes—that are satisfied by groups of rows in the dataset. In an effort to make the k-anonymized data as useful as possible, these predicates are as descriptive and specific as possible. To predicate single out a row from the k-anonymous dataset, it roughly suffices to isolate a row from within one of these groups.
C. Implication for the GDPR.
Precisely formalizing predicate singling out attacks allows us to examine with mathematical rigor the extent to which specific algorithms and paradigms protect against them. In particular, we show that k-anonymity fails to prevent predicate singling out, whereas differential privacy does prevent it. Our conclusions contrast with those of the Article 29 Working Party: They conclude that k-anonymity eliminates the risk of singling out, while differential privacy “may not” (5). This disagreement may raise doubts about whether our modeling indeed matches the regulators’ intent, which we now address.
Our goal in interpreting the text of the GDPR and related documents, and in defining predicate singling out, is to provide a precise mathematical formalism to capture some aspect of the concept of personal data (as elucidated in the regulation and in ref. 7) and the associated concept of anonymization. We want to render mathematically falsifiable a claim that a given algorithmic technique anonymizes personal data under the GDPR by providing a necessary condition for such anonymizers.
We argue that predicate singling out succeeds. A number of modeling choices limit the scope of our definition, but limiting the scope poses no issue. Specifically, 1) we only consider randomly sampled datasets; 2) we only consider an attacker who has no additional knowledge of the dataset besides the output of a mechanism; and 3) we do not require that isolation be impossible, instead comparing against a baseline risk of isolation. A technique that purports to anonymize all personal data against all attackers must at least do so against randomly sampled data and against limited attackers. And unless the idea of anonymization mechanisms is completely vacuous, one must compare against a baseline risk.
We don’t mean to claim that our modeling is the only one possible. The starting point for the analysis is not a mathematical formalism but, rather, a (somewhat incomplete) description in natural language. Alternative mathematical formalizations of singling out could probably be extracted from the very same text, and we look forward to seeing them emerge.
Policy Implications.
Assuming our formalization is in agreement with the GDPR’s notion of singling out, the most significant consequence of our analysis is that k-anonymity does not prevent singling out (and likewise l-diversity and t-closeness). Thus, it is insufficient for rendering personal data anonymous and exempting them from GDPR regulation. If PSO security is a necessary condition for GDPR anonymization, then something more is required. On the other hand, differential privacy might provide the necessary level of protection. At least it is not ruled out by our analysis.
More abstractly, we believe that self-composition is an essential property of any reasonable privacy concept. Our finding that PSO security does not self-compose is evidence that self-composition should not be taken for granted, but be a criterion considered when developing privacy concepts.
One may still claim that the assessments made in ref. 5 should be taken as ground truth and that the Article 29 Working Party meant for any interpretation of singling out to be consistent with these assessments. According to this view, the protection provided by k-anonymity implicitly defines the meaning of singling out (partially or in full). Such a position would be hard to justify. To the best of our knowledge, the assessments made by the Article 29 WP were not substantiated by a mathematical analysis. Defining privacy implicitly as the guarantee provided by particular techniques is an approach proven to fail (1).
Is PSO Security a Good Privacy Concept?
A predicate singling out attack can be a stepping stone toward a greater harm, even in settings where isolation alone may not be harmful. It may enable linking a person’s record in the dataset to some external source of information (15), or targeting individuals for differential treatment. As such, it is meaningful as a mode of privacy failure, both in the GDPR context and otherwise.
While we believe that PSO security is relevant for the GDPR as a necessary property of techniques that anonymize personal data, we do not consider it sufficiently protective by itself. First, singling out is only one mode of privacy failure. Many other failure modes have been considered, including reconstruction attacks, membership attacks, inference, and linkage. Second, our definition considers a setting where the underlying data are drawn from some (unknown) underlying distribution, an assumption that is not true in many real-life contexts. In such contexts, PSO security may not prevent singling out under the GDPR. Lastly, the incomposability of PSO security renders it inadequate in isolation and suggests that it should be complemented by other privacy requirements.
3. Notation
A dataset x = (x_1, …, x_n) consists of n elements taken from the data universe X. We consider datasets where each entry is independently sampled from a fixed probability distribution D over X. We denote by y ∼ D a random variable sampled from D, and by y ← S an element sampled uniformly at random from a finite set S.
A mechanism M is a Turing machine that takes as input a dataset x ∈ X^n. Mechanisms may be randomized and interactive.
A predicate is a binary-valued function p : X → {0, 1}. We define the weight of a predicate p to be w_D(p) = Pr_{y∼D}[p(y) = 1]. For a dataset x, let p(x) = |{i : p(x_i) = 1}| denote the number of rows of x that satisfy p. We occasionally use indicator notation to define a predicate; for example, 1[y ∈ S] equals 1 if y ∈ S and 0 otherwise.
For the purposes of asymptotic analyses, we use the number of rows n in a dataset as the complexity parameter. We omit the dependence of X and D on n.∥ A function g is negligible, denoted g(n) = negl(n), if for all c > 0, there exists n_c such that g(n) < n^{−c} for all n > n_c. Informally, this means that g approaches 0 faster than any inverse polynomial function for large enough n.
Many of our results apply to all data distributions with sufficient “min-entropy,” a measure of randomness useful for cryptography. The min-entropy of a probability distribution D over a universe X is H_∞(D) = −log(max_{y∈X} Pr_D[y]). The relevant results apply to all distributions with even a moderate amount of min-entropy. We will call such distributions “min-entropic.”
Fact 3.1. For any min-entropic D and any weight w, there exist a predicate p and a negligible function negl such that |w_D(p) − w| ≤ negl(n) [using the Leftover Hash Lemma (16)].
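As an illustration (our own, with assumed parameters), the following sketch computes the min-entropy of an explicit distribution and builds a hash-based predicate whose weight under that distribution is close to a chosen target, in the spirit of the hashing argument behind Fact 3.1.

import hashlib, math

def min_entropy(probs):
    """H_inf(D) = -log2 of the largest point mass of an explicit distribution."""
    return -math.log2(max(probs.values()))

def hash_predicate(target_w, salt="s"):
    """Predicate of weight ~ target_w: accept y iff a hash of y, mapped to [0,1),
    falls below target_w (treats SHA-256 as if it were a random function)."""
    def p(y):
        h = hashlib.sha256((salt + repr(y)).encode()).digest()
        return int(int.from_bytes(h[:8], "big") / 2**64 < target_w)
    return p

universe = range(10_000)
probs = {y: (0.0002 if y < 1000 else 0.8 / 9000) for y in universe}   # sums to 1
print(round(min_entropy(probs), 2))                                    # ~12.3 bits

p = hash_predicate(target_w=0.01)
print(round(sum(probs[y] * p(y) for y in universe), 4))                # close to 0.01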
4. Security Against Predicate Singling Out
Consider a data controller who has in its possession a dataset x consisting of n rows sampled independently from a probability distribution D. We think of the dataset as containing the personal data of n individuals, one per row. The data controller publishes the output M(x) of an anonymization mechanism M applied to the dataset x. A predicate singling out adversary A is a nonuniform probabilistic Turing machine with access to the published output M(x) and produces a predicate p. We abuse notation and write A(M(x)), regardless of whether M is an interactive or noninteractive mechanism. For now, we assume that all adversaries have complete knowledge of D and are computationally unbounded.**
The Article 29 Working Party describes singling out as “isolat[ing] some or all records which identify an individual in [a] dataset” (5). This is done by “narrowing down [to a singleton] the group to which [the individual] belongs” by specifying “criteria which allows him to be recognized” (7). We call this row isolation.
Definition 4.1 (Row Isolation): A predicate p isolates a row in a dataset x if there exists a unique i such that p(x_i) = 1. That is, if p(x) = 1. We denote this event iso(p, x).
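In code, Definition 4.1 is simply (an illustration; the naming is ours):

def isolates(p, x):
    """True iff predicate p matches exactly one row of dataset x (row isolation)."""
    return sum(1 for row in x if p(row)) == 1

# Example: among these birthdays, 'day 203' is matched by exactly one row.
x = [17, 203, 17, 364, 99]
print(isolates(lambda row: row == 203, x))   # True
print(isolates(lambda row: row == 17, x))    # False (two matching rows)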
It is tempting to require that an anonymization mechanism only allows an adversary to isolate a row with negligible probability, but this intuition is problematic. An adversary that does not have access to M(x)—a trivial adversary—can output a predicate with weight approximately 1/n (by Fact 3.1) and thereby isolate a row in x with probability approximately (1 − 1/n)^{n−1} ≈ 1/e ≈ 0.37.
In Bounding the Baseline, we will see that, in many cases, the trivial adversary need not know the distribution to produce such a predicate.
A. Security Against Predicate Singling Out.
We first define a measure of an adversary A’s success probability in isolating a row given the output of a mechanism M while being restricted to output a predicate of at most a given weight w.
Definition 4.2 (Adversarial Success Probability): Let D be a distribution over X. For a mechanism M, an adversary A, a dataset size n, and a weight w, let

succ_{D,n,w}(A, M) = Pr_{x∼D^n, p←A(M(x))}[ p isolates a row in x and w_D(p) ≤ w ].
Instead of considering the absolute probability that an adversary isolates a row, we consider the increase in probability relative to a baseline risk: the probability of isolation by a trivial adversary, base_{D,n}(w).
Definition 4.3 (Trivial Adversary): A predicate singling out adversary T is trivial if the distribution over outputs of T is independent of its input. That is, for all inputs y and y′, the distributions of T(y) and T(y′) are identical.
Definition 4.4 (Baseline): For a distribution D, dataset size n, and weight w,

base_{D,n}(w) = sup_T succ_{D,n,w}(T, M),

where the supremum is taken over trivial adversaries T (the choice of M is immaterial, as a trivial adversary ignores its input).
The baseline lets us refine the Working Party’s conception of singling out as row isolation. We require that no adversary should have significantly higher probability of isolating a row than a trivial adversary, when both output predicates of weight less than w.
Definition 4.5 (Security Against Predicate Singling Out): For functions w = w(n), δ = δ(n), and a mechanism M, we say M is (w, δ)-secure against predicate singling out ((w, δ)-PSO secure) if for all adversaries A, all distributions D, all dataset sizes n, and all weights w′ ≤ w:

succ_{D,n,w′}(A, M) ≤ base_{D,n}(w′) + δ(n).
We often omit explicit reference to the parameter n for w and δ, and to the distribution D, when they are clear from context.
The definition is strengthened as w and δ get smaller. As shown next, base_{D,n}(w) is negligible when w is negligible. This is the most important regime of Definition 4.5, as such predicates not only isolate a row in the dataset, but likely also isolate an individual in the entire population.
We say a mechanism M is PSO secure if for every negligible function w(n) there exists a negligible function δ(n) such that M is (w, δ)-PSO secure. Observe that if for all negligible w, all D, all n, and all A, succ_{D,n,w}(A, M) ≤ poly(n) · base_{D,n}(w) + negl(n), then M is PSO secure.††
B. Bounding the Baseline.
In this section, we characterize the baseline over intervals of predicate weights in terms of a simple function f. For n and a predicate of weight w, the probability over x ∼ D^n that the predicate isolates a row in x is

f(w, n) = n · w · (1 − w)^{n−1}.
f(w, n) is maximized at w = 1/n and strictly decreases moving away from the maximum. f(1/n, n) approaches 1/e as n grows and does so from above (recall that (1 − 1/n)^{n−1} ≈ 1/e even for relatively small values of n).
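A quick numerical check of these properties of f (our illustration):

import math

def f(w, n):
    """Probability that a weight-w predicate matches exactly one of n i.i.d. rows."""
    return n * w * (1 - w) ** (n - 1)

n = 1000
print(round(f(1 / n, n), 4), round(1 / math.e, 4))              # ~0.3681 vs 0.3679
print(f(1 / n, n) > f(0.5 / n, n), f(1 / n, n) > f(2 / n, n))   # maximum at w = 1/n
print(f(2.0 ** -40, n))                                         # negligible weight -> tiny baseline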
Claim 4.1. For all D, n, and w ≤ 1/n, base_{D,n}(w) ≤ f(w, n). For min-entropic D, base_{D,n}(w) ≥ f(w, n) − negl(n).
Proof: Observe that for any randomized trivial adversary T, there exists a deterministic trivial adversary T′ such that succ_{D,n,w}(T′, M) ≥ succ_{D,n,w}(T, M). Therefore, without loss of generality, one can assume that the trivial adversary that achieves the baseline success probability is deterministic.
A deterministic trivial adversary T always outputs the same predicate p, of some weight w_D(p) ≤ w.
Therefore,

succ_{D,n,w}(T, M) = f(w_D(p), n) ≤ f(w, n),

where the last inequality follows from the monotonicity of f in the range [0, 1/n]. By Fact 3.1, for min-entropic D there exists a predicate p with |w_D(p) − w| ≤ negl(n). Hence, base_{D,n}(w) ≥ f(w, n) − negl(n).
Examples: For a (possibly randomized) function g, consider the mechanism M_g that on input x = (x_1, …, x_n) outputs (g(x_1), …, g(x_n)). Whether M_g is PSO secure depends on the choice of g:
• Identity function: If g(y) = y, then M_g(x) = x and every distinct x_i ∈ x can be isolated by the predicate p(y) = 1[y = x_i]. If Pr_{y∼D}[y = x_i] is negligible (e.g., when D is uniform over a large universe or D is min-entropic), then p has negligible weight. Hence, M_g is not PSO secure.
• Pseudonymization: If g is one-to-one and public, it offers no more protection than the identity function. For a unique x_i ∈ x and z = g(x_i), the predicate p(y) = 1[g(y) = z] isolates x_i. If D is min-entropic, w_D(p) is negligible. Observe that M_g is not PSO secure, even if g is not efficiently computable. Furthermore, g being many-to-one does not guarantee PSO security. For instance, suppose the data are uniform over {0,1}^d and g outputs the last ℓ bits of an input y ∈ {0,1}^d, for some ℓ = ω(log n). M_g is not PSO secure. In fact, it is possible to single out every row using the same predicates as above. Together, these observations challenge the use of some forms of pseudonymization (see the sketch after this list).
• Random function: If g is a secret, random function, then M_g(x) carries no information about x and provides no benefit to the adversary over a trivial adversary. For every adversary A, a trivial adversary can perfectly simulate A by executing it on a random input. Hence, M_g is PSO secure.
C. Reflections on Modeling.
In many ways, Definition 4.5 demands the sort of high level of protection typical in the foundations of cryptography. It requires a mechanism to provide security for all distributions and against nonuniform, computationally unbounded adversaries.‡‡ The main weakness in the required protection is that it considers only data that are independent and identically distributed (i.i.d.), whereas real-life data cannot generally be modeled as i.i.d.
Any mechanism that purports to be a universal anonymizer of data under the GDPR—by transforming personal data into nonpersonal data—must prevent singling out. Our definition is intended to capture a necessary condition for a mechanism to be considered as rendering data sufficiently anonymized under the GDPR. Any mechanism that prevents singling out in all cases must prevent it in the special case that the data are sampled i.i.d. from a distribution D. We view a failure to provide security against predicate singling out (Definition 4.5) as strong evidence that a mechanism does not prevent singling out as conceived of by the GDPR.
On the other hand, satisfying Definition 4.5 is not sufficient for arguing that a mechanism renders data anonymized under the GDPR. Singling out is only one of the many “means reasonably likely to be used” (4) to identify a person in a data release. Furthermore, the definition considers only i.i.d. data; it may not even imply that a mechanism prevents singling out in other relevant circumstances.
It is important that our definitions are parameterized by the weight w. An unrestricted trivial adversary can isolate a row with probability of about 37% by outputting a predicate of weight about 1/n. If 37% was used as the general baseline (without consideration of the weight of the predicate), then the definition would permit an attacker to learn very specific information about individuals in a dataset, as long as it does so with probability less than 37%. For instance, such a definition would permit a mechanism that published a row from the dataset with a one in three chance.
Remark 4.2. The baseline is also negligible when the trivial adversary is required to output predicates with weight at least ω(log n)/n (so that f(w, n) is negligible). It is not clear to the authors how beneficial finding a predicate in this regime may be to an attacker. This high-weight regime is analyzed analogously in ref. 9.
5. Properties of PSO Security
Two desirable properties of privacy concepts are 1) immunity to postprocessing (i.e., further processing of the outcome of a mechanism, without access to the data, should not increase privacy risks), and 2) closure under composition (i.e., a combination of two or more mechanisms which satisfy the requirements of the privacy concept is a mechanism that also satisfies the requirements, potentially with worse parameters). It is easy to see that PSO security withstands postprocessing. However, it does not withstand composition, and we give two demonstrations for this fact.
First, we consider mechanisms which count the number of dataset rows satisfying a property and show that every such mechanism is PSO secure. However, there exists a collection of superlogarithmically many counts which allows an adversary to isolate a row with probability arbitrarily close to one using a predicate with negligible weight. Second, we construct a (less natural) pair of mechanisms that are individually PSO secure, but together allow the recovery of a row in the dataset. This construction borrows ideas from ref. 17.
Not being closed under composition is a significant weakness of the notion of PSO security. Our constructions rely on very simple mechanisms that we expect would be considered as preventing singling-out attacks (as a legal matter), even under other formulations of the concept. As such, it may well be that nonclosure under composition is an inherent property of the concept of singling out.
As a policy matter, we believe that closure under composition (as well as immunity to postprocessing) should be considered prerequisites for any privacy concept deemed sufficient to protect individuals’ sensitive data. Pragmatically, the fact that PSO security is not closed under composition suggests that this concept is best used for disqualifying privacy technology (i.e., if it is not PSO secure). This concept should not be used alone to certify or approve the use of any technology.
A. A PSO-Secure Counting Mechanism.
For any predicate q, we define the corresponding Counting Mechanism M_q:
Algorithm 1: Counting Mechanism M_q
  input: x = (x_1, …, x_n)
  return q(x) = Σ_{i=1}^{n} q(x_i)
For example, consider the least-significant-bit predicate q that takes as input a string y ∈ {0,1}^d and outputs its first bit y_1. The corresponding Counting Mechanism M_q returns the sum of the first column of x.
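In code, the Counting Mechanism is one line (illustration):

def counting_mechanism(q, x):
    """M_q: release the exact number of rows of x satisfying the predicate q."""
    return sum(q(row) for row in x)

# The first-bit example from the text, with rows given as bit strings.
x = ["1011", "0010", "1100", "0001"]
print(counting_mechanism(lambda row: int(row[0]), x))   # 2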
M_q is PSO secure for any predicate q. This is a corollary of the following proposition.
Proposition 5.1. For all predicates q, distributions D, dataset sizes n, and weights w, and for every adversary A, succ_{D,n,w}(A, M_q) ≤ (n + 1) · base_{D,n}(w).
Proof: We define a trivial adversary T such that for all adversaries A, succ_{D,n,w}(A, M_q) ≤ (n + 1) · succ_{D,n,w}(T, M_q). The proposition follows by the definition of base_{D,n}(w). T samples a random c ← {0, 1, …, n} and returns A(c). For all datasets x, there exists c_x ∈ {0, …, n} such that M_q(x) = c_x. For all x, T behaves exactly as A(M_q(x)) whenever it happens to sample c = c_x, which occurs with probability 1/(n + 1). Therefore, succ_{D,n,w}(T, M_q) ≥ succ_{D,n,w}(A, M_q)/(n + 1).
As exact counts are not differentially private, the PSO security of M_q demonstrates that differential privacy (Definition 6.1) is not necessary for PSO security. The security of a single exact count easily extends to O(1)-many counts (even adaptively chosen), as the size of the codomain grows polynomially.
B. Failure to Compose.
Our next theorem states that a fixed set of counts suffices to predicate single out with probability close to 1/e. For a collection of predicates Q = (q_1, …, q_m), let M_Q(x) = (q_1(x), …, q_m(x)). Let U be the uniform distribution over {0,1}^d for some d.
Theorem 5.2. For any ℓ = ℓ(n) ≤ d − ⌈log₂ n⌉, there exist a collection Q of ℓ counting predicates and an adversary A such that, for w = 2^{−(⌈log₂ n⌉+ℓ)},

succ_{U,n,w}(A, M_Q) ≥ (1/2) · (1 − 1/n)^{n−1}.

Choosing ℓ = ω(log n) yields a negligible weight w and, hence, a negligible baseline.
Proof: Treating y ∈ {0,1}^d as a binary number in [0, 1), let num(y) = Σ_{j=1}^{d} y_j · 2^{−j}. Observe that for y ∼ U, num(y) is uniform over {0, 2^{−d}, …, 1 − 2^{−d}}.
Let t = ⌈log₂ n⌉ and let b(y) = 1[num(y) < 2^{−t}] be the predicate that equals 1 exactly when the first t bits of y are all zero. For j ∈ [ℓ], define the predicate q_j(y) = b(y) · y_{t+j}, and let Q = (q_1, …, q_ℓ). Observe that if it happens that b isolates a row x_i of x, then q_j(x) = x_{i,t+j} for every j ∈ [ℓ]. Consider the deterministic adversary that on input (c_1, …, c_ℓ) = M_Q(x) outputs the predicate

p(y) = b(y) · Π_{j=1}^{ℓ} 1[y_{t+j} = c_j].

Observe that w_U(p) = 2^{−(t+ℓ)} ≤ w and that by construction p isolates a row in x whenever b does. Thus,

succ_{U,n,w}(A, M_Q) ≥ Pr_{x∼U^n}[b isolates a row in x] = n · 2^{−t} · (1 − 2^{−t})^{n−1} ≥ (1/2) · (1 − 1/n)^{n−1}.
Remark 5.3. When the attack succeeds, all of the predicates in Q match 0 or 1 rows in x. It may seem that an easy way to counter the attack is by masking low counts, a common measure taken, e.g., in contingency tables. However, it is easy to modify the attack to only use predicates matching roughly n/2 rows using one extra query. This means that restricting the mechanism to suppress low counts cannot prevent this type of attack. Let r be a predicate with w_U(r) = 1/2 (e.g., parity of the bits), and let q′_j = q_j ⊕ r. The attack succeeds whenever the original attack does, since the counts of the q_j can be recovered from those of r and the q′_j. If r and the q_j are independent, then the released counts concentrate around n/2 and this occurs with probability at least (1/2)(1 − 1/n)^{n−1} − negl(n). As before, the probability can be amplified to 1 − negl(n).
While a single count is PSO secure for any data distribution, the above attack against multiple counts applies only to the uniform distribution U. We can extend the attack to general min-entropic distributions at the cost of randomizing the attacked mechanism (i.e., the set of predicates Q). Furthermore, for min-entropic D, this probability can be amplified to 1 − negl(n) by repetition (9).
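The following simulation (our illustration; it follows the outline above, but the specific choice of predicates is ours) releases ℓ exact counts over uniform bit strings and then singles out with a predicate of weight 2^{−(t+ℓ)}: each counting predicate is nonzero only on rows whose first t ≈ log n bits are zero, so when exactly one such row exists (probability about 1/e), the released counts spell out its next ℓ bits.

import math, random

rng = random.Random(1)
n, d = 128, 64
t = math.ceil(math.log2(n))       # prefix length ~ log n
ell = 40                          # number of released counts

def prefix_zero(y):               # b(y): weight 2^-t under the uniform distribution
    return all(bit == 0 for bit in y[:t])

def attack_once():
    x = [[rng.randrange(2) for _ in range(d)] for _ in range(n)]                     # x ~ U^n
    counts = [sum(prefix_zero(row) * row[t + j] for row in x) for j in range(ell)]   # M_Q(x)
    # Adversary's predicate: zero prefix AND the next ell bits equal the released counts.
    p = lambda y: prefix_zero(y) and all(y[t + j] == counts[j] for j in range(ell))
    return sum(bool(p(row)) for row in x) == 1

trials = 500
print(sum(attack_once() for _ in range(trials)) / trials)   # ~0.37, using a weight 2^-(t+ell) predicate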
C. Failure to Compose Twice.
Borrowing ideas from refs. 17 and 18, we construct two mechanisms M_1 and M_2 which are individually PSO secure (for arbitrary distributions), but which together allow an adversary to single out with high probability when the data are uniformly distributed over the universe {0,1}^d. With more work, this composition attack can be extended to more general universes and to min-entropic distributions.
Theorem 5.4. M_1 and M_2 described below are PSO secure. For n, d = ω(log n), a suitable ℓ = ω(log n), and the uniform distribution U over {0,1}^d, there exists an adversary A such that

succ_{U,n,w}(A, M_{1,2}) ≥ 1 − negl(n),

where w = 2^{−ℓ} and M_{1,2} is the mechanism that releases the outputs of both M_1 and M_2.
We divide the input dataset x into two parts. We treat x_1, …, x_{n−1} as a “source of randomness” and x_n as a “message.” M_1 outputs an encryption secret key s extracted from the rows x_1, …, x_{n−1}, using the von Neumann extractor as described in Algorithm 2.
M_2 runs M_1(x). If the resulting key s has at least ℓ bits, it outputs s ⊕ x_n (using s as a one-time pad to encrypt the first ℓ bits of x_n); otherwise, it outputs ⊥. Alone, neither M_1 nor M_2 allows the adversary to single out, but using both, an adversary can recover the first ℓ bits of x_n and thereby predicate single out the last row.
Proof Outline: We must prove that M_1 and M_2 are PSO secure and that M_{1,2} is not. Note that the security of M_1 and M_2 does not follow merely from the fact that their outputs are nearly uniform.§§
M_1 is PSO secure.
Consider the mechanism M′ that samples a random permutation of the rows of x and applies M_1 to the permuted dataset. For any x, the output of M′ is uniform conditioned on not outputting ⊥. Its security is therefore equivalent to that of a mechanism outputting a single bit of information (whether or not the output is ⊥), which itself is PSO secure by Proposition 5.1 (and therefore also PSO secure by the observation at the end of Security Against Predicate Singling Out). To complete the proof, one can show that succ_{D,n,w}(A, M_1) = succ_{D,n,w}(A, M′) for every A, because the rows of x are i.i.d. and isolation is invariant under permuting them.
M_2 is PSO secure.
We separately consider the two possible values of p(x_n), where p is the predicate returned by the adversary A, and write succ_{U,n,w}(A, M_2) accordingly as a sum of two terms.

The output of M_2 is information-theoretically independent of x_n: the extracted key is computed from x_1, …, x_{n−1} alone and is uniform given its length, so the one-time pad hides x_n. Thus, for any adversary, the term corresponding to p(x_n) = 1 is at most w plus a negligible quantity.

If A singles out and p(x_n) = 0, then it is effectively singling out against the subdataset (x_1, …, x_{n−1}). That is, the remaining term is bounded by the success probability of a related adversary attacking that subdataset.

We construct A′ that tries to single out against the mechanism restricted to the subdataset using A. On input that mechanism’s output, A′ samples a fresh “message” row on its own and runs A on the resulting simulated output of M_2.

By the PSO security of the restricted mechanism, the remaining term is negligible.
Insecurity of M_{1,2} for the uniform distribution.
The output of M_{1,2} is a pair (s, c). If c = ⊥, the adversary A aborts. By a Chernoff bound, for a suitable choice of ℓ, Pr[c = ⊥] ≤ negl(n). If c ≠ ⊥, A recovers z = s ⊕ c, the first ℓ bits of x_n, and outputs the predicate

p(y) = 1[the first ℓ bits of y equal z].
By the choice of ℓ, w_U(p) = 2^{−ℓ} ≤ w. Moreover, p isolates x_n unless some other row of x agrees with x_n on its first ℓ bits, which happens with probability at most n · 2^{−ℓ}. The bound on succ_{U,n,w}(A, M_{1,2}) follows, completing the proof of the claim and the theorem.
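A toy rendering of the pair (our sketch; Algorithm 2 is not reproduced in the text, so the extractor details and the exact split of the dataset are illustrative assumptions): M1 extracts a key from the “source” rows with the von Neumann extractor, M2 one-time-pads the first bits of the “message” row, and an adversary holding both outputs recovers those bits and isolates the last row with a predicate of negligible weight.

import secrets

ELL = 32   # length of the one-time pad and of the recovered prefix

def von_neumann_extract(bits):
    """Von Neumann extractor: 01 -> 0, 10 -> 1, discard 00 and 11."""
    return [a for a, b in zip(bits[0::2], bits[1::2]) if a != b]

def M1(x):
    """Key extracted from the first bit of each 'source' row x_1..x_{n-1}."""
    key = von_neumann_extract([row[0] for row in x[:-1]])
    return key[:ELL] if len(key) >= ELL else None

def M2(x):
    """One-time-pad encryption of the first ELL bits of the 'message' row x_n."""
    key = M1(x)
    return None if key is None else [k ^ b for k, b in zip(key, x[-1][:ELL])]

n, d = 400, 64
x = [[secrets.randbits(1) for _ in range(d)] for _ in range(n)]   # uniform data

s, c = M1(x), M2(x)
if s is not None:                                   # extraction fails only with tiny probability
    z = [si ^ ci for si, ci in zip(s, c)]           # = first ELL bits of the last row
    p = lambda y: y[:ELL] == z                      # predicate of weight 2^-ELL
    print(sum(p(row) for row in x) == 1)            # True w.h.p.: the last row is isolated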
D. Singling Out and Failure to Compose.
The failure to compose demonstrated in Theorem 5.2 capitalizes on the use of multiple counting queries. Such queries underlie a large variety of statistical analyses and machine learning algorithms. We expect that other attempts to formalize security against singling out would also allow counting queries. If so, our negative composition results may extend to other formalizations.
The failure to compose demonstrated in Theorem 5.4 is more contrived. We expect that other attempts to formalize security against singling out would allow mechanisms like M_1—ones where for every input, randomly permuting the input and applying the mechanism results in the uniform distribution over outputs (as in the proof of Theorem 5.4). It is less clear to us whether other possible formalizations of security against singling out would allow a mechanism like M_2. If it is to compose, it likely must not.
6. Differential Privacy Provides PSO Security
In this section, we demonstrate that differential privacy implies PSO security. Because exact counts are not differentially private but are PSO secure (Proposition 5.1), we have already shown that PSO security does not imply differential privacy.
Recall that we model x as containing personal data of n distinct individuals. For datasets x, x′ ∈ X^n, we write x ∼ x′ if x and x′ differ on the data of exactly one individual.
Definition 6.1 [Differential Privacy (10, 19)]: A randomized mechanism M is (ε, δ)-differentially private if for all x ∼ x′ and for all events E,

Pr[M(x) ∈ E] ≤ e^{ε} · Pr[M(x′) ∈ E] + δ,

where the probability is taken over the randomness of the mechanism M.
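For contrast with the exact Counting Mechanism of Section 5, here is a standard ε-differentially private count (a routine Laplace-mechanism sketch; the article itself does not prescribe any particular mechanism):

import random

def dp_count(q, x, epsilon, rng=random.Random(0)):
    """Release the count of rows satisfying q plus Laplace(1/epsilon) noise.
    A count has sensitivity 1: changing one row changes it by at most 1."""
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)   # Laplace(0, 1/epsilon)
    return sum(q(row) for row in x) + noise

x = ["1011", "0010", "1100", "0001", "1110"]
print(dp_count(lambda row: int(row[0]), x, epsilon=0.5))          # 3 + noise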
Our analysis relating PSO security to differential privacy is through a connection of both concepts to statistical generalization. For differential privacy, this connection was established in refs. 13 and 14. We use a variant of the latter from ref. 20:
Lemma 6.1 (Generalization Lemma). For all distributions D and for all (ε, δ)-differentially private algorithms A operating on a dataset x ∼ D^n and outputting a predicate p,

E_{x∼D^n, p←A(x)}[w_D(p)] ≥ e^{−ε} · E_{x∼D^n, p←A(x)}[p(x)/n] − δ.
Theorem 6.2. For all ε, δ, and w, if M is (ε, δ)-differentially private, then M is (w, δ′)-PSO secure for

δ′ = e^{ε} · n · (w + δ).

In particular, for constant ε, negligible w, and δ = negl(n)/n, M is PSO secure.
Proof: For simplicity of exposition, we present the proof for min-entropic distributions D. The proof for general distributions follows from a similar argument.
Given A, M(x), and D, define the predicate p′:

p′(y) = p(y) · 1[w_D(p) ≤ w], where p = A(M(x)).

Observe that w_D(p′) ≤ w and that p′ isolates a row in x exactly when A predicate singles out with a predicate of weight at most w. The predicate p′ can be computed from A, M(x), and D without further access to x. Because differential privacy is closed under postprocessing, if M is (ε, δ)-differentially private, then the computation that produces p′ is (ε, δ)-differentially private as well. Recall that if p′ isolates a row in x then p′(x) = 1; hence E[p′(x)/n] ≥ succ_{D,n,w}(A, M)/n, and applying Lemma 6.1 together with w_D(p′) ≤ w yields the theorem.
7. Does k-Anonymity Provide PSO Security?
k-anonymity (11, 12) is a strategy intended to help a data holder “release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful” (12). It is achieved by making each individual in a data release indistinguishable from at least k − 1 other individuals along certain attributes. Typically, a k-anonymized dataset is produced by subjecting the original dataset to a sequence of generalization and suppression operations.
In this section, we analyze the extent to which k-anonymity provides PSO security. We show that a k-anonymized dataset typically provides an attacker with information sufficient to predicate single out with constant probability. This result challenges the determination of the Article 29 Working Party.¶¶
A. Preliminaries.
Let A_1, …, A_d be attribute domains. A dataset x = (x_1, …, x_n) is a collection of rows x_i = (x_{i,1}, …, x_{i,d}), where x_{i,j} ∈ A_j. For subsets z_j ⊆ A_j, we view z = (z_1, …, z_d) as a set in the natural way, writing y ∈ z if y_j ∈ z_j for all j ∈ [d]. We say that a dataset z = (z_1, …, z_n) is derived from x by generalization and suppression if x_i ∈ z_i for all i ∈ [n]. For example, if the attributes correspond to “5-Digit ZIP Code,” “Gender,” and “Year of Birth,” then a row x_i = (02139, F, 1965) may be generalized to z_i = (0213∗, F, 1960–1969), where ∗ denotes a suppressed value.
k-anonymity aims to capture a sort of anonymity in a (very small) crowd: A data release is k-anonymous if any individual row in the release is indistinguishable from at least k − 1 other individual rows. Let the multiplicity of z_i be the number of rows in z which agree with z_i.***
Definition 7.1 (k-Anonymity [Rephrased from Ref. 12]): For k ∈ ℕ, a dataset z is k-anonymous if the multiplicity of z_i is at least k for all i ∈ [n]. An algorithm is called a k-anonymizer if on an input dataset x its output is a k-anonymous z which is derived from x by generalization and suppression.
For a k-anonymous dataset z, let p_{z_i} be the predicate that returns 1 on input y if z_i could have been derived from y by generalization and suppression (i.e., p_{z_i}(y) = 1[y ∈ z_i]). Let w_i = w_D(p_{z_i}). We assume for simplicity that k-anonymizers always output z in which every row has multiplicity exactly k, but the theorem generalizes to other settings.
B. k-Anonymity Enables Predicate Singling Out.
Before presenting the main theorem of this section, we provide an example of a very simple k-anonymizer that fails to provide security against predicate singling out. Let U be the uniform distribution over {0,1}^d.
Consider the k-anonymizer that processes groups of k rows in index order and suppresses all bit locations where any of the rows disagree. Namely, for each group of k rows (x_{(g−1)k+1}, …, x_{gk}), it outputs k copies of the string z_g ∈ {0, 1, ∗}^d, where z_{g,j} = b if x_{i,j} = b for every row i in the group (i.e., all of the rows in the group have b as their jth bit) and z_{g,j} = ∗ otherwise.
The predicate p_{z_1} evaluates to 1 on input y if y_j = z_{1,j} for all j with z_{1,j} ≠ ∗ and evaluates to 0 otherwise. Namely, p_{z_1} checks whether y agrees with z_1 (and, hence, with all of the rows in the first group) on all nonsuppressed bits.
In expectation, d · 2^{−(k−1)} positions of z_1 are not suppressed. For large enough d, with high probability over the choice of x, at least d · 2^{−k} positions in z_1 are not suppressed. In this case, w_U(p_{z_1}) ≤ 2^{−d·2^{−k}}, which is negligible for any constant k.
We now show how z_1 can be used adversarially. In expectation, d · (1 − 2^{−(k−1)}) positions of z_1 are suppressed. For large enough d, with high probability over the choice of x, at least d/2 of the positions in z_1 are suppressed. Denote these positions j_1 < ⋯ < j_m. Define the predicate q that evaluates to 1 on input y if the binary number resulting from concatenating y_{j_1} ⋯ y_{j_m} is greater than (1 − 1/k) · 2^m and 0 otherwise. Note that w_U(q) ≈ 1/k and, hence, q isolates a row within group 1 with probability approximately (1 − 1/k)^{k−1} ≥ 1/e, as was the case with the trivial adversary described after Definition 4.1.
An attacker observing z can now define the predicate p = p_{z_1} ∧ q. By the analysis above, w_U(p) is negligible (as it is bounded by w_U(p_{z_1})) and p isolates a row in x with probability approximately (1 − 1/k)^{k−1} ≥ 1/e. Hence, the k-anonymizer of this example fails to protect against singling out.
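A simulation of this example (the parameters are ours):

import secrets

def k_anonymize(x, k):
    """Per group of k rows (in index order), suppress ('*') every bit position on
    which the group disagrees; output k copies of the generalized record."""
    z = []
    for g in range(0, len(x), k):
        group = x[g:g + k]
        rec = ["*" if len({row[j] for row in group}) > 1 else group[0][j]
               for j in range(len(group[0]))]
        z.extend([rec] * len(group))
    return z

n, d, k = 100, 512, 4
x = [[secrets.randbits(1) for _ in range(d)] for _ in range(n)]
z1 = k_anonymize(x, k)[0]
suppressed = [j for j in range(d) if z1[j] == "*"]

def q(y):   # weight ~1/k predicate on the suppressed positions of group 1
    num = int("".join(str(y[j]) for j in suppressed), 2)
    return num > (1 - 1 / k) * 2 ** len(suppressed)

def p(y):   # p_{z_1} AND q: negligible weight, isolates a row of group 1 with constant probability
    return all(y[j] == z1[j] for j in range(d) if z1[j] != "*") and q(y)

print(sum(p(row) for row in x))   # equals 1 with probability ~ (1 - 1/k)^(k-1) >= 1/e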
Theorem 7.1 captures the intuition from this example and generalizes it, demonstrating that k-anonymity does not typically protect against predicate singling out.
Theorem 7.1. For any k, there exists an adversary A such that for all min-entropic D, all k-anonymizers M, and all weights w:

succ_{D,n,w}(A, M) ≥ Pr_{x∼D^n, z←M(x)}[w_D(p_{z_1}) ≤ w] · ((1 − 1/k)^{k−1} − negl(n)).
To predicate single out, the adversary must output a predicate that both isolates and has low weight. The theorem states that these two requirements essentially decompose: the first factor is the probability that the k-anonymizer induces a low-weight predicate p_{z_1}, and the second factor is, up to a negligible term, the probability that a trivial adversary predicate singles out a subdataset of size k. Algorithms for k-anonymity generally try to preserve as much information in the dataset as possible. Thus, we expect such algorithms to typically yield low-weight predicates and correspondingly high values of the first factor.
Proof Outline: A will construct some predicate q and output the conjunction p = p_{z_1} ∧ q. Noting that w_D(p) ≤ w_D(p_{z_1}), and that p isolates a row in x if q isolates a row within the first group and no row outside the group satisfies p,

succ_{D,n,w}(A, M) ≥ Pr[w_D(p_{z_1}) ≤ w] · Pr[q isolates a row within the first group | w_D(p_{z_1}) ≤ w] − negl(n).   [1]
It remains only to show that for min-entropic distributions D, the conditional probability in Eq. 1 is at least (1 − 1/k)^{k−1} − negl(n). This claim is reminiscent of Fact 3.1, but with an additional challenge. The rows in the first group are not distributed according to D; instead, they are a function of z_1 and the whole dataset x. They are not independently distributed, and even their marginal distributions may be different from D. Nevertheless, for the purposes of the baseline, the rows in the group have enough conditional min-entropy to behave like random rows.
Data Availability
This paper does not include original data.
Acknowledgments
We thank Uri Stemmer, Adam Sealfon, and anonymous reviewers for their helpful technical suggestions. This material is based upon work supported by the US Census Bureau under Cooperative Agreement CB16ADR0160001 and by NSF Grant CNS-1413920. A.C. was supported by the 2018 Facebook Fellowship and Massachusetts Institute of Technology’s (MIT’s) RSA Professorship and Fintech Initiative. This work was done while A.C. was at MIT, in part while visiting Georgetown University, and while both authors visited the Simons Institute at the University of California, Berkeley. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of funders.
Footnotes
The authors declare no competing interest.
This article is a PNAS Direct Submission.
*Title 13 of the US Code mandates the role of the US Census.
†This point is emphasized in Recital 26: “The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
‡The Article 29 Data Protection Working Party was established by the EU Data Protection Directive and issued guidance on the meaning of the Directive. The opinions we consider in this article have not been officially updated since GDPR’s passing. Moreover, GDPR tacitly endorses the Working Party’s guidance on singling out: It borrows the language of Recital 26 almost verbatim from the Data Protection Directive, but adds the phrase “such as singling out.”
§The notion of “singling out” is not defined in the Opinion on the Concept of Personal Data (7). It is used in ref. 7 four times, each consistent with the above interpretation. Our interpretation coincides with and was initially inspired by that of ref. 8, defining “singling out as occurring when an analyst correctly makes a statement of the form ‘There is exactly one user that has these attributes.’”
¶For completeness, one can also consider predicates of weight ω(log n)/n or larger, for which the baseline is also negligible. See Remark 4.2.
∥More formally, we can consider an ensemble of data domains {X_n} and an ensemble of distributions {D_n}, where each D_n is a distribution over X_n.
**As is typical in cryptography, strengthening the adversary to be nonuniform (including possibly having full knowledge of the distribution D) yields a stronger security definition. See Reflections on Modeling, where we reexamine these choices.
††For any , let .
‡‡It is reasonable to limit the adversary in Definition 4.5 to polynomial time. If we restricted our attention to distributions with moderate min-entropy, the results in this paper would remain qualitatively the same: Our trivial adversaries and lower bounds are all based on efficient and uniform algorithms; our upper bounds are against unbounded adversaries. Relatedly, restricting to min-entropic distributions would allow us to switch the order of quantifiers in the definition of the baseline without affecting our qualitative results.
§§For example, the mechanism that outputs x_1 may have a uniform output, but it trivially allows singling out. Security would follow if the output were nearly uniform conditioned on x, but M_1 and M_2 do not satisfy this extra property.
¶¶Our results hold equally for l-diversity (21) and t-closeness (22), which the Article 29 Working Party also concludes prevent singling out.
***Often the agreement count is parameterized by a subset of the attribute domains called a quasi-identifier. This parameterization does not affect our analysis, and we omit it for simplicity.
References
- 1.Ohm P., Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Rev. 57, 1701–1777 (2010). [Google Scholar]
- 2.Dwork C., McSherry F., Nissim K., Smith A., Calibrating noise to sensitivity in private data analysis. J. Priv. Confid. 7, 17–51 (2017). [Google Scholar]
- 3.Nissim K., Wood A., Is privacy privacy? Philos. Trans. R. Soc. A. 376, 20170358 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.European Parliamentand the Council of the European Union , Regulation (EU) 2016/679, 206. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679. Accessed 1 April 2019.
- 5.Article 29 Data Protection Working Party, Opinion 05/2014 on anonymisation techniques. https://iapp.org/media/pdf/resource_center/wp216_Anonymisation-Techniques_04-2014.pdf. Accessed 1 April 2019.
- 6.Nissim K., et al. , Bridging the gap between computer science and legal approaches to privacy. Harv. J. Law Technol. 31, 687–780 (2018). [Google Scholar]
- 7.Article 29 Data Protection Working Party, Opinion 04/2007 on the concept of personal data. https://iapp.org/media/pdf/resource_center/wp136_concept-of-personal-data_06-2007.pdf (2007). Accessed 1 April 2019.
- 8.Francis P., et al. , Extended Diffix-Aspen. arXiv:1806.02075 (6 June 2018).
- 9.Cohen A., Nissim K., Towards formalizing the GDPR’s notion of singling out. arXiv:1904.06009 (12 April 2019).
- 10.Dwork C., McSherry F., Nissim K., Smith A., “Calibrating noise to sensitivity in private data analysis” in Third Theory of Cryptography Conference, Halevi S., Rabin T., Eds. (Lecture Notes in Computer Science, Springer, Berlin, Germany, 2006), vol. 3876, pp. 265–284. [Google Scholar]
- 11.Samarati P., Sweeney L., “Generalizing data to provide anonymity when disclosing information” in Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Mendelzon A. O., Paredaens J., Eds. (ACM Press, New York, NY, 1998), p. 188. [Google Scholar]
- 12.Sweeney L., k-anonymity: A model for protecting privacy. Internat. J. Uncertainty Fuzziness Knowledge-Based Syst. 10, 557–570 (2002). [Google Scholar]
- 13.Dwork C., et al. , “Preserving statistical validity in adaptive data analysis” in Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Servedio R. A., Rubinfeld R., Eds. (ACM, New York, NY, 2015), pp. 117–126. [Google Scholar]
- 14.Bassily R., et al. , “Algorithmic stability for adaptive data analysis” in Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Wichs D., Mansour Y., Eds. (ACM, New York, NY, 2016), pp. 1046–1059. [Google Scholar]
- 15.Narayanan A., Shmatikov V., “Robust de-anonymization of large sparse datasets” in IEEE Symposium on Security and Privacy, 2008, SP 2008 (IEEE, Piscataway, NJ, 2008), pp. 111–125. [Google Scholar]
- 16.Dodis Y., Ostrovsky R., Reyzin L., Smith A., Fuzzy extractors: How to generate strong keys from biometrics and other noisy data. SIAM J. Comput. 38, 97–139 (2008). [Google Scholar]
- 17.Nissim K., Smith A. D., Steinke T., Stemmer U., Ullman J., “The limits of post-selection generalization” in Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18), Bengio S., Wallach H. M., Larochelle H., Grauman K., Cesa-Bianchi N., Eds. (Curran Associates Inc, Red Hook, NY, 2018), vol. 31, pp. 6402–6411. [Google Scholar]
- 18.Dwork C., Naor M., On the difficulties of disclosure prevention in statistical databases or the case for differential privacy. J. Priv. Confid. 2, 93–107 (2010). [Google Scholar]
- 19.Dwork C., Kenthapadi K., McSherry F., Mironov I., Naor M., “Our data, ourselves: Privacy via distributed noise generation” in Annual International Conference on the Theory and Applications of Cryptographic Techniques (Springer, Cham, Switzerland, 2006), pp. 486–503. [Google Scholar]
- 20.Nissim K., Stemmer U., On the generalization properties of differential privacy. arXiv:1504.05800 (22 April 2015).
- 21.Machanavajjhala A., Kifer D., Gehrke J., Venkitasubramaniam M., “L-diversity: Privacy beyond k-anonymity” in 22nd International Conference on Data Engineering (ICDE’06) (IEEE Computer Society, Washington, DC, 2007), p. 1. [Google Scholar]
- 22.Li N., Li T., Venkatasubramanian S., “t-closeness: Privacy beyond k-anonymity and l-diversity” in Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, Chirkova R., Dogac A., Özsu M. T., Sellis T. K., Eds. (IEEE Computer Society, Washington, DC, 2007), pp. 106–115. [Google Scholar]