Abstract
This work studies the problem of privacy-preserving classification – namely, learning a classifier from sensitive data while preserving the privacy of individuals in the training set. In particular, the learning algorithm is required to guarantee differential privacy, a very strong notion of privacy that has gained significant attention in recent years.
A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? We address this question in the context of learning with infinite hypothesis classes when the data is drawn from a continuous distribution. We first show that even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible, as previously shown by Kasiviswanathan et al. (2008).
We then consider two approaches to differentially private learning that get around this lower bound. The first approach is to use prior knowledge about the unlabeled data distribution in the form of a reference distribution 𝒰 chosen independently of the sensitive data. Given such a reference 𝒰, we provide an upper bound on the sample requirement that depends (among other things) on a measure of closeness between 𝒰 and the unlabeled data distribution. Our upper bound applies to the non-realizable as well as the realizable case. The second approach is to relax the privacy requirement, by requiring only label privacy – namely, that only the labels (and not the unlabeled parts of the examples) be considered sensitive information. An upper bound on the sample requirement of learning with label privacy was shown by Chaudhuri et al. (2006); in this work, we show a lower bound.
Keywords: Privacy, generalization, PAC-learning
1. Introduction
As increasing amounts of personal data are collected, stored, and mined by companies and government agencies, the question of how to learn from sensitive datasets while still maintaining the privacy of individuals in the data has become very important. Over the last few years, the notion of differential privacy (Dwork et al., 2006) has received a significant amount of attention, and has become the de facto standard for privacy-preserving computation. In this paper, we study the problem of learning a classifier from a dataset while simultaneously guaranteeing differential privacy of the training data.
The key issue in differentially-private computation is that given a certain amount of resources, there is usually a tradeoff between privacy and utility. In classification, a natural measure of utility is the classification accuracy, and data is a scarce resource. Thus, a key question in differentially-private learning is: how many examples does a learning algorithm need to guarantee a certain level of privacy and accuracy? In this paper, we study this question from an information-theoretic perspective – namely, we are concerned with the sample complexity, and not the computational complexity of the learner.
This question was first considered by Kasiviswanathan et al. (2008), who studied the case of finite hypothesis classes, as well as the case of discrete data domains. They showed that in these two cases, one can obtain any given privacy guarantee and generalization error, regardless of the unlabeled data distribution, with a modest increase in the worst-case sample requirement.
In this paper, we consider the sample complexity of differentially private learning in the context of infinite hypothesis classes on continuous data distributions. This is a very general class of learning problems, and includes many popular machine-learning tasks such as learning linear classifiers when the examples have real-valued features, which cannot be modeled by finite hypothesis classes or hypothesis classes over discrete data domains.
Surprisingly, we show that the results of Kasiviswanathan et al. (2008) do not extend to infinite hypothesis classes on continuous data distributions. As an example, consider the class of thresholds on the unit interval. This simple learning problem has VC dimension 1, and thus for all unlabeled data distributions, it can be learnt (non-privately) with error ε given at most Õ(1/ε) examples. We show that even for this very simple hypothesis class, any algorithm that uses a bounded number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions.
The key intuition behind our proof is that if most of the unlabeled data is concentrated in a small region around the best classifier, then even slightly perturbing the best classifier will result in a large classification error. Since the process of ensuring differential privacy necessarily involves some perturbation (see, for example, Dwork et al., 2006), unless the algorithm has some prior public knowledge about the data distribution, the number of samples required to learn privately grows with the concentration of the data around the best classifier.
How can we then learn privately in infinite hypothesis classes over continuous data distributions? One approach is to use some prior information about the data distribution that is known independently of the sensitive data. Another approach is to relax the privacy requirements. In this paper, we examine both approaches.
First, we consider the case when the learner has access to some prior information on the unlabeled data. In particular, the learner knows a reference distribution 𝒰 that is close to the unlabeled data distribution. Similar assumptions are common in Bayesian learning, and PAC-Bayes style bounds have also been studied in the learning theory literature, for example, by McAllester (1998).

Under this assumption, we provide an algorithm for learning with α-privacy, excess generalization error ε, and confidence 1 − δ, whose sample requirement depends on α, ε, δ, the doubling dimension d𝒰 of the disagreement metric of the hypothesis class with respect to 𝒰 (Bshouty et al., 2009), and a smoothness parameter κ that we define. Here α is a privacy parameter (a lower α implies a stronger privacy guarantee). The quantity d𝒰 measures the complexity of the hypothesis class with respect to 𝒰 (see Bshouty et al. (2009) for a discussion), and we assume that it is finite. The smoothness parameter κ measures how close the unlabeled data distribution is to 𝒰 (smaller κ means closer), and is motivated by notions of closeness used in Dasgupta (2005) and Freund et al. (1997). Thus the sample requirement of our algorithm grows with increasing distance between 𝒰 and the unlabeled data distribution. Our algorithm works in the non-realizable case, that is, when no hypothesis in the class has zero error; using standard techniques, a slightly better bound can be obtained in the realizable setting. However, like the results of Kasiviswanathan et al. (2008), our algorithm is computationally inefficient in general.
The main difficulty in extending the differentially-private learning algorithms of Kasiviswanathan et al. (2008) to infinite hypothesis classes on continuous data distributions is in finding a suitable finite cover of the class with respect to the unlabeled data. This issue is specific to our particular problem: for non-private learning, a finite cover can always be computed based on the (sensitive) data, and for finite hypothesis classes, the entire class is a cover. The main insight behind our upper bound is that when the unlabeled distribution 𝒟 is close to the reference distribution 𝒰, then a cover with respect to 𝒰 is also a (possibly coarser) cover with respect to 𝒟. Since a cover with respect to 𝒰 can be computed privately, independently of the sensitive data, we simply compute a finer cover with respect to 𝒰, and learn over this fine cover using standard techniques such as the exponential mechanism (McSherry and Talwar, 2007).
Next we relax the privacy requirement by requiring only label privacy. In other words, we assume that the unlabeled parts of the examples are not sensitive, and that the only private information is in the labels. This setting was considered by Chaudhuri et al. (2006). An example where this may be applicable is predicting income from public demographic information: while the label (income) is private, the demographic information of individuals, such as education, gender, and age, may be public.
In this case, we provide lower bounds to characterize the sample requirement of label-private learning. We show two results, based on the values of α and ε. For small ε and α (that is, for high privacy and accuracy), we show that any learning algorithm for a given hypothesis class that guarantees α-label privacy and accuracy ε necessarily requires at least Ω(d/(εα)) examples. Here d is the doubling dimension of the disagreement metric at a certain scale, and is a measure of the complexity of the hypothesis class on the unlabeled data distribution. This bound holds when the hypothesis class has finite VC dimension. For larger α and ε, our bounds are weaker but more general; we show a lower bound of Ω(d′/α) on the sample requirement that holds for any α and ε, and does not require the VC dimension of the hypothesis class to be finite. Here d′ is the doubling dimension of the disagreement metric at a (different) scale.
The main idea behind our stronger label privacy lower bounds is to show that differentially private learning algorithms necessarily perform poorly when there is a large set of hypotheses such that every pair in the set labels approximately 1/α examples differently. We then show that such large sets can be constructed when the doubling dimension of the disagreement metric of the hypothesis class with respect to the data distribution is high.
How do these results fit into the context of non-private learning? For non-private learning, sample complexity bounds based on the doubling dimension of the disagreement metric have been studied extensively by Bshouty et al. (2009); in the realizable case, they show an upper bound for learning with accuracy ε in terms of d̄, where d̄ is again the doubling dimension of the disagreement metric at a certain scale. These bounds are incomparable to ours in general, as the doubling dimensions in the two bounds are taken at different scales; however, we can compare them for hypothesis classes and data distributions for which the doubling dimension of the disagreement metric is the same at all scales. An example is learning half-spaces with respect to the uniform distribution on the sphere. For such problems, on the upper bound side, we need more examples to learn with α-privacy, by a factor that grows with 1/α and with the smoothness parameter κ. On the other hand, our lower bounds indicate that for small α and ε, even if we only require α-label privacy, the sample requirement can exceed the non-private upper bound by a factor that grows with 1/α.
Finally, one may be tempted to think that we can always discretize a data domain or a hypothesis class, and therefore in practice we are likely to only learn finite hypothesis classes or over discrete data domains. However, there are several issues with such discretization. First, if we discretize either the hypothesis class or the data, then the sample requirement of differentially private learning algorithms will grow as the discretization grows finer, instead of depending on intrinsic properties of the problem. Second, as our α-privacy lower bound example shows, indiscriminate discretization without prior knowledge of the data can drastically degrade the performance of the best classifier in a class. Finally, infinite hypothesis classes and continuous data domains provide a natural abstraction for designing many machine learning algorithms, such as those based on convex optimization or differential geometry. Understanding the limitations of differentially private learning on such hypothesis classes and data domains is useful in designing differentially private approximations to these algorithms.
The rest of our paper is organized as follows. In Section 2, we define some preliminary notation, and explain our privacy model. In Section 3, we present our α-privacy lower bound. Our α-privacy upper bound is provided in Section 4. In Section 5, we provide some lower bounds on the sample requirement of learning with α-label privacy. Finally, the proofs of most of our results are in the appendix.
1.1. Related work
The work most closely related to ours is that of Kasiviswanathan et al. (2008), Blum et al. (2008), and Beimel et al. (2010), each of which deals with either finite hypothesis classes or discrete data domains.
Kasiviswanathan et al. (2008) initiated the study of the sample requirement of differentially-private learning. They provided a (computationally inefficient) α-private algorithm that, in the realizable case, learns any finite hypothesis class ℋ with error at most ε using a number of examples proportional to log |ℋ| (with polynomial dependence on 1/α, 1/ε, and log(1/δ)). For the non-realizable case, they provided an algorithm with a somewhat larger sample requirement. Moreover, using a result from Blum et al. (2008), they provided a computationally inefficient α-private algorithm that learns a hypothesis class with VC dimension V over the data domain {−1, 1}ⁿ using a number of examples polynomial in V, n, 1/α, and 1/ε. The latter result does not apply when the data is drawn from a continuous distribution; moreover, their results cannot be directly extended to the continuous case.
The first work to study lower bounds on the sample requirement of differentially private learning was Beimel et al. (2010). They show that any α-private algorithm that selects a hypothesis from a specific set Cε requires at least Ω̃(log(|Cε|)/α) samples to achieve error ε. Here Cε is an ε-cover as well as an ε-packing of the hypothesis class ℋ with respect to every distribution over the discrete data domain. They also show an upper bound of Õ(log(|Cε|)/(αε)). Such a cover Cε does not exist for continuous data domains; as a result, their upper bounds do not apply to our setting. Moreover, unlike our lower bounds, their lower bound applies only to algorithms of a specific form (namely, those that output a hypothesis in Cε), and it does not apply when we only require the labels to be private.
For the setting of label privacy, Chaudhuri et al. (2006) show an upper bound for PAC-learning in terms of the VC dimension of the hypothesis class. We show a result very similar to theirs in the appendix for completeness, and we show lower bounds for learning with label-privacy which indicate that their bounds are almost tight, in terms of the dependence on α and ε.
Zhou et al. (2009) study some issues in defining differential privacy when dealing with continuous outcomes; however, they do not consider the question of learning classifiers on such data.
Finally, a lot of our work uses tools from the theory of generalization bounds. In particular, some of our upper and lower bounds are inspired by Bshouty et al. (2009), which bounds the sample complexity of (non-private) classification in terms of the doubling dimension of the disagreement metric.
Other related work on privacy
The issue of privacy in the analysis of sensitive data has long been a challenge for curators of such data, largely because many simple and intuitive mechanisms designed to protect privacy turn out to be ineffective. For instance, the work of Narayanan and Shmatikov (2008) showed that an anonymized dataset released by Netflix revealed enough information so that an adversary, by knowing just a few of the movies rated by a particular user, would be able to uniquely identify that user in the dataset and determine all of his movie ratings. Similar attacks have been demonstrated on private data in other domains as well, including social networks (Backstrom et al., 2007) and search engine query logs (Jones et al., 2007). Even releasing coarse statistics without proper privacy safeguards can be problematic. This was recently shown by Wang et al. (2009) in the context of genetic data, where a correlation matrix of genetic markers compiled from a group of individuals contained enough clues to uniquely pinpoint individuals in the dataset and learn their private information, such as whether or not they had certain diseases.
In order to reason about privacy guarantees (or lack thereof), we need a formal definition of what it means to preserve privacy. In our work, we adopt the notion of differential privacy due to Dwork et al. (2006), which has over the last few years gained much popularity. Differential privacy is known to be a very strong notion of privacy: it has strong semantic guarantees (Kasiviswanathan and Smith, 2008) and is resistant to attacks that many earlier privacy definitions are susceptible to (Ganta et al., 2008b).
There has been a significant amount of work on differential privacy applied to a wide variety of data analysis tasks (Dwork et al., 2006; Chaudhuri and Mishra, 2006; Nissim et al., 2007; Barak et al., 2007; McSherry and Mironov, 2009). Some work that is relevant to ours includes Blum et al. (2008), which provides a general method for publishing datasets on discrete data domains while preserving differential privacy, so that the answers to queries from a function class with bounded VC dimension are approximately preserved after applying the sanitization procedure. More work along this line includes Roth (2010) and Gupta et al. (2011). A number of learning algorithms have also been suitably modified to guarantee differential privacy. For instance, both the class of statistical query algorithms and the class of methods based on L2-regularized empirical risk minimization with certain types of convex losses can be made differentially private (Blum et al., 2005; Chaudhuri et al., 2011).
There has also been some prior work on lower bounds for the loss of accuracy that any differentially private mechanism must suffer; much of this work is in the context of releasing answers to a set of queries made on a database of n individuals. The first such work is by Blum et al. (2008), which shows that no differentially private mechanism can release, with a certain amount of accuracy, the answers to a number of median queries when the data lies on the real line. This result is similar in spirit to our Theorem 5, but applies to a much harder problem, namely data release. Other relevant work includes Hardt and Talwar (2010), which uses a packing argument similar to ours to provide a lower bound on the amount of noise any differentially private mechanism needs to add to the answers to k linear queries on a database of n people.
There has also been a significant amount of prior work on privacy-preserving data mining (Agrawal and Srikant, 2000; Evfimievski et al., 2003; Sweeney, 2002; Machanavajjhala et al., 2006), which spans several communities and uses privacy models other than differential privacy. Many of the models used have been shown to be susceptible to various attacks, such as composition attacks, where the adversary has some amount of prior knowledge (Ganta et al., 2008a). An alternative line of privacy work is in the Secure Multiparty Computation setting due to Yao (1982), where the sensitive data is split across several adversarial databases, and the goal is to compute a function on the union of these databases. This is in contrast with our setting, where a single centralized algorithm can access the entire dataset.
2. Preliminaries
2.1. Privacy model
We use the differential privacy model of Dwork et al. (2006). In this model, a private database DB ⊆ 𝒯 consists of m sensitive entries from a domain 𝒯; each entry in DB is a record about an individual (e.g., their medical history) that one wishes to keep private.
The database DB is accessed by users through a sanitizer M. The sanitizer, a randomized algorithm, is said to preserve differential privacy if the value of any one individual in the database does not significantly alter the output distribution of M.
Definition 1
A randomized mechanism M guarantees α-differential privacy if, for all databases DB1 and DB2 that differ in the value of at most one individual, and for every set G of possible outputs of M,
Pr[M(DB1) ∈ G] ≤ e^α · Pr[M(DB2) ∈ G].
We emphasize that the probability in the definition above is only with respect to the internal randomization of the algorithm; it is independent of all other random sources, including any that may have generated the values of the input database.
Differential privacy is a strong notion of privacy (Dwork et al., 2006; Kasiviswanathan and Smith, 2008; Ganta et al., 2008b). In particular, if a sanitizer M ensures α-differential privacy, then, an adversary who knows the private values of all the individuals in the database except for one and has arbitrary prior knowledge about the value of the last individual, cannot gain additional confidence about the private value of the last individual by observing the output of a differentially private sanitizer. The level of privacy is controlled by α, where a lower value of α implies a stronger guarantee of privacy.
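The following short Python sketch (ours, not from the paper) illustrates Definition 1 on a simple counting query using the standard Laplace mechanism of Dwork et al. (2006); the function names and the example databases are ours and purely illustrative.

```python
import numpy as np

def count_over_threshold(db, threshold):
    """A simple query: how many individuals in db have a value above threshold."""
    return sum(1 for record in db if record > threshold)

def laplace_mechanism(db, query, sensitivity, alpha, rng):
    """Release query(db) with Laplace noise of scale sensitivity/alpha.

    Changing one individual's record changes the count by at most `sensitivity`,
    and adding Laplace(sensitivity/alpha) noise yields alpha-differential privacy.
    """
    return query(db) + rng.laplace(loc=0.0, scale=sensitivity / alpha)

rng = np.random.default_rng(0)
db1 = [2.1, 0.4, 3.7, 1.9]   # neighboring databases: they differ in one record
db2 = [2.1, 0.4, 3.7, 0.2]
alpha = 0.5
# The output distributions on db1 and db2 differ pointwise by a factor of at
# most exp(alpha), which is exactly the guarantee in Definition 1.
print(laplace_mechanism(db1, lambda d: count_over_threshold(d, 1.0), 1, alpha, rng))
print(laplace_mechanism(db2, lambda d: count_over_threshold(d, 1.0), 1, alpha, rng))
```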
2.2. Learning model
We consider a standard probabilistic learning model for binary classification. Let 𝒫 be a distribution over 𝒳 × {±1}, where 𝒳 is the data domain and {±1} are the possible labels. We use 𝒟 to denote the marginal of 𝒫 over the data domain 𝒳. The classification error of a hypothesis h: 𝒳 → {±1} with respect to the data distribution 𝒫 is
err(h) := Pr(x,y)~𝒫[h(x) ≠ y].
We denote by S ~ 𝒫^m an i.i.d. draw of m labeled examples S = {(x1, y1), …, (xm, ym)} ⊆ 𝒳 × {±1} from the distribution 𝒫. This process can equivalently be seen as drawing an unlabeled sample X := {x1, …, xm} from the marginal 𝒟, and then, for each x ∈ X, drawing the corresponding label y from the conditional distribution induced by 𝒫.
A learning algorithm is given as input a set of m labeled examples S ~ 𝒫^m, a target accuracy parameter ε ∈ (0, 1), and a target confidence parameter δ ∈ (0, 1). Its goal is to return a hypothesis h: 𝒳 → {±1} whose excess generalization error with respect to a specified hypothesis class ℋ (that is, err(h) − inf h′∈ℋ err(h′)) is at most ε, with probability at least 1 − δ over the random choice of the sample S ~ 𝒫^m and any internal randomness of the algorithm.
We also occasionally adopt the realizable assumption (with respect to 𝒫). The realizable assumption states that there exists some h* ∈ ℋ such that Pr(x,y)~𝒫[h*(x) ≠ y] = 0. In this case, the excess generalization error of a hypothesis h is simply its classification error. Without the realizable assumption, there may be no classifier in the hypothesis class ℋ with zero classification error, and we refer to this as the non-realizable case.
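The following Python sketch (ours, purely illustrative) shows the learning model concretely for the class of thresholds on [0, 1] defined in Section 3: examples are drawn i.i.d. from a distribution 𝒫 with noisy labels, and the excess generalization error of a hypothesis is estimated as its error minus the best error in the class.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_sample(m, true_threshold=0.3, noise=0.1):
    """Draw m labeled examples: x ~ Uniform[0,1], label = sign(x - threshold),
    with each label flipped independently with probability `noise`
    (so the problem is non-realizable with respect to the threshold class)."""
    x = rng.uniform(0.0, 1.0, size=m)
    y = np.where(x >= true_threshold, 1, -1)
    flip = rng.uniform(size=m) < noise
    y[flip] = -y[flip]
    return x, y

def classification_error(w, x, y):
    """Estimate err(h_w) on (x, y): fraction of points where h_w(x) != y."""
    predictions = np.where(x >= w, 1, -1)
    return np.mean(predictions != y)

x, y = draw_sample(10000)
best = min(classification_error(w, x, y) for w in np.linspace(0, 1, 201))
err_h = classification_error(0.45, x, y)
print("excess error of h_0.45:", err_h - best)   # err(h) - min_h' err(h')
```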
2.3. Privacy-preserving classification
In privacy-preserving classification, we assume that the database is a training dataset drawn i.i.d. from some data distribution 𝒫, and that the sanitization mechanism is a learning algorithm that outputs a classifier based on the training data. In this paper, we consider two possible privacy requirements on our learning algorithms.
Definition 2
A randomized learning algorithm 𝒜 guarantees α-label privacy (𝒜 is α-label private) if, for any two datasets S1 = {(x1, y1), …, (xm−1, ym−1), (xm, ym)} and S2 = {(x1, y1), …, (xm−1, ym−1), (xm, y′m)} differing in at most one label, and any set of outputs G of 𝒜,
Pr[𝒜(S1) ∈ G] ≤ e^α · Pr[𝒜(S2) ∈ G].
Definition 3
A randomized learning algorithm 𝒜 guarantees α-privacy (𝒜 is α-private) if, for any two datasets S1 = {(x1, y1), …, (xm−1, ym−1), (xm, ym)} and S2 = {(x1, y1), …, (xm−1, ym−1), (x′m, y′m)} differing in at most one example, and any set of outputs G of 𝒜,
Pr[𝒜(S1) ∈ G] ≤ e^α · Pr[𝒜(S2) ∈ G].
Note that if the input dataset S is a random variable, then for any value S′ ⊆ 𝒳 × {±1} in the range of S, the conditional probability distribution of 𝒜(S) | S = S′ is determined only by the algorithm 𝒜 and the value S′; it is independent of the distribution of the random variable S. Therefore, for instance, Pr[𝒜(S) ∈ G | S = S′] = Pr[𝒜(S′) ∈ G] for any S′ ⊆ 𝒳 × {±1} and any set of outputs G.
The difference between the two notions of privacy is that for α-label privacy, the two databases can differ only in the label of one example, whereas for α-privacy, the two databases can differ in a complete example (both its labeled and unlabeled parts). Thus, α-label privacy only ensures the privacy of the label component of each example; it makes no guarantees about the unlabeled part. If a classification algorithm guarantees α-privacy, then it also guarantees α-label privacy; thus α-label privacy is a weaker notion of privacy than α-privacy.
The notion of label privacy was also considered by Chaudhuri et al. (2006), who provided an algorithm for learning with label privacy. For strict privacy, one would require the learning algorithm to guarantee α-privacy; however, label privacy may also be a useful notion. For example, if the data x represents public demographic information (e.g., age, zip code, education) while the label y represents income level, an individual may consider the label to be private but may not mind if others can infer her demographic information (which could be relatively public already) from her inclusion in the database.
Thus, the goal of an α-private (resp. α-label private) learning algorithm is as follows. Given a dataset S of size m, a privacy parameter α, a target accuracy ε, and a target confidence parameter δ:
- guarantee α-privacy (resp. α-label privacy) of the training dataset S;
- with probability at least 1 − δ over both the random choice of S ~ 𝒫^m and the internal randomness of the algorithm, return a hypothesis h: 𝒳 → {±1} with excess generalization error at most ε.
2.4. Additional definitions and notation
We now present some additional essential definitions and notation.
Metric spaces, doubling dimension, covers, and packings
A metric space (𝒵, ρ) is a pair, where 𝒵 is a set of elements and ρ is a distance function from 𝒵 × 𝒵 to {0} ∪ ℝ+. Let (𝒵, ρ) be an arbitrary metric space. For any z ∈ 𝒵 and r > 0, let B(z, r) = {z′ ∈ 𝒵 : ρ(z, z′) ≤ r} denote the ball centered at z of radius r.
The diameter of (𝒵, ρ) is sup{ρ(z, z′): z, z′ ∈ 𝒵}, the longest distance in the space. An ε-cover of (𝒵, ρ) is a set C ⊆ 𝒵 such that for all z ∈ 𝒵, there exists some z′ ∈ C such that ρ(z, z′) ≤ ε. An ε-packing of (𝒵, ρ) is a set P ⊆ 𝒵 such that ρ(z, z′) > ε for all distinct z, z′ ∈ P. Let Nε(𝒵, ρ) denote the size of the smallest ε-cover of (𝒵, ρ).
We define the doubling dimension of (𝒵, ρ) at scale ε, denoted ddimε(𝒵, ρ), as the smallest number d such that each ball B(z, ε) ⊆ 𝒵 of radius ε can be covered by at most ⌊2^d⌋ balls of radius ε/2, i.e., there exist z1, …, z⌊2^d⌋ ∈ 𝒵 such that B(z, ε) ⊆ B(z1, ε/2) ∪ … ∪ B(z⌊2^d⌋, ε/2). Notice that ddimε(𝒵, ρ) may increase or decrease with ε. The doubling dimension of (𝒵, ρ) is sup{ddimr(𝒵, ρ): r > 0}.
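The following Python sketch (ours, not from the paper) shows the finite analogue of these definitions: a greedy construction of a set that is simultaneously an ε-packing and an ε-cover, in the spirit of Lemma 13 in Appendix A. The point set and metric here are illustrative choices.

```python
def greedy_epsilon_net(points, dist, eps):
    """Greedily build a subset C of `points` that is an eps-packing
    (pairwise distances > eps) and an eps-cover (every point of `points`
    is within eps of some element of C)."""
    net = []
    for p in points:
        # keep p only if it is more than eps away from everything kept so far
        if all(dist(p, q) > eps for q in net):
            net.append(p)
    return net

# Example: thresholds on [0,1] under the disagreement metric induced by the
# uniform distribution, which for thresholds s and t is simply |s - t|.
thresholds = [i / 100.0 for i in range(101)]
net = greedy_epsilon_net(thresholds, lambda s, t: abs(s - t), eps=0.1)
print(net)   # roughly 0.0, 0.11, 0.22, ...: a 0.1-packing that is also a 0.1-cover
```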
Disagreement metrics
The disagreement metric of a hypothesis class ℋ with respect to a data distribution 𝒟 over 𝒳 is the metric space (ℋ, ρ𝒟), where ρ𝒟 is the distance function
ρ𝒟(h, h′) := Prx~𝒟[h(x) ≠ h′(x)].
The empirical disagreement metric of a hypothesis class ℋ with respect to an unlabeled dataset X ⊆ 𝒳 is the metric space (ℋ, ρX), where ρX is the distance function
ρX(h, h′) := (1/|X|) Σx∈X 𝟙[h(x) ≠ h′(x)].
The disagreement metric (resp. empirical disagreement metric) measures the proportion of unlabeled examples on which h and h′ disagree with respect to 𝒟 (resp. the uniform distribution over X). We use the notation B𝒟(h, r) to denote the ball centered at h of radius r with respect to ρ𝒟, and BX(h, r) to denote the ball centered at h of radius r with respect to ρX.
Datasets and empirical error
For an unlabeled dataset X ⊆ 𝒳 and a hypothesis h: 𝒳 → {±1}, we denote by SX,h := {(x, h(x)): x ∈ X} the labeled dataset induced by labeling X with h. The empirical error of a hypothesis h: 𝒳 → {±1} with respect to a labeled dataset S ⊆ 𝒳 × {±1} is err(h, S) := (1/|S|) Σ(x,y)∈S 𝟙[h(x) ≠ y], the average number of mistakes that h makes on S; note that ρX(h, h′) = err(h, SX,h′). Finally, we informally use the Õ(·) notation to hide log(1/δ) factors, as well as factors that are logarithmic in those that do appear.
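These definitions translate directly into code; the short Python sketch below (ours, illustrative only) computes SX,h, the empirical error err(h, S), and the empirical disagreement ρX(h, h′), and checks the identity ρX(h, h′) = err(h, SX,h′) stated above.

```python
import numpy as np

def label_with(h, X):
    """S_{X,h}: the dataset obtained by labeling the unlabeled points X with h."""
    return [(x, h(x)) for x in X]

def empirical_error(h, S):
    """err(h, S): the fraction of labeled examples in S that h gets wrong."""
    return np.mean([h(x) != y for (x, y) in S])

def empirical_disagreement(h1, h2, X):
    """rho_X(h1, h2): the fraction of points in X on which h1 and h2 disagree."""
    return np.mean([h1(x) != h2(x) for x in X])

h_a = lambda x: 1 if x >= 0.3 else -1     # two thresholds on [0, 1]
h_b = lambda x: 1 if x >= 0.5 else -1
X = np.random.default_rng(2).uniform(0, 1, size=1000)
print(empirical_disagreement(h_a, h_b, X))        # roughly 0.2
print(empirical_error(h_a, label_with(h_b, X)))   # the same number
```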
3. Lower bounds for learning with α-privacy
In this section, we show a lower bound on the sample requirement of learning with α-privacy. In particular, we show an example that illustrates that when the data is drawn from a continuous distribution, for any M, all α-private algorithms that are supplied with at most M examples fail to output a good classifier for at least one unlabeled data distribution.
Our example hypothesis class is the class of thresholds on [0, 1]. This simple class has VC dimension 1, and thus can be learnt non-privately with classification error ε given only Õ(1/ε) examples, regardless of the unlabeled data distribution. However, Theorem 5 shows that even in the realizable case, for every α-private algorithm that is given a bounded number of examples, there is at least one unlabeled data distribution on which the learning algorithm produces a classifier with classification error at least 1/4, with probability at least 1/2 over its own random coins.
The key intuition behind our example is that if most of the unlabeled data is concentrated in a small region around the best classifier, then, even slightly perturbing the best classifier will result in a large classification error. As the process of ensuring differential privacy necessarily involves some perturbation, unless the algorithm has some prior public knowledge about the data distribution, the number of samples required to learn privately grows with growing concentration of the data around the best classifier. As illustrated by our theorem, this problem is not alleviated if the support of the unlabeled distribution is known; even if the data distribution has large support, a large fraction of the data can still lie in a region close to the best classifier.
Before we describe our example in detail, we first need a definition.
Definition 4
The class of thresholds on the unit interval is the class ℋthresh of functions hw: [0, 1] → {−1, 1}, for w ∈ [0, 1], such that hw(x) = 1 if x ≥ w, and hw(x) = −1 otherwise.
Theorem 5
Let M > 2 be any number, and let ℋthresh be the class of thresholds on the unit interval [0, 1]. For any α-private algorithm A that outputs a hypothesis h ∈ ℋthresh, there exists a distribution 𝒫 on labeled examples with the following properties:
There exists a threshold h* ∈ ℋthresh with classification error 0 with respect to 𝒫.
For all samples S of size m ≤ M drawn from 𝒫, with probability at least 1/2 over the random coins of A, the hypothesis output by A(S) has classification error at least 1/4 with respect to 𝒫.
The marginal 𝒟 of 𝒫 over the unlabeled data has support [0, 1].
Proof
Let η := 1/(6 + 4 exp(αM)), and let 𝒰[0,1] denote the uniform distribution over [0, 1]. Let Z = {η, 2η, …, Kη}, where K = ⌊1/η⌋ − 1. We let Gz = [z − η/3, z + η/3] for z ∈ Z, and let ℋGz ⊂ ℋthresh be the subset of thresholds ℋGz = {hτ : τ ∈ Gz}. We note that Gz ⊆ [0, 1] for all z ∈ Z.
For each z ∈ Z, we define a distribution 𝒟z over labeled examples as follows. First, we describe the marginal of 𝒟z over the unlabeled data. With probability 1/2, x is drawn from 𝒰[0,1]; with probability 1/2, it is drawn uniformly from Gz. An unlabeled example x drawn in this manner is labeled positive if x ≥ z, and negative otherwise. We observe that for every such distribution 𝒟z, there exists a threshold, namely hz, that has classification error 0; in addition, the support of the marginal of 𝒟z is [0, 1]. Moreover, there are K such distributions 𝒟z in all, and K ≥ 4 exp(αM).
We say that an α-private algorithm A succeeds on a sample S with respect to a distribution 𝒟z if, with probability more than 1/2 over the random coins of A, the hypothesis output by A(S) has classification error less than 1/4 with respect to 𝒟z.
Suppose for the sake of contradiction that there exists an α-private algorithm A* such that for every distribution 𝒟z, there is at least one sample S of size ≤ M drawn from 𝒟z such that A* succeeds on S with respect to 𝒟z. Then, for every z ∈ Z, there exists a sample Sz of size m ≤ M drawn from 𝒟z such that A* succeeds on Sz with respect to 𝒟z.
By construction, the Gz’s are disjoint, so for any fixed z′ ∈ Z,
Σz∈Z Pr[A*(Sz′) ∈ ℋGz] ≤ 1. (1)
Furthermore, any Sz differs from Sz′ in at most m labeled examples, so because A* is α-private, Lemma 22 implies that for any z′,
Pr[A*(Sz′) ∈ ℋGz] ≥ e^{−αm} · Pr[A*(Sz) ∈ ℋGz]. (2)
If A*(Sz) lies outside ℋGz, it classifies at least a 1/4 fraction of the examples from 𝒟z incorrectly, and thus A* cannot succeed on Sz with respect to 𝒟z. Therefore, by the assumption on A*, for every z ∈ Z,
Pr[A*(Sz) ∈ ℋGz] > 1/2. (3)
Combining Equations (1), (2), and (3) gives the inequality
1 ≥ Σz∈Z Pr[A*(Sz′) ∈ ℋGz] ≥ e^{−αm} Σz∈Z Pr[A*(Sz) ∈ ℋGz] > (K/2) e^{−αm}.
Since m ≤ M and K ≥ 4 exp(αM), the quantity on the RHS of the above inequality is more than 1, which is impossible. A* therefore does not succeed on Sz with respect to 𝒟z for some z, thus leading to a contradiction.
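To make the intuition behind this construction concrete, the Python sketch below (ours, not part of the paper) builds the concentrated distributions 𝒟z for a few values of z that fall between grid points and runs one particular α-private learner — the exponential mechanism over a fixed grid of thresholds, chosen purely for illustration — on a small sample. Because the grid is fixed without knowledge of where the data concentrates, every candidate threshold misclassifies about half of the concentrated region, so the private learner's error stays near 1/4. This is only a simulation of one algorithm, not a proof for all algorithms as in Theorem 5.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_Dz(m, z, eta):
    """Sample m labeled points: half the mass uniform on [0,1], half uniform on
    the short interval [z - eta/3, z + eta/3]; labels given by the threshold at z."""
    from_band = rng.uniform(size=m) < 0.5
    x = np.where(from_band,
                 rng.uniform(z - eta / 3, z + eta / 3, size=m),
                 rng.uniform(0.0, 1.0, size=m))
    y = np.where(x >= z, 1, -1)
    return x, y

def private_threshold_learner(x, y, alpha, grid):
    """Illustrative alpha-private learner: exponential mechanism over a fixed
    grid of thresholds, scored by the number of mistakes on the sample."""
    mistakes = np.array([np.sum(np.where(x >= t, 1, -1) != y) for t in grid])
    logits = -alpha * mistakes / 2.0
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return grid[rng.choice(len(grid), p=p)]

def error_on_Dz(t, z, eta, n=200000):
    x, y = sample_Dz(n, z, eta)
    return np.mean(np.where(x >= t, 1, -1) != y)

alpha, m, eta = 0.5, 50, 1e-3
grid = np.linspace(0, 1, 101)
for z in (0.2503, 0.5007, 0.7504):   # concentration points off the grid
    x, y = sample_Dz(m, z, eta)
    t_hat = private_threshold_learner(x, y, alpha, grid)
    print(z, round(error_on_Dz(t_hat, z, eta), 3))
```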
4. Upper bounds for learning with α-privacy
In this section, we show an upper bound on the sample requirement of learning with α-privacy by presenting a learning algorithm that works on infinite hypothesis classes over continuous data domains, under certain conditions on the hypothesis class and the data distribution. Our algorithm works in the non-realizable case, that is, when there may be no hypothesis in the target hypothesis class with zero classification error.
A natural way to extend the algorithm of Kasiviswanathan et al. (2008) to an infinite hypothesis class ℋ is to compute a suitable finite subset G of ℋ that contains a hypothesis with low excess generalization error, and then use the exponential mechanism of McSherry and Talwar (2007) on G. To ensure that a hypothesis with low error is indeed in G, we would like G to be an ε-cover of the disagreement metric (ℋ, ρ𝒟). In non-private or label-private learning, we can compute such a G directly from the unlabeled training examples; in our setting, the training examples themselves are sensitive, and this approach does not directly apply.
The key idea behind our algorithm is that instead of using the sensitive data to compute G, we can use a reference distribution 𝒰 that is known independently of the sensitive data. For instance, if the domain of the unlabeled data is bounded, then a reasonable choice for 𝒰 is the uniform distribution over the domain. Our key observation is that if 𝒰 is close to the unlabeled data distribution 𝒟 according to a certain measure of closeness inspired by Dasgupta (2005) and Freund et al. (1997), then a cover of the disagreement metric with respect to 𝒰 is a (possibly coarser) cover of the disagreement metric with respect to 𝒟. Thus we can set G to be a fine cover of (ℋ, ρ𝒰), and this cover can be computed privately as it is independent of the sensitive data.
Our algorithm works when the doubling dimension of (ℋ, ρ𝒰) is finite; under this condition, there is always such a finite cover G. We note that this is a fairly weak condition that is satisfied by many hypothesis classes and data distributions. For example, any hypothesis class with finite VC dimension satisfies this condition for any choice of reference distribution 𝒰.
Finally, it may be tempting to think that one can further improve the sample requirement of our algorithm by using the sensitive data to privately refine a cover of (ℋ, ρ𝒰) into a cover of (ℋ, ρ𝒟). However, our calculations show that naively refining such a cover leads to a much higher sample requirement.
We now define our notion of closeness.
Definition 6
We say that a data distribution 𝒟 is κ-smooth with respect to a distribution 𝒰, for some κ ≥ 1, if for all measurable sets A ⊆ 𝒳,
Prx~𝒟[x ∈ A] ≤ κ · Prx~𝒰[x ∈ A].
This notion of smoothness is very similar to, but weaker than, the notions of closeness between distributions used by Dasgupta (2005) and Freund et al. (1997). We note that if 𝒟 has a bounded density with respect to 𝒰 (in particular, 𝒟 is absolutely continuous with respect to 𝒰, so 𝒰 assigns zero probability to a set only if 𝒟 does also), then 𝒟 is κ-smooth with respect to 𝒰 for some finite κ.
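Definition 6 takes a supremum over all measurable sets, which cannot be computed exactly in general; the Python sketch below (ours, a heuristic only) estimates a crude lower bound on κ for distributions on [0, 1] by comparing the masses that 𝒟 and 𝒰 place on a fixed grid of intervals. The distributions and grid here are illustrative assumptions.

```python
import numpy as np

def estimate_smoothness(sample_D, sample_U, bins=50, n=200000):
    """Heuristically estimate the smoothness of D w.r.t. U by comparing
    histogram masses on a grid of intervals (a lower bound on the true kappa,
    since Definition 6 ranges over all measurable sets)."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    d_mass, _ = np.histogram(sample_D(n), bins=edges)
    u_mass, _ = np.histogram(sample_U(n), bins=edges)
    ratios = d_mass / np.maximum(u_mass, 1)
    return ratios.max()

rng = np.random.default_rng(4)
uniform = lambda n: rng.uniform(0, 1, size=n)
# D concentrates half its mass on a short interval, as in the Section 3 example.
def concentrated(n):
    band = rng.uniform(size=n) < 0.5
    return np.where(band, rng.uniform(0.49, 0.51, size=n), rng.uniform(0, 1, size=n))

print(estimate_smoothness(uniform, uniform))        # about 1: U is 1-smooth w.r.t. itself
print(estimate_smoothness(concentrated, uniform))   # large: D is far from U
```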
4.1. Algorithm
Our main algorithm 𝒜 is given in Figure 1. The first step of the algorithm calculates the distance scale at which it should construct a cover of (ℋ, ρ𝒰). This scale is a function of |S|, the size of the input dataset S, and can be computed privately because |S| is not sensitive information. A suitable cover of (ℋ, ρ𝒰) that is also a suitable packing of (ℋ, ρ𝒰) is then constructed; such a set always exists by Lemma 13. In the final step, the exponential mechanism (McSherry and Talwar, 2007) is used to select a hypothesis with low empirical error from the cover. As this step of the algorithm is the only one that uses the input data, the algorithm is α-private as long as this last step guarantees α-privacy.
Figure 1.
Learning algorithm for α-privacy.
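The following Python sketch (ours, a simplification and not the paper's Figure 1) illustrates the structure just described: a cover of (ℋ, ρ𝒰) is built using only samples from the public reference distribution 𝒰, and the exponential mechanism then selects a hypothesis from the cover, scored by its mistakes on the sensitive sample S. The representation of the class by a finite candidate list, the number of 𝒰-samples, and the way κ̂ is supplied are illustrative placeholders; see Figure 1 and Theorem 8 for the actual settings.

```python
import numpy as np

rng = np.random.default_rng(5)

def disagreement_wrt_U(h1, h2, u_sample):
    """Approximate rho_U(h1, h2) using public samples drawn from U."""
    return np.mean([h1(x) != h2(x) for x in u_sample])

def cover_of_hypotheses(candidates, u_sample, scale):
    """Greedy packing/cover of the candidate hypotheses at the given scale,
    computed only from U-samples, so it is independent of the sensitive data."""
    cover = []
    for h in candidates:
        if all(disagreement_wrt_U(h, g, u_sample) > scale for g in cover):
            cover.append(h)
    return cover

def exponential_mechanism(cover, S, alpha):
    """Select g in the cover with probability proportional to
    exp(-alpha * (#mistakes of g on S) / 2); this is the only step that touches
    the sensitive data, and it is what provides alpha-privacy."""
    mistakes = np.array([sum(g(x) != y for (x, y) in S) for g in cover])
    logits = -alpha * mistakes / 2.0
    p = np.exp(logits - logits.max())
    return cover[rng.choice(len(cover), p=p / p.sum())]

def private_learner(S, sample_U, candidates, alpha, eps, kappa_hat):
    u_sample = sample_U(5000)                        # public: uses U only
    cover = cover_of_hypotheses(candidates, u_sample, eps / (4 * kappa_hat))
    return exponential_mechanism(cover, S, alpha)    # private: uses S
```

For instance, taking the candidates to be thresholds on a fine grid of [0, 1] and sample_U to draw from the uniform distribution on [0, 1] recovers the setting of Example 1 below.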
4.2. Privacy and learning guarantees
Our first theorem states the privacy guarantee of Algorithm 𝒜.
Theorem 7
Algorithm 𝒜 preserves α-privacy.
Proof
The algorithm only accesses the private dataset S in its final step. Because changing one labeled example in S changes |S| · err(g, S) by at most 1, this step guarantees α-privacy (McSherry and Talwar, 2007).
The next theorem provides an upper bound on the sample requirement of Algorithm 𝒜. This bound depends on the doubling dimension d𝒰 of (ℋ, ρ𝒰) and the smoothness parameter κ, as well as the privacy and learning parameters α, ε, and δ.
Theorem 8
Let 𝒫 be a distribution over 𝒳 × {±1} whose marginal over 𝒳 is 𝒟. There exists a universal constant C > 0 such that for any α, ε, δ ∈ (0, 1), the following holds. If
- the doubling dimension d𝒰 of (ℋ, ρ𝒰) is finite,
- 𝒟 is κ-smooth with respect to 𝒰, and
- S ⊆ 𝒳 × {±1} is an i.i.d. random sample from 𝒫 such that
|S| ≥ C ( (d𝒰 log(κ/ε) + log(1/δ))/ε² + (d𝒰 log(κ/ε) + log(1/δ))/(αε) ), (4)
then with probability at least 1 − δ, the hypothesis h ∈ ℋ returned by 𝒜(S, 𝒰, α, ε, δ) satisfies
Pr(x,y)~𝒫[h(x) ≠ y] ≤ min h′∈ℋ Pr(x,y)~𝒫[h′(x) ≠ y] + ε.
The proof of Theorem 8 is stated in Appendix C. If we have prior knowledge that some hypothesis in ℋ has zero error (the realizability assumption), then the sample requirement can be improved with a slightly modified version of Algorithm 𝒜. This modified algorithm is given in Figure 2 in Appendix C.
Figure 3.
Learning algorithm for α-label privacy.
Theorem 9
Let 𝒫 be any probability distribution over 𝒳 × {±1} whose marginal over 𝒳 is 𝒟. There exists a universal constant C > 0 such that for any α, ε, δ ∈ (0, 1), the following holds. If
- the doubling dimension d𝒰 of (ℋ, ρ𝒰) is finite,
- 𝒟 is κ-smooth with respect to 𝒰,
- S ⊆ 𝒳 × {±1} is an i.i.d. random sample from 𝒫 such that
|S| ≥ C (d𝒰 log(κ/ε) + log(1/δ))/(αε), (5)
- there exists h* ∈ ℋ with Pr(x,y)~𝒫[h*(x) ≠ y] = 0,
then with probability at least 1 − δ, the hypothesis h ∈ ℋ returned by the realizable variant of 𝒜 on input (S, 𝒰, α, ε, δ) satisfies Pr(x,y)~𝒫[h(x) ≠ y] ≤ ε.
Again, the proof of Theorem 9 is in Appendix C.
4.3. Examples
In this section, we give some examples that illustrate the sample requirement of Algorithm 𝒜.
First, we consider the example from the lower bound given in the proof of Theorem 5.
Example 1
The domain of the data is 𝒳 := [0, 1], and the hypothesis class is ℋ := ℋthresh = {ht: t ∈ [0, 1]} (recall, ht(x) = 1 if and only if x ≥ t). A natural choice for the reference distribution 𝒰 is the uniform distribution over [0, 1]; the doubling dimension of (ℋthresh, ρ𝒰) is 1 because every interval can be covered by two intervals of half the length. Fix some M > 0 and α ∈ (0, 1), and let η := 1/(6 + 4 exp(αM)). For z ∈ [η, 1 − η], let 𝒟z be the distribution on [0, 1] that places probability 1/2 uniformly over [0, 1] and probability 1/2 uniformly over the interval [z − η/3, z + η/3] (the marginal of the distribution used in the proof of Theorem 5).
Clearly, 𝒟z is κ-smooth with respect to 𝒰 for κ = O(1/η) = O(exp(αM)). Therefore the sample requirement of Algorithm 𝒜 to learn with α-privacy and excess generalization error ε is at most the bound in Equation (4) with d𝒰 = 1 and this value of κ, which is Õ(M) for constant ε, matching the lower bound from Theorem 5 up to constants.
Next, we consider two examples in which the domain of the unlabeled data is the unit sphere in ℝ^n, 𝒳 := 𝕊^{n−1} = {x ∈ ℝ^n : ‖x‖ = 1}, and the target hypothesis class ℋ := ℋlin is the class of linear separators that pass through the origin in ℝ^n, ℋlin = {hw : w ∈ 𝕊^{n−1}}, where hw(x) = 1 if and only if w · x ≥ 0. The two examples consider two different unlabeled data distributions over 𝒳.

A natural reference distribution in this setting is the uniform distribution over 𝕊^{n−1}; this will be our reference distribution 𝒰. It is known that d𝒰 := sup{ddimr(ℋlin, ρ𝒰): r ≥ 0} = O(n) (Bshouty et al., 2009).
Example 2
We consider a case where the unlabeled data distribution 𝒟 is concentrated near an equator of 𝕊^{n−1}. More formally, for some vector u ∈ 𝕊^{n−1} and γ ∈ (0, 1), we let 𝒟 be uniform over W := {x ∈ 𝕊^{n−1}: |u · x| ≤ γ}; in other words, the unlabeled data lies in a small band of width γ around the equator.

By Lemma 20 (see Appendix C), 𝒟 is κ-smooth with respect to 𝒰 for κ = 1/(1 − 2 exp(−nγ²/2)). Thus the sample requirement of Algorithm 𝒜 to learn with α-privacy and excess generalization error ε is at most the bound in Equation (4) with d𝒰 = O(n) and this value of κ. When n is large and γ is a constant, κ = O(1), and this bound is Õ(n) for constant ε and α, where the Õ notation hides factors logarithmic in 1/δ and 1/ε.
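As a quick numerical sanity check of the smoothness constant used above (ours, not from the paper), one can estimate Prx~𝒰[|u · x| ≤ γ] by sampling from the uniform distribution on the sphere and compare it with the concentration bound of Lemma 20:

```python
import numpy as np

def uniform_on_sphere(n_dim, n_samples, rng):
    """Sample uniformly from the unit sphere by normalizing Gaussian vectors."""
    g = rng.normal(size=(n_samples, n_dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(6)
n, gamma = 200, 0.2
x = uniform_on_sphere(n, 100000, rng)
u = np.zeros(n); u[0] = 1.0
band_mass = np.mean(np.abs(x @ u) <= gamma)
print(band_mass, 1 - 2 * np.exp(-n * gamma**2 / 2))
# band_mass is close to 1, so kappa = 1 / band_mass is close to 1, as Lemma 20 predicts.
```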
Example 3
Now we consider the case where the unlabeled data lies on two diametrically opposite spherical caps. More formally, for some vector u ∈ 𝕊^{n−1} and γ ∈ (0, 1), we now let 𝒟 be uniform over 𝕊^{n−1} \ W, where W := {x ∈ 𝕊^{n−1}: |u · x| ≤ γ}; in other words, the unlabeled data lies outside a band of width γ around the equator.

By Lemma 21 (see Appendix C), 𝒟 is κ-smooth with respect to 𝒰 for κ = (2/(1 − γ))^{(n−1)/2}. Thus, for large n and constant γ < 1, the sample requirement of Algorithm 𝒜 to learn with α-privacy and excess generalization error ε grows only polynomially (roughly quadratically) in n. So, even though the smoothness parameter κ is exponential in the dimension n, the sample requirement remains polynomial in n.
5. Lower bounds for learning with α-label privacy
In this section, we provide two lower bounds on the sample complexity of learning with α-label privacy. Our first lower bound holds when α and ε are small (that is, high privacy and high accuracy), and when the hypothesis class has bounded VC dimension V. If these conditions hold, then we show a lower bound of Ω(d/(εα)), where d is the doubling dimension of the disagreement metric (ℋ, ρ𝒟) at a certain scale.
The main idea behind our bound is to show that differentially private learning algorithms necessarily perform poorly when there is a large set of hypotheses such that every pair in the set labels approximately 1/α examples differently. We then show that such large sets can be constructed when the doubling dimension of the disagreement metric (ℋ, ρ𝒟) is high.
5.1. Main results
Theorem 10
There exists a constant c > 0 such that the following holds. Let ℋ be a hypothesis class with VC dimension V < ∞, 𝒟 be a distribution over 𝒳, X be an i.i.d. sample from 𝒟 of size m, and 𝒜 be a learning algorithm that guarantees α-label privacy and outputs a hypothesis in ℋ. Let d := ddim12ε(ℋ, ρ𝒟) > 2, and d′ := inf{ddim12r(ℋ, ρ𝒟): ε ≤ r < Δ/6} > 2. If
ε ≤ min{c, Δ/2}, α ≤ c·d′/(V log(1/ε)), and m ≤ c·d/(αε),
where Δ is the diameter of (ℋ, ρ𝒟), then there exists a hypothesis h* ∈ ℋ such that with probability at least 1/8 over the random choice of X and the internal randomness of 𝒜, the hypothesis h returned by 𝒜(SX,h*) has classification error greater than ε.
We note that the conditions on α and ε can be relaxed by replacing the VC dimension with other (possibly distribution-dependent) quantities that control the uniform convergence of ρX to ρ𝒟; we use a distribution-free parameter here to simplify the argument. Moreover, the condition on ε can be relaxed to ε < c for some constant c ∈ (0, 1), provided that there exists a lower bound of Ω(V/ε) for (non-privately) learning ℋ under the distribution 𝒟.
The proof of Theorem 10, which is in Appendix D, relies on the following lemma (possibly of independent interest) which gives a lower bound on the empirical error of the hypothesis returned by an α-label private learning algorithm.
Lemma 11
Let X ⊆ 𝒳 be an unlabeled dataset of size m, ℋ be a hypothesis class, 𝒜 be a learning algorithm that guarantees α-label privacy, and s > 0. Pick any h0 ∈ ℋ. If P is an s-packing of BX(h0, 4s) ⊆ ℋ, and
m < log((|P|/2) − 1)/(8αs),
then there exists a subset Q ⊆ P such that
|Q| ≥ |P|/2;
for all h ∈ Q, Pr𝒜[𝒜(SX,h) ∉ BX(h, s/2)] ≥ 1/2.
The proof of Lemma 11 is in Appendix D. The next theorem shows a lower bound without restrictions on ε and α. Moreover, this bound also applies when the VC dimension of the hypothesis class is unbounded. However, we note that this bound is weaker in that it does not involve a 1/ε factor, where ε is the accuracy parameter.
Theorem 12
Let ℋ be a hypothesis class, 𝒟 be a distribution over 𝒳, X be an i.i.d. sample from 𝒟 of size m, and 𝒜 be a learning algorithm that guarantees α-label privacy and outputs a hypothesis in ℋ. Let d″ := ddim4ε(ℋ, ρ𝒟) ≥ 1. If ε ≤ Δ/2 and
m ≤ (d″ − 1) log(2)/α,
where Δ is the diameter of (ℋ, ρ𝒟), then there exists h* ∈ ℋ such that with probability at least 1/2 over the random choice of X and the internal randomness of 𝒜, the hypothesis h returned by 𝒜(SX,h*) has classification error greater than ε.
In other words, any α-label private algorithm for learning a hypothesis in ℋ with error at most ε ≤ Δ/2 must use at least (d″ − 1) log(2)/α examples. Theorem 12 uses ideas similar to those in Beimel et al. (2010), but the result is stronger in that it applies to α-label privacy and continuous data domains. A detailed proof is provided in Appendix D.
5.2. Example: linear separators in ℝn
In this section, we show an example that illustrates our label privacy lower bounds. Our example hypothesis class ℋ := ℋlin is the class of linear separators over ℝ^n that pass through the origin, and the unlabeled data distribution 𝒟 is the uniform distribution over the unit sphere 𝕊^{n−1}. By Lemma 25 (see Appendix D), the doubling dimension of (ℋlin, ρ𝒟) at any scale r is at least n − 2. Therefore Theorem 10 implies that if α and ε are small enough, any α-label private algorithm 𝒜 that correctly learns all hypotheses h ∈ ℋlin with error ≤ ε requires at least Ω(n/(αε)) examples. (In fact, the condition on ε can be relaxed to ε ≤ c for some constant c ∈ (0, 1), because Ω(n) examples are needed even to non-privately learn in this setting (Long, 1995).) We also observe that this bound is essentially tight (except for logarithmic factors): as the doubling dimension of (ℋlin, ρ𝒟) is O(n), in the realizable case, the realizable variant of Algorithm 𝒜 (Figure 2) with 𝒰 := 𝒟 learns linear separators with α-label privacy given Õ(n/(αε)) examples.
Figure 2.
Learning algorithm for α-privacy under the realizable assumption.
Acknowledgments
KC would like to thank NIH U54 HL108460 for research support. DH was partially supported by AFOSR FA9550-09-1-0425, NSF IIS-1016061, and NSF IIS-713540.
Appendix A. Metric spaces
Lemma 13 (Kolmogorov and Tikhomirov, 1961)
For any metric space (𝒵, ρ) with diameter Δ, and any ε ∈ (0, Δ), there exists an ε-packing of (𝒵, ρ) that is also an ε-cover.
Lemma 14 (Gupta, Krauthgamer, and Lee, 2003)
For any ε > 0 and r > 0, if a metric space (𝒵, ρ) has doubling dimension d and z ∈ 𝒵, then every ε-packing of (B(z, r), ρ) has cardinality at most (4r/ε)^d.
Lemma 15
Let (𝒵, ρ) be a metric space with diameter Δ, and r ∈ (0, 2Δ). If ddimr(𝒵, ρ) ≥ d, then there exists z ∈ 𝒵 such that B(z, r) has an (r/2)-packing of size at least 2^d.
Proof
Fix r ∈ (0, 2Δ) and a metric space (𝒵, ρ) with diameter Δ. Suppose that for every z ∈ 𝒵, every (r/2)-packing of B(z, r) has size less than 2^d. For each z ∈ 𝒵, let Pz be an (r/2)-packing of (B(z, r), ρ) that is also an (r/2)-cover — this is guaranteed to exist by Lemma 13. Therefore, for each z ∈ 𝒵, B(z, r) ⊆ ∪z′∈Pz B(z′, r/2), and |Pz| < 2^d. This implies that ddimr(𝒵, ρ) is less than d.
Appendix B. Uniform convergence
Lemma 16 (Vapnik and Chervonenkis, 1971)
Let ℱ be a family of measurable functions f: 𝒵 → {0, 1} over a space 𝒵 with distribution 𝒟. Denote by E[f] the expectation of f under 𝒟, and by ÊZ[f] the empirical average of f over a finite subset Z ⊆ 𝒵. Let εm := (4/m)(log(Sℱ(2m)) + log(4/δ)), where Sℱ(n) is the n-th VC shatter coefficient of ℱ. Let Z be an i.i.d. sample of size m from 𝒟. With probability at least 1 − δ, for all f ∈ ℱ,
E[f] ≤ ÊZ[f] + √(ÊZ[f]·εm) + εm.
Also, with probability at least 1 − δ, for all f ∈ ℱ,
ÊZ[f] ≤ E[f] + √(E[f]·εm) + εm.
Lemma 17
Let ℋ be a hypothesis class with VC dimension V. Fix any δ ∈ (0, 1), and let X be an i.i.d. sample of size m ≥ V/2 from 𝒟. Let εm := (8V log(2em/V) + 4 log(4/δ))/m. With probability at least 1 − δ, for all pairs of hypotheses {h, h′} ⊆ ℋ,
ρ𝒟(h, h′) ≤ ρX(h, h′) + √(ρX(h, h′)·εm) + εm.
Also, with probability at least 1 − δ, for all pairs of hypotheses {h, h′} ⊆ ℋ,
ρX(h, h′) ≤ ρ𝒟(h, h′) + √(ρ𝒟(h, h′)·εm) + εm.
Proof
This is an immediate consequence of Lemma 16 as applied to the function class ℱ := {x ↦ 𝟙[h(x) ≠ h′(x)]: h, h′ ∈ ℋ}, which has VC shatter coefficients Sℱ(2m) ≤ Sℋ(2m)² ≤ (2em/V)^{2V} by Sauer’s Lemma.
Appendix C. Proofs from Section 4
C.1. Some lemmas
We first give two simple lemmas. The first one, Lemma 18, states some basic properties of the exponential mechanism.
Lemma 18 (McSherry and Talwar, 2007)
Let I be a finite set of indices, and let ai ∈ ℝ for all i ∈ I. Define the probability distribution p:= (pi: i ∈ I) where pi ∝ exp(−ai) for all i ∈ I. If j ∈ I is drawn at random according to p, then the following holds for any element i0 ∈ I and any t ∈ ℝ.
Let i ∈ I. If ai ≥ t, then Prj~p[j = i] ≤ exp(−(t − ai0)).
Prj~p[aj ≥ ai0+ t] ≤ |I| exp(−t).
Proof
Fix any i0 ∈ I and t ∈ ℝ. To show the first part of the lemma, note that for any i ∈ I with ai ≥ t, we have
Prj~p[j = i] = exp(−ai)/Σi′∈I exp(−ai′) ≤ exp(−ai)/exp(−ai0) ≤ exp(−(t − ai0)).
For the second part, we apply the inequality from the first part to all i ∈ I such that ai ≥ ai0 + t, so
Prj~p[aj ≥ ai0 + t] ≤ Σ{i: ai ≥ ai0 + t} exp(−(ai0 + t − ai0)) ≤ |I| exp(−t).
The next lemma is a consequence of the smoothness relation between the distributions 𝒟 and 𝒰.
Lemma 19
If 𝒟 is κ-smooth with respect to 𝒰, then for all ε > 0, every ε-cover of (ℋ, ρ𝒰) is a κε-cover of (ℋ, ρ𝒟).
Proof
Suppose C is an ε-cover of (ℋ, ρ𝒰). Then, for any h ∈ ℋ, there exists h′ ∈ C such that ρ𝒰(h, h′) ≤ ε. Fix such a pair h, h′, and let A := {x ∈ 𝒳 : h(x) ≠ h′(x)} be the subset of 𝒳 on which h and h′ disagree. As 𝒟 is κ-smooth with respect to 𝒰, by the definition of smoothness,
ρ𝒟(h, h′) = Prx~𝒟[x ∈ A] ≤ κ · Prx~𝒰[x ∈ A] = κ · ρ𝒰(h, h′) ≤ κε,
and thus C is a κε-cover of (ℋ, ρ𝒟).
C.2. Proof of Theorem 8
First, because of the lower bound on m := |S| from (4), the value κ̂ computed in the first step of the algorithm must satisfy κ̂ ≥ κ. Therefore, 𝒟 is also κ̂-smooth with respect to 𝒰. Combining this with Lemma 19, since G is an (ε/4κ̂)-cover of (ℋ, ρ𝒰), G is an (ε/4)-cover of (ℋ, ρ𝒟). Moreover, as G is also an (ε/4κ̂)-packing of (ℋ, ρ𝒰), Lemma 14 implies that the cardinality of G is at most |G| ≤ (16κ̂/ε)^{d𝒰}.
Define err(h) := Pr(x,y)~𝒫[h(x) ≠ y]. Suppose that h* ∈ ℋ minimizes err(h) over h ∈ ℋ. Let g0 ∈ G be an element of G such that ρ𝒟(h*, g0) ≤ ε/4; g0 exists as G is an (ε/4)-cover of (ℋ, ρ𝒟). By the triangle inequality, we have that
err(g0) ≤ err(h*) + ρ𝒟(h*, g0) ≤ err(h*) + ε/4. (6)
Let E be the event that maxg∈G |err(g) − err(g, S)| > ε/4, and let Ē be its complement. By Hoeffding’s inequality, a union bound, and the lower bound on |S|, we have that for a large enough value of the constant C in Equation (4), Pr[E] ≤ δ/2.
Let h denote the hypothesis returned by the algorithm. In the event Ē, we have err(h) ≤ err(h, S) + ε/4 and err(g0, S) ≤ err(g0) + ε/4, because both h and g0 are in G. Therefore,
Pr[err(h) > err(h*) + ε] ≤ Pr[E] + Pr[err(h, S) > err(g0, S) + ε/4 | Ē] ≤ δ/2 + |G| exp(−α|S|ε/8) ≤ δ.
Here, the first step follows from (6) and the definition of Ē, and the final inequalities follow from Lemma 18 (using ag = α|S| err(g, S)/2 for g ∈ G), the upper bound on |G|, and the lower bound on m in (4).
C.3. Proof of Theorem 9
The proof is very similar to the proof of Theorem 8.
First, because of the lower bound on m := |S| from (5), the value κ̂ computed in the first step of the algorithm must satisfy κ̂ ≥ κ. Therefore, 𝒟 is also κ̂-smooth with respect to 𝒰. Combining this with Lemma 19, as G is an (ε/4κ̂)-cover of (ℋ, ρ𝒰), G is an (ε/4)-cover of (ℋ, ρ𝒟). Moreover, as G is also an (ε/4κ̂)-packing of (ℋ, ρ𝒰), Lemma 14 implies that the cardinality of G is at most |G| ≤ (16κ̂/ε)^{d𝒰}.
Define err(h) := Pr(x,y)~𝒫[h(x) ≠ y]. Suppose that h* ∈ ℋ minimizes err(h) over h ∈ ℋ. Recall that by the realizability assumption, err(h*) = 0. Let g0 ∈ G be an element of G such that ρ𝒟(h*, g0) ≤ ε/4; g0 exists as G is an (ε/4)-cover of (ℋ, ρ𝒟). By the triangle inequality, we have that
err(g0) ≤ err(h*) + ρ𝒟(h*, g0) ≤ ε/4. (7)
We define two events E1 and E2. Let G′ ⊂ G be the set of all g ∈ G for which err(g) ≥ ε. The event E1 is the event that ming∈G′ err(g, S) > 9ε/10, and let Ē1 be its complement. Applying a multiplicative Chernoff bound, for a specific g ∈ G′,
Pr[err(g, S) ≤ 9ε/10] ≤ exp(−c|S|ε) for an absolute constant c > 0.
The quantity on the right-hand side is at most δ/(4|G|) for a large enough constant C in Equation (5). Applying a union bound over all g ∈ G′, we get that
Pr[Ē1] ≤ δ/4. (8)
We define E2 as the event that err(g0, S) ≤ 3ε/4, and Ē2 as its complement. From a standard multiplicative Chernoff bound, together with err(g0) ≤ ε/4, if |S| ≥ (3/ε) log(4/δ), which is the case due to Equation (5), then
Pr[Ē2] ≤ δ/4. (9)
Let h denote the hypothesis returned by the algorithm. Therefore, we have
Pr[err(h) ≥ ε] ≤ Pr[Ē1] + Pr[Ē2] + Pr[err(h, S) > err(g0, S) + 3ε/20 | E1, E2] ≤ δ/4 + δ/4 + |G| exp(−3α|S|ε/40) ≤ δ.
Here, the first inequality uses the definitions of E1 and E2 (in the event E1 ∩ E2, any g with err(g, S) ≤ 9ε/10 has err(g) < ε), the second uses Equations (8) and (9) together with Lemma 18 (using ag = α|S| err(g, S)/2 for g ∈ G), and the final step uses the bound on |G| and Equation (5).
C.4. Examples
Lemma 20
Let 𝒰 be uniform over the unit sphere 𝕊^{n−1}, and let 𝒟 be defined as in Example 2. Then, 𝒟 is κ-smooth with respect to 𝒰 for κ = 1/(1 − 2 exp(−nγ²/2)).
Proof
From (Ball, 1997), we know that Prx~𝒰[x ∈ W] ≥ 1 − 2 exp(−nγ²/2). Thus, for any set A ⊆ 𝕊^{n−1}, we have
Prx~𝒟[x ∈ A] = Prx~𝒰[x ∈ A | x ∈ W] ≤ Prx~𝒰[x ∈ A] / Prx~𝒰[x ∈ W] ≤ Prx~𝒰[x ∈ A] / (1 − 2 exp(−nγ²/2)).
This means 𝒟 is κ-smooth with respect to 𝒰 for κ = 1/(1 − 2 exp(−nγ²/2)).
Lemma 21
Let 𝒰 be uniform over the unit sphere 𝕊^{n−1}, and let 𝒟 be defined as in Example 3. Then, 𝒟 is κ-smooth with respect to 𝒰 for κ = (2/(1 − γ))^{(n−1)/2}.
Proof
From (Ball, 1997), we know that Prx~𝒰[x ∈ 𝕊^{n−1} \ W] = Prx~𝒰[x ∉ W] ≥ ((1 − γ)/2)^{(n−1)/2}. Therefore, for any A ⊆ 𝕊^{n−1}, we have
Prx~𝒟[x ∈ A] = Prx~𝒰[x ∈ A | x ∉ W] ≤ Prx~𝒰[x ∈ A] / Prx~𝒰[x ∉ W] ≤ ((1 − γ)/2)^{−(n−1)/2} · Prx~𝒰[x ∈ A].
This means 𝒟 is κ-smooth with respect to 𝒰 for κ = (2/(1 − γ))^{(n−1)/2}.
Appendix D. Proofs from Section 5
D.1. Some lemmas
Lemma 22
Let S := {(x1, y1), …, (xm, ym)} ⊆ 𝒳 × {±1} be a labeled dataset of size m, α ∈ (0, 1), and k ≥ 0.
- If a learning algorithm 𝒜 guarantees α-privacy and outputs a hypothesis from ℋ, then for all datasets S′ that agree with S on at least |S| − k examples, and for all G ⊆ ℋ,
Pr[𝒜(S) ∈ G] ≤ e^{αk} · Pr[𝒜(S′) ∈ G].
- If a learning algorithm 𝒜 guarantees α-label privacy and outputs a hypothesis from ℋ, then for all datasets S′ that differ from S only in the labels and agree with S on at least |S| − k labels, and for all G ⊆ ℋ,
Pr[𝒜(S) ∈ G] ≤ e^{αk} · Pr[𝒜(S′) ∈ G].
Proof
We prove just the first part, as the second part is similar. For a labeled dataset S′ that differs from S in at most k examples, there exists a sequence of datasets S(0), …, S(ℓ) with ℓ ≤ k such that S(0) = S′, S(ℓ) = S, and S(j) differs from S(j+1) in exactly one example for 0 ≤ j < ℓ. In this case, if 𝒜 guarantees α-privacy, then for all G ⊆ ℋ and each such j,
Pr[𝒜(S(j+1)) ∈ G] ≤ e^α · Pr[𝒜(S(j)) ∈ G],
and therefore Pr[𝒜(S) ∈ G] ≤ e^{αℓ} · Pr[𝒜(S′) ∈ G] ≤ e^{αk} · Pr[𝒜(S′) ∈ G].
Lemma 23
There exists a constant C > 1 such that the following holds. Let ℋ be a hypothesis class with VC dimension V, and let 𝒟 be a distribution over 𝒳. Fix any r ∈ (0, 1), and let X be an i.i.d. sample of size m from 𝒟. If
m ≥ (CV/r) log(C/r),
then the following holds with probability at least 1/2:
every pair of hypotheses {h, h′} ⊆ ℋ for which ρX(h, h′) > 2r has ρ𝒟(h, h′) > r;
for all h0 ∈ ℋ, every (6r)-packing of (B𝒟(h0, 12r), ρ𝒟) is a (4r)-packing of (BX(h0, 16r), ρX).
Proof
This is a consequence of Lemma 17. To show the first part, we apply Lemma 17 with εm = r/2.
To show the second part, we use two applications of Lemma 17. Let h and h′ be any two hypotheses in any (6r)-packing of (B𝒟(h0, 12r), ρ𝒟); we first use Lemma 17 with εm = r/3 to show that for all such h and h′, ρX(h, h′) > 4r. Next we need to show that every h in any (6r)-packing of (B𝒟(h0, 12r), ρ𝒟) has ρX(h, h0) ≤ 16r; we show this through a second application of Lemma 17 with εm = r/3.
D.2. Proof of Theorem 12
We prove the contrapositive: if ε ≤ Δ/2 and PrX~𝒟,𝒜[𝒜(SX,h*) ∈ B𝒟(h*, ε)] > 1/2 for all h* ∈ ℋ, then m > log(2^{d″−1})/α. So pick any ε ≤ Δ/2. By Lemma 15, there exist h0 ∈ ℋ and P ⊆ ℋ such that P is a (2ε)-packing of (B𝒟(h0, 4ε), ρ𝒟) of size at least 2^{d″}. For any h, h′ ∈ P such that h ≠ h′, we have B𝒟(h, ε) ∩ B𝒟(h′, ε) = ∅ by the triangle inequality. Therefore for any h ∈ P and any X′ ⊆ 𝒳 of size m,
Pr𝒜[𝒜(SX′,h) ∉ B𝒟(h, ε)] ≥ Σh′∈P\{h} Pr𝒜[𝒜(SX′,h) ∈ B𝒟(h′, ε)] ≥ Σh′∈P\{h} e^{−αm} · Pr𝒜[𝒜(SX′,h′) ∈ B𝒟(h′, ε)],
where the second inequality follows by Lemma 22 because SX′,h and SX′,h′ can differ in at most (all) m labels. Now integrating both sides with respect to X′ ~ 𝒟^m shows that if PrX~𝒟,𝒜[𝒜(SX,h*) ∈ B𝒟(h*, ε)] > 1/2 for all h* ∈ ℋ, then for any h ∈ P,
1/2 > PrX~𝒟,𝒜[𝒜(SX,h) ∉ B𝒟(h, ε)] ≥ (|P| − 1) · e^{−αm} · (1/2),
which in turn implies m > log(|P| − 1)/α ≥ log(2^{d″} − 1)/α ≥ log(2^{d″−1})/α, as d″ is always ≥ 1.
D.3. Proof of Lemma 11
Let h0 ∈ ℋ and P be an s-packing of BX(h0, 4s) ⊆ ℋ. Say the algorithm 𝒜 is good for h if Pr𝒜[𝒜(SX,h) ∈ BX(h, s/2)] ≥ 1/2. Note that 𝒜 is not good for h ∈ P if and only if Pr𝒜[𝒜(SX,h) ∉ BX(h, s/2)] > 1/2. Therefore, it suffices to show that if 𝒜 is good for at least |P|/2 hypotheses in P, then m ≥ log((|P|/2) − 1)/(8αs).
By the triangle inequality and the fact that P is an s-packing, BX(h, s/2) ∩ BX(h′, s/2) = ∅ for all h, h′ ∈ P such that h ≠ h′. Therefore for any h ∈ P,
Pr𝒜[𝒜(SX,h) ∉ BX(h, s/2)] ≥ Σh′∈P\{h} Pr𝒜[𝒜(SX,h) ∈ BX(h′, s/2)].
Moreover, for all h, h′ ∈ P, we have ρX(h, h′) ≤ ρX(h0, h) + ρX(h0, h′) ≤ 8s by the triangle inequality, so SX,h and SX,h′ differ in at most 8sm labels. Therefore Lemma 22 implies
Pr𝒜[𝒜(SX,h) ∈ BX(h′, s/2)] ≥ e^{−8αsm} · Pr𝒜[𝒜(SX,h′) ∈ BX(h′, s/2)]
for all h, h′ ∈ P. If 𝒜 is good for at least |P|/2 hypotheses h′ ∈ P, then for any h ∈ P such that 𝒜 is good for h, we have
1/2 ≥ Pr𝒜[𝒜(SX,h) ∉ BX(h, s/2)] ≥ ((|P|/2) − 1) · e^{−8αsm} · (1/2),
which in turn implies m ≥ log((|P|/2) − 1)/(8αs).
D.4. Proof of Theorem 10
We need the following lemma.
Lemma 24
There exists a constant C > 1 such that the following holds. Let ℋ be a hypothesis class with VC dimension V, 𝒟 be a distribution over 𝒳, X be an i.i.d. sample from 𝒟 of size m, 𝒜 be a learning algorithm that guarantees α-label privacy and outputs a hypothesis in ℋ, and Δ be the diameter of (ℋ, ρ𝒟). If r ∈ (0, Δ/6) and
(CV/r) log(C/r) ≤ m ≤ log(2^{d−1} − 1)/(32αr),
where d := ddim12r(ℋ, ρ𝒟), then there exists a hypothesis h* ∈ ℋ such that
PrX~𝒟,𝒜[ρ𝒟(h*, 𝒜(SX,h*)) > r] ≥ 1/8.
Proof
First, assume r and m satisfy the conditions in the lemma statement, where C is the constant from Lemma 23. Also, let h0 ∈ ℋ and P ⊆ ℋ be such that P is a (6r)-packing of (B𝒟(h0, 12r), ρ𝒟) of size |P| ≥ 2^d; the existence of such an h0 and P is guaranteed by Lemma 15.
We first define some events in the sample space of X and 𝒜. For each h ∈ ℋ and each sample X, let E1(h, X) be the event that 𝒜(SX,h) makes more than 2rm mistakes on SX,h (i.e., ρX(h, 𝒜(SX,h)) > 2r).
Given a sample X, let φ(X) be a 0/1 random variable which is 1 exactly when the following conditions hold:
every pair of hypotheses {h, h′} ⊆ ℋ for which ρX(h, h′) > 2r has ρ𝒟(h, h′) > r; and
for all h0 ∈ ℋ, every (6r)-packing of (B𝒟(h0, 12r), ρ𝒟) is a (4r)-packing of (BX(h0, 16r), ρX)
(i.e., the conclusion of Lemma 23). Note that conditioned on E1(h, X) and φ(X) = 1, we have ρX(h, 𝒜(SX,h)) > 2r and thus ρ𝒟(h, 𝒜(SX,h)) > r, so PrX~𝒟,𝒜[E1(h, X), φ(X) = 1] ≤ PrX~𝒟,𝒜[ρ𝒟(h, 𝒜(SX,h)) > r]. Therefore it suffices to show that there exists h* ∈ ℋ such that PrX~𝒟,𝒜[E1(h*, X), φ(X) = 1] ≥ 1/8.
The lower bound on m and Lemma 23 ensure that
PrX~𝒟[φ(X) = 1] ≥ 1/2. (10)
Also, if the unlabeled sample X is such that φ(X) = 1, then the set P is a (4r)-packing of (BX(h0, 16r), ρX). Therefore, the upper bound on m and Lemma 11 (with s = 4r) imply that for every such X, there exists Q(X) ⊆ P of size at least |P|/2 such that Pr𝒜[E1(h, X)] ≥ 1/2 for all h ∈ Q(X). In other words, for every X with φ(X) = 1,
(1/|P|) Σh∈P Pr𝒜[E1(h, X)] ≥ 1/4. (11)
Combining Equations (10) and (11) gives
(1/|P|) Σh∈P PrX~𝒟,𝒜[E1(h, X), φ(X) = 1] = EX[φ(X) · (1/|P|) Σh∈P Pr𝒜[E1(h, X)]] ≥ (1/4) · EX[φ(X)] ≥ 1/8.
Here the first step uses that φ(X) is a 0/1 random variable, the second step uses Equation (11), and the last step uses Equation (10).
Therefore there exists some h* ∈ P such that PrX~𝒟,𝒜[E1(h*, X), φ(X) = 1] ≥ 1/8.
Proof [Proof of Theorem 10]
Assume that ε and α satisfy the bounds in the theorem statement, in which C is the constant from Lemma 24. The proof is by case analysis, based on the value of m.
Case 1: m < 1/(4ε)
Since ε < Δ/2, Lemma 15 implies that there exists a pair {h, h′} ⊆ H such that ρD(h, h′) > 2ε but ρD(h, h′) ≤ 4ε. Using the bound on m and the fact that ε ≤ 1/5, we have

PrX~D^m[SX,h = SX,h′] ≥ (1 − 4ε)^m ≥ (1 − 4ε)^(1/(4ε)) ≥ 1/8.

This means that PrX~D^m,A[hA := A(SX,h) = A(SX,h′)] ≥ 1/8. By the triangle inequality, BD(h, ε) ∩ BD(h′, ε) = ∅. So if, say, PrX~D^m,A[hA ∈ BD(h, ε)] ≥ 1/8, then PrX~D^m,A[hA ∉ BD(h′, ε)] ≥ 1/8. Therefore PrX~D^m,A[hA ∉ BD(h*, ε)] ≥ 1/8 for at least one h* ∈ {h, h′}.
Case 2: 1/(4ε) ≤ m < (CV/ε) log(C/ε)
First, let r > 0 be the solution to the equation (CV/r) log(C/r) = m, so that r > ε. Moreover, the bounds on m and ε imply that r < Δ/6. Finally, using the bound on α, the definition of d′, and the fact that r > ε, we obtain

m < log(2^(d″−1) − 1)/(32αr),

where d″ := ddim12r(H, ρD). The conditions of Lemma 24 are thus satisfied at scale r, which means there exists h* ∈ H such that PrX~D^m,A[A(SX,h*) ∉ BD(h*, r)] ≥ 1/8; since r > ε, this also gives PrX~D^m,A[A(SX,h*) ∉ BD(h*, ε)] ≥ 1/8.
Case 3: (CV/ε) log(C/ε) ≤ m < log(2^(d−1) − 1)/(32αε)
The conditions of Lemma 24 are satisfied in this case with r := ε < Δ/6, so there exists h* ∈ H such that PrX~D^m,A[ρD(h*, A(SX,h*)) > ε] ≥ 1/8.
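For reference, the three ranges of m treated above are (assuming, as the case analysis indicates, that the theorem's hypothesis restricts m to lie below the Case 3 threshold):

\[
m < \frac{1}{4\varepsilon},
\qquad
\frac{1}{4\varepsilon} \le m < \frac{CV}{\varepsilon}\log\frac{C}{\varepsilon},
\qquad
\frac{CV}{\varepsilon}\log\frac{C}{\varepsilon} \le m < \frac{\log\bigl(2^{\,d-1}-1\bigr)}{32\,\alpha\,\varepsilon}.
\]

Every m below the Case 3 threshold falls into one of these ranges, and in each range the argument exhibits an h* ∈ H that the algorithm fails to learn to ρD-accuracy ε with probability at least 1/8.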
D.5. Example
The following lemma shows that if D is the uniform distribution on the unit sphere S^(n−1) in ℝn, then ddimr(H, ρD) ≥ n − 2 for all scales r > 0.
Lemma 25
Let H := {hv : v ∈ S^(n−1)} be the class of linear separators through the origin in ℝn, where hv(x) := sign(⟨v, x⟩), and let D be the uniform distribution on the unit sphere S^(n−1). For any u ∈ S^(n−1) and any r > 0, there exists an (r/2)-packing of (BD(hu, r), ρD) of size at least 2^(n−2).
Proof
Let μ be the uniform distribution over the unit sphere S^(n−1); via the correspondence v ↦ hv, we also regard μ as the uniform distribution over H.
We call a pair of hypotheses hv and hw in H close if ρD(hv, hw) ≤ r/2. Observe that any set of hypotheses containing no close pair is an (r/2)-packing.
Using a technique due to Long (1995), we now construct an (r/2)-packing of BD(hu, r) by first randomly choosing hypotheses in BD(hu, r), and then removing hypotheses until no close pairs remain. First, we bound the probability p that two hypotheses hv and hw, chosen independently and uniformly at random from BD(hu, r), are close: by symmetry, and by the fact that BD(hu, r) corresponds to an (n−1)-dimensional spherical cap of S^(n−1), one can show that p ≤ 2^(−(n−1)). Now choose N := 2^(n−1) hypotheses hv1, …, hvN independently and uniformly at random from BD(hu, r). The expected number of close pairs among these N hypotheses is at most

N(N − 1)/2 · p < N^2 · p/2 ≤ 2^(2n−3) · 2^(−(n−1)) = 2^(n−2) =: M.

Therefore, there exist N hypotheses hv1, …, hvN in BD(hu, r) among which there are at most M close pairs. Removing one hypothesis from each such close pair leaves a set of at least N − M hypotheses with no close pairs; this is our (r/2)-packing of BD(hu, r). Since N = 2^(n−1) and M = 2^(n−2), the cardinality of this packing is at least

N − M = 2^(n−1) − 2^(n−2) = 2^(n−2).
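The random construction in the proof is straightforward to simulate numerically. The sketch below (not from the paper; all function names are ours) relies on only two facts used above: hypotheses correspond to unit vectors, and ρD(hv, hw) equals the angle between v and w divided by π under the uniform distribution on the sphere. It draws random hypotheses from BD(hu, r) by rejection sampling and keeps a draw only if it is not close to any previously kept one; like the deletion step in the proof, this leaves a set with no close pairs, i.e., an (r/2)-packing.

import numpy as np

def sample_in_cap(u, r, rng):
    # Rejection-sample a unit vector v with angle(u, v) <= pi * r,
    # i.e., a hypothesis h_v inside the ball B_D(h_u, r).
    while True:
        v = rng.normal(size=u.shape)
        v /= np.linalg.norm(v)
        if np.arccos(np.clip(v @ u, -1.0, 1.0)) <= np.pi * r:
            return v

def rho(v, w):
    # Disagreement probability of h_v and h_w under the uniform
    # distribution on the unit sphere: angle(v, w) / pi.
    return np.arccos(np.clip(v @ w, -1.0, 1.0)) / np.pi

def random_packing(n, r, num_samples, seed=0):
    # Draw hypotheses from B_D(h_u, r); keep a draw only if it is not
    # "close" (rho <= r/2) to any previously kept draw. The kept set
    # contains no close pair, hence it is an (r/2)-packing.
    rng = np.random.default_rng(seed)
    u = np.eye(n)[0]  # any fixed unit vector u defines h_u
    kept = []
    for _ in range(num_samples):
        v = sample_in_cap(u, r, rng)
        if all(rho(v, w) > r / 2 for w in kept):
            kept.append(v)
    return kept

packing = random_packing(n=8, r=0.25, num_samples=256)
print("packing size:", len(packing))  # a lower bound on the (r/2)-packing number of B_D(h_u, r)

The size printed at the end lower-bounds the (r/2)-packing number of BD(hu, r) for the chosen n and r; increasing num_samples can only increase it.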
Appendix E. Upper bounds for learning with α-label privacy
The algorithm for learning with α-label privacy, given in Figure 3, differs from the algorithms for learning with α-privacy in that it is able to use the unlabeled data itself to construct a finite set of candidate hypotheses. The algorithm and its analysis are very similar to work due to Chaudhuri et al. (2006); we give the details for completeness.
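Figure 3 itself is not reproduced here, so the following sketch only illustrates the overall pattern just described, under the assumption (suggested by the proof of Theorem 26 below) that the label-dependent step chooses among finitely many candidates with the exponential mechanism of McSherry and Talwar (2007). The candidate-construction step, which in the algorithm is an (ε/4)-cover built from the unlabeled points, is abstracted into a function argument, and all names are hypothetical.

import math
import random

def select_with_label_privacy(candidates, sample, alpha, rng=random):
    # Exponential-mechanism selection among a finite candidate set.
    # 'candidates' is a list of hypotheses h with h(x) in {-1, +1};
    # 'sample' is a list of (x, y) pairs with y in {-1, +1}.
    # Changing a single label changes each mistake count by at most 1,
    # so weights proportional to exp(-alpha * mistakes / 2) give
    # alpha-label privacy.
    mistakes = [sum(1 for x, y in sample if h(x) != y) for h in candidates]
    weights = [math.exp(-alpha * k / 2.0) for k in mistakes]
    total = sum(weights)
    threshold = rng.random() * total
    cumulative = 0.0
    for h, w in zip(candidates, weights):
        cumulative += w
        if cumulative >= threshold:
            return h
    return candidates[-1]  # numerical fallback

def learn_with_label_privacy(sample, alpha, build_candidates):
    # 'build_candidates' maps the unlabeled points to a finite hypothesis
    # set (in the algorithm of Figure 3, an (eps/4)-cover of the class);
    # only the final selection step looks at the labels.
    unlabeled = [x for x, _ in sample]
    candidates = build_candidates(unlabeled)
    return select_with_label_privacy(candidates, sample, alpha)

Scoring candidates by their raw mistake counts makes the score's sensitivity to a single label change exactly 1, which is what the α/2 factor in the weights accounts for.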
Theorem 26
The algorithm in Figure 3 preserves α-label privacy.
Proof
The algorithm accesses the labels in S only in its final step. It follows from standard arguments in McSherry and Talwar (2007) that α-label privacy is guaranteed.
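For completeness, here is the standard calculation behind that argument, written under the same assumption as above: the final step selects g from a finite candidate set G (built from the unlabeled data alone) with probability proportional to exp(−αm · err(g, S)/2). If S and S′ share the same unlabeled points and differ in a single label, then G is identical for both and |err(g, S) − err(g, S′)| ≤ 1/m for every g ∈ G, so

\[
\frac{\Pr[\text{output} = g \mid S]}{\Pr[\text{output} = g \mid S']}
= \frac{e^{-\alpha m\,\mathrm{err}(g,S)/2}}{e^{-\alpha m\,\mathrm{err}(g,S')/2}}
  \cdot
  \frac{\sum_{g' \in G} e^{-\alpha m\,\mathrm{err}(g',S')/2}}{\sum_{g' \in G} e^{-\alpha m\,\mathrm{err}(g',S)/2}}
\;\le\; e^{\alpha/2} \cdot e^{\alpha/2} \;=\; e^{\alpha},
\]

which is exactly the α-label-privacy guarantee.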
Theorem 27
Let the data be drawn from a probability distribution over χ × {±1} whose marginal over χ is D, let η := infh′∈H Pr(x,y)[h′(x) ≠ y] be the best achievable error over H, and let V be the VC dimension of H. There exists a universal constant C > 0 such that for any α, ε, δ ∈ (0, 1), the following holds. If S ⊆ χ × {±1} is an i.i.d. random sample of size m from this distribution, where m is sufficiently large as a function of V, η, α, ε, and δ, then with probability at least 1 − δ, the hypothesis hA ∈ H returned by the algorithm on input (S, α, ε, δ) satisfies Pr(x,y)[hA(x) ≠ y] ≤ η + ε.
Remark 28
The dependence of the sample size requirement on the VC dimension can be replaced by distribution-based quantities used for characterizing uniform convergence, such as those based on ℓ1-covering numbers (Pollard, 1984).
Proof
Let err(h) := Pr(x,y)[h(x) ≠ y], where (x, y) is drawn from the joint distribution, and let h* ∈ H minimize err(h) over h ∈ H. Let S := {(x1, y1), …, (xm, ym)} be the i.i.d. sample drawn from the joint distribution, and let X := {x1, …, xm} be the unlabeled components of S. Let G denote the finite set of candidate hypotheses constructed by the algorithm from X, and let g0 ∈ G minimize err(g, S) over g ∈ G. Since G is an (ε/4)-cover for (H, ρX), we have err(g0, S) ≤ infh′∈H err(h′, S) + ε/4. Since G is also an (ε/4)-packing for (H, ρX), the size |G| is at most the (ε/4)-packing number of (H, ρX) (Pollard, 1984). Let F := {fh : h ∈ H}, where fh(x, y) := 1[h(x) ≠ y]. We have E[fh(x, y)] = err(h) and m^(−1) Σ(x,y)∈S fh(x, y) = err(h, S). Let E be the event that for all h ∈ H,

err(h) ≤ err(h, S) + √(err(h, S) · εm) + εm  and  err(h, S) ≤ err(h) + √(err(h) · εm) + εm,

where εm := (8V log(2em/V) + 4 log(16/δ))/m. By Lemma 16, the fact that S(F, n) = S(H, n), and a union bound, we have PrS[E] ≥ 1 − δ/2. Now let E′ be the event that the returned hypothesis hA satisfies

err(hA, S) ≤ err(g0, S) + tm,
where tm := 2 log(2 ES[|G|]/δ)/(αm). The probability of E′ can be bounded as

PrS,A[E′ fails] ≤ ES[|G| · e^(−αm·tm/2)] = ES[|G|] · δ/(2 ES[|G|]) = δ/2,

where the first inequality follows from Lemma 18, and the final equality follows from the definition of tm. By the union bound, PrS,A[E ∩ E′] ≥ 1 − δ. On the event E ∩ E′, we have

err(hA, S) ≤ err(h*, S) + ε/4 + tm,

since err(g0, S) ≤ infh′∈H err(h′, S) + ε/4 ≤ err(h*, S) + ε/4. By various algebraic manipulations (using the event E to pass between empirical and true errors), this in turn implies

err(hA) ≤ err(h*) + ε/4 + tm + C′(√((err(h*) + ε) · εm) + εm)

for some constant C′ > 0. The lower bound on m now implies the theorem.
Footnotes
Here the Õ notation hides factors logarithmic in 1/ε
Contributor Information
Kamalika Chaudhuri, Email: kamalika@cs.ucsd.edu, University of California, San Diego, 9500 Gilman Drive #0404, La Jolla, CA 92093-0404.
Daniel Hsu, Email: dahsu@microsoft.com, Microsoft Research New England, One Memorial Drive, Cambridge, MA 02142.
References
- Agrawal R, Srikant R. Privacy-preserving data mining. SIGMOD Rec. 2000;29(2):439–450. doi:10.1145/335191.335438.
- Backstrom Lars, Dwork Cynthia, Kleinberg Jon M. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In: Williamson Carey L, Zurko Mary Ellen, Patel-Schneider Peter F, Shenoy Prashant J, editors. WWW. ACM; 2007. pp. 181–190.
- Ball K. An elementary introduction to modern convex geometry. In: Levy Silvio, editor. Flavors of Geometry. Vol. 31. 1997.
- Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. PODS. 2007:273–282.
- Beimel Amos, Kasiviswanathan Shiva Prasad, Nissim Kobbi. Bounds on the sample complexity for private learning and private data release. TCC. 2010:437–454.
- Blum A, Dwork C, McSherry F, Nissim K. Practical privacy: the SuLQ framework. PODS. 2005:128–138.
- Blum A, Ligett K, Roth A. A learning theory approach to non-interactive database privacy. In: Ladner RE, Dwork C, editors. STOC. ACM; 2008. pp. 609–618.
- Bshouty Nader H, Li Yi, Long Philip M. Using the doubling dimension to analyze the generalization of learning algorithms. J Comput Syst Sci. 2009;75(6):323–335.
- Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K. Learning concept classes with privacy. Manuscript. 2006.
- Chaudhuri K, Monteleoni C, Sarwate A. Differentially private empirical risk minimization. Journal of Machine Learning Research. 2011.
- Chaudhuri Kamalika, Mishra Nina. When random sampling preserves privacy. In: Dwork Cynthia, editor. CRYPTO. Vol. 4117 of Lecture Notes in Computer Science. Springer; 2006. pp. 198–213.
- Dasgupta Sanjoy. Coarse sample complexity bounds for active learning. NIPS. 2005.
- Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. 3rd IACR Theory of Cryptography Conference; 2006. pp. 265–284.
- Evfimievski A, Gehrke J, Srikant R. Limiting privacy breaches in privacy preserving data mining. PODS. 2003:211–222.
- Freund Yoav, Seung H Sebastian, Shamir Eli, Tishby Naftali. Selective sampling using the query by committee algorithm. Machine Learning. 1997;28(2–3):133–168.
- Ganta Srivatsava Ranjit, Kasiviswanathan Shiva Prasad, Smith Adam. Composition attacks and auxiliary information in data privacy. KDD. 2008a:265–273.
- Ganta Srivatsava Ranjit, Kasiviswanathan Shiva Prasad, Smith Adam. Composition attacks and auxiliary information in data privacy. KDD. 2008b:265–273.
- Gupta Anupam, Krauthgamer Robert, Lee James R. Bounded geometries, fractals, and low-distortion embeddings. FOCS. 2003:534–543.
- Gupta Anupam, Hardt Moritz, Roth Aaron, Ullman Jonathan. Privately releasing conjunctions and the statistical query barrier. STOC. 2011.
- Hardt Moritz, Talwar Kunal. On the geometry of differential privacy. STOC. 2010:705–714.
- Jones Rosie, Kumar Ravi, Pang Bo, Tomkins Andrew. "I know what you did last summer": query logs and user privacy. CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management; New York, NY, USA. ACM; 2007. pp. 909–914. doi:10.1145/1321440.1321573.
- Kasiviswanathan SA, Lee HK, Nissim K, Raskhodnikova S, Smith A. What can we learn privately? Proc. of Foundations of Computer Science. 2008.
- Kasiviswanathan Shiva Prasad, Smith Adam. A note on differential privacy: Defining resistance to arbitrary side information. CoRR. 2008; abs/0803.3946.
- Kolmogorov A, Tikhomirov V. ε-entropy and ε-capacity of sets in function spaces. Translations of the American Mathematical Society. 1961;17:277–364.
- Long PM. On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Transactions on Neural Networks. 1995;6(6):1556–1559. doi:10.1109/72.471352.
- Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. ℓ-diversity: Privacy beyond k-anonymity. Proc. of ICDE. 2006.
- McAllester David A. Some PAC-Bayesian theorems. COLT. 1998:230–234.
- McSherry Frank, Mironov Ilya. Differentially private recommender systems: building privacy into the Netflix Prize contenders. KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; New York, NY, USA. ACM; 2009. pp. 627–636. doi:10.1145/1557019.1557090.
- McSherry Frank, Talwar Kunal. Mechanism design via differential privacy. FOCS. 2007:94–103.
- Narayanan Arvind, Shmatikov Vitaly. Robust de-anonymization of large sparse datasets. IEEE Symposium on Security and Privacy; Oakland, CA, USA. May 2008; IEEE Computer Society; pp. 111–125.
- Nissim Kobbi, Raskhodnikova Sofya, Smith Adam. Smooth sensitivity and sampling in private data analysis. In: Johnson David S, Feige Uriel, editors. STOC. ACM; 2007. pp. 75–84.
- Pollard D. Convergence of Stochastic Processes. Springer-Verlag; 1984.
- Roth Aaron. Differential privacy and the fat-shattering dimension of linear queries. APPROX-RANDOM. 2010:683–695.
- Sweeney L. k-anonymity: a model for protecting privacy. Int. J. on Uncertainty, Fuzziness and Knowledge-Based Systems. 2002.
- Vapnik VN, Chervonenkis AY. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications. 1971;16(2):264–280.
- Wang Rui, Li Yong Fuga, Wang XiaoFeng, Tang Haixu, Zhou Xiaoyong. Learning your identity and disease from research papers: information leaks in genome wide association study. ACM Conference on Computer and Communications Security; 2009. pp. 534–544.
- Yao Andrew Chi-Chih. Protocols for secure computations (extended abstract). FOCS. 1982:160–164.
- Zhou Shuheng, Ligett Katrina, Wasserman Larry. Differential privacy with compression. Proc. of ISIT. 2009.