Author manuscript; available in PMC 2014 Oct 2.
Published in final edited form as: JMLR Workshop Conf Proc. 2011;2011:155–186.

Sample Complexity Bounds for Differentially Private Learning

Kamalika Chaudhuri, Daniel Hsu
PMCID: PMC4183222  NIHMSID: NIHMS464313  PMID: 25285183

Abstract

This work studies the problem of privacy-preserving classification – namely, learning a classifier from sensitive data while preserving the privacy of individuals in the training set. In particular, the learning algorithm is required to guarantee differential privacy, a very strong notion of privacy that has gained significant attention in recent years.

A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? We address this question in the context of learning with infinite hypothesis classes when the data is drawn from a continuous distribution. We first show that even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible, as previously shown by Kasiviswanathan et al. (2008).

We then consider two approaches to differentially private learning that get around this lower bound. The first approach is to use prior knowledge about the unlabeled data distribution in the form of a reference distribution 𝒰 chosen independently of the sensitive data. Given such a reference 𝒰, we provide an upper bound on the sample requirement that depends (among other things) on a measure of closeness between 𝒰 and the unlabeled data distribution. Our upper bound applies to the non-realizable as well as the realizable case. The second approach is to relax the privacy requirement by requiring only label privacy – namely, that only the labels (and not the unlabeled parts of the examples) be considered sensitive information. An upper bound on the sample requirement of learning with label privacy was shown by Chaudhuri et al. (2006); in this work, we show a lower bound.

Keywords: Privacy, generalization, PAC-learning

1. Introduction

As increasing amounts of personal data are collected, stored, and mined by companies and government agencies, the question of how to learn from sensitive datasets while still maintaining the privacy of individuals in the data has become very important. Over the last few years, the notion of differential privacy (Dwork et al., 2006) has received a significant amount of attention and has become the de facto standard for privacy-preserving computation. In this paper, we study the problem of learning a classifier from a dataset while simultaneously guaranteeing differential privacy of the training data.

The key issue in differentially-private computation is that given a certain amount of resources, there is usually a tradeoff between privacy and utility. In classification, a natural measure of utility is the classification accuracy, and data is a scarce resource. Thus, a key question in differentially-private learning is: how many examples does a learning algorithm need to guarantee a certain level of privacy and accuracy? In this paper, we study this question from an information-theoretic perspective – namely, we are concerned with the sample complexity, and not the computational complexity of the learner.

This question was first considered by Kasiviswanathan et al. (2008), who studied the case of finite hypothesis classes, as well as the case of discrete data domains. They showed that in these two cases, one can obtain any given privacy guarantee and generalization error, regardless of the unlabeled data distribution, with only a modest increase in the worst-case sample requirement.

In this paper, we consider the sample complexity of differentially private learning in the context of infinite hypothesis classes on continuous data distributions. This is a very general class of learning problems, and includes many popular machine-learning tasks such as learning linear classifiers when the examples have real-valued features, which cannot be modeled by finite hypothesis classes or hypothesis classes over discrete data domains.

Surprisingly, we show that the results of Kasiviswanathan et al. (2008) do not extend to infinite hypothesis classes on continuous data distributions. As an example, consider the class of thresholds on the unit interval. This simple learning problem has VC dimension 1, and thus for all unlabeled data distributions, it can be learnt (non-privately) with error ε given at most O(1/ε) examples. We show that even for this very simple hypothesis class, any algorithm that uses a bounded number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions.

The key intuition behind our proof is that if most of the unlabeled data is concentrated in a small region around the best classifier, then even slightly perturbing the best classifier will result in a large classification error. As the process of ensuring differential privacy necessarily involves some perturbation (see, for example, Dwork et al., 2006), unless the algorithm has some prior public knowledge about the data distribution, the number of samples required to learn privately grows with growing concentration of the data around the best classifier.

How can we then learn privately in infinite hypothesis classes over continuous data distributions? One approach is to use some prior information about the data distribution that is known independently of the sensitive data. Another approach is to relax the privacy requirements. In this paper, we examine both approaches.

First, we consider the case when the learner has access to some prior information on the unlabeled data. In particular, the learner knows a reference distribution 𝒰 that is close to the unlabeled data distribution. Similar assumptions are common in Bayesian learning, and PAC-Bayes style bounds have also been studied in the learning theory literature, for example, by McAllester (1998).

Under this assumption, we provide an algorithm for learning with α-privacy, excess generalization error ε, and confidence 1 − δ, using O(d_𝒰 log(κ/ε)(1/ε² + 1/(εα))) samples. Here α is a privacy parameter (lower α implies a stronger privacy guarantee), 𝒰 is the reference distribution, d_𝒰 is the doubling dimension of its disagreement metric (Bshouty et al., 2009), and κ is a smoothness parameter that we define. The quantity d_𝒰 measures the complexity of the hypothesis class with respect to 𝒰 (see Bshouty et al. (2009) for a discussion), and we assume that it is finite. The smoothness parameter measures how close the unlabeled data distribution is to 𝒰 (smaller κ means closer), and is motivated by notions of closeness used by Dasgupta (2005) and Freund et al. (1997). Thus the sample requirement of our algorithm grows with increasing distance between 𝒰 and the unlabeled data distribution. Our algorithm works in the non-realizable case, that is, when no hypothesis in the class has zero error; using standard techniques, a slightly better bound of O(d_𝒰 log(κ/ε)/(εα)) can be obtained in the realizable setting. However, like the results of Kasiviswanathan et al. (2008), our algorithm is computationally inefficient in general.

The main difficulty in extending the differentially private learning algorithms of Kasiviswanathan et al. (2008) to infinite hypothesis classes on continuous data distributions is in finding a suitable finite cover of the class with respect to the unlabeled data. This issue is specific to our particular problem: for non-private learning, a finite cover can always be computed based on the (sensitive) data, and for finite hypothesis classes, the entire class is a cover. The main insight behind our upper bound is that when the unlabeled distribution 𝒟 is close to the reference distribution 𝒰, a cover of the hypothesis class with respect to 𝒰 is also a (possibly coarser) cover with respect to 𝒟. Since a cover with respect to 𝒰 can be computed independently of the sensitive data, we simply compute a finer cover with respect to 𝒰 and learn over this fine cover using standard techniques such as the exponential mechanism (McSherry and Talwar, 2007).

Next, we relax the privacy requirement by requiring only label privacy. In other words, we assume that the unlabeled parts of the examples are not sensitive, and the only private information is the labels. This setting was considered by Chaudhuri et al. (2006). An example where this may be applicable is predicting income from public demographic information: while the label (income) is private, the demographic information of individuals, such as education, gender, and age, may be public.

In this case, we provide lower bounds to characterize the sample requirement of label-private learning. We show two results, based on the values of α and ε. For small ε and α (that is, for high privacy and accuracy), we show that any learning algorithm for a given hypothesis class that guarantees α-label privacy and ε accuracy necessarily requires at least Ω(d/(αε)) examples. Here d is the doubling dimension of the disagreement metric at a certain scale, and is a measure of the complexity of the hypothesis class on the unlabeled data distribution. This bound holds when the hypothesis class has finite VC dimension. For larger α and ε, our bounds are weaker but more general: we show a lower bound of Ω(d′/α) on the sample requirement that holds for any α and ε and does not require the VC dimension of the hypothesis class to be finite. Here d′ is the doubling dimension of the disagreement metric at a certain scale.

The main idea behind our stronger label privacy lower bounds is to show that differentially private learning algorithms necessarily perform poorly when there is a large set of hypotheses such that every pair in the set labels approximately 1/α examples differently. We then show that such large sets can be constructed when the doubling dimension of the disagreement metric of the hypothesis class with respect to the data distribution is high.

How do these results fit into the context of non-private learning? For non-private learning, sample requirement bounds based on the doubling dimension of the disagreement metric have been extensively studied by Bshouty et al. (2009); in the realizable case, they show an upper bound of O(d̄/ε) for learning with accuracy ε, where d̄ is again the doubling dimension of the disagreement metric at a certain scale. These bounds are incomparable to ours in general, as the doubling dimensions in the two bounds are with respect to different scales; however, we can compare them for hypothesis classes and data distributions for which the doubling dimension of the disagreement metric is equal at all scales. An example is learning halfspaces with respect to the uniform distribution on the sphere. For such problems, on the upper bound side, we need a factor of O(log(κ/ε)/α) more examples to learn with α-privacy. On the other hand, our lower bounds indicate that for small α and ε, even if we only want α-label privacy, the sample requirement can be as much as a factor of Ω(1/α) more than the upper bound for non-private learning.

Finally, one may be tempted to think that we can always discretize a data domain or a hypothesis class, and therefore in practice we are likely to only learn finite hypothesis classes or over discrete data domains. However, there are several issues with such discretization. First, if we discretize either the hypothesis class or the data, then the sample requirement of differentially private learning algorithms will grow as the discretization grows finer, instead of depending on intrinsic properties of the problem. Second, as our α-privacy lower bound example shows, indiscriminate discretization without prior knowledge of the data can drastically degrade the performance of the best classifier in a class. Finally, infinite hypothesis classes and continuous data domains provide a natural abstraction for designing many machine learning algorithms, such as those based on convex optimization or differential geometry. Understanding the limitations of differentially private learning on such hypothesis classes and data domains is useful in designing differentially private approximations to these algorithms.

The rest of our paper is organized as follows. In Section 2, we define some preliminary notation, and explain our privacy model. In Section 3, we present our α-privacy lower bound. Our α-privacy upper bound is provided in Section 4. In Section 5, we provide some lower bounds on the sample requirement of learning with α-label privacy. Finally, the proofs of most of our results are in the appendix.

1.1. Related work

The work most related to ours is Kasiviswanathan et al. (2008), Blum et al. (2008) and Beimel et al. (2010), each of which deals with either finite hypothesis classes or discrete data domains.

Kasiviswanathan et al. (2008) initiated the study of the sample requirement of differentially private learning. They provided a (computationally inefficient) α-private algorithm that learns any finite hypothesis class ℋ with error at most ε using at most O(log|ℋ|/(αε)) examples in the realizable case. For the non-realizable case, they provided an algorithm with a sample requirement of O(log|ℋ| · (1/(αε) + 1/ε²)). Moreover, using a result from Blum et al. (2008), they provided a computationally inefficient α-private algorithm that learns a hypothesis class with VC dimension V and data dimension n with at most O(nV/(αε³)) examples, provided the data domain is {−1, 1}ⁿ. The latter result does not apply when the data is drawn from a continuous distribution; moreover, their results cannot be directly extended to the continuous case.

The first work to study lower bounds on the sample requirement of differentially private learning was Beimel et al. (2010). They show that any α-private algorithm that selects a hypothesis from a specific set C_ε requires at least Ω̃(log(|C_ε|)/α) samples to achieve error ε. Here C_ε is an ε-cover as well as an ε-packing of the hypothesis class ℋ with respect to every distribution over the discrete data domain. They also show an upper bound of Õ(log(|C_ε|)/(αε)). Such a cover C_ε does not exist for continuous data domains; as a result, their upper bounds do not apply to our setting. Moreover, unlike our lower bounds, their lower bound only applies to algorithms of a specific form (namely, those that output a hypothesis in C_ε), and it also does not apply when we only require the labels to be private.

For the setting of label privacy, Chaudhuri et al. (2006) show an upper bound for PAC-learning in terms of the VC dimension of the hypothesis class. We show a result very similar to theirs in the appendix for completeness, and we show lower bounds for learning with label-privacy which indicate that their bounds are almost tight, in terms of the dependence on α and ε.

Zhou et al. (2009) study some issues in defining differential privacy when dealing with continuous outcomes; however, they do not consider the question of learning classifiers on such data.

Finally, a lot of our work uses tools from the theory of generalization bounds. In particular, some of our upper and lower bounds are inspired by Bshouty et al. (2009), which bounds the sample complexity of (non-private) classification in terms of the doubling dimension of the disagreement metric.

Other related work on privacy

The issue of privacy in the analysis of sensitive data has long been a source of problems for curators of such data, much of it due to the realization that many simple and intuitive mechanisms designed to protect privacy are simply ineffective. For instance, Narayanan and Shmatikov (2008) showed that an anonymized dataset released by Netflix revealed enough information that an adversary who knew just a few of the movies rated by a particular user could uniquely identify that user in the dataset and determine all of his movie ratings. Similar attacks have been demonstrated on private data in other domains as well, including social networks (Backstrom et al., 2007) and search engine query logs (Jones et al., 2007). Even releasing coarse statistics without proper privacy safeguards can be problematic. This was recently shown by Wang et al. (2009) in the context of genetic data, where a correlation matrix of genetic markers compiled from a group of individuals contained enough clues to uniquely pinpoint individuals in the dataset and learn their private information, such as whether or not they had certain diseases.

In order to reason about privacy guarantees (or lack thereof), we need a formal definition of what it means to preserve privacy. In our work, we adopt the notion of differential privacy due to Dwork et al. (2006), which has over the last few years gained much popularity. Differential privacy is known to be a very strong notion of privacy: it has strong semantic guarantees (Kasiviswanathan and Smith, 2008) and is resistant to attacks that many earlier privacy definitions are susceptible to (Ganta et al., 2008b).

There has been a significant amount of work on differential privacy applied to a wide variety of data analysis tasks (Dwork et al., 2006; Chaudhuri and Mishra, 2006; Nissim et al., 2007; Barak et al., 2007; McSherry and Mironov, 2009). Some work that is relevant to ours includes Blum et al. (2008), which provides a general method for publishing datasets on discrete data domains while preserving differential privacy, so that the answers to queries from a function class with bounded VC dimension are approximately preserved after applying the sanitization procedure. More work along this line includes Roth (2010) and Gupta et al. (2011). A number of learning algorithms have also been suitably modified to guarantee differential privacy. For instance, both the class of statistical query algorithms and the class of methods based on L2-regularized empirical risk minimization with certain types of convex losses can be made differentially private (Blum et al., 2005; Chaudhuri et al., 2011).

There has also been some prior work on lower bounds on the loss of accuracy that any differentially private mechanism must suffer; much of this work is in the context of releasing the answers to a number of queries made on a database of n individuals. The first such work is by Blum et al. (2008), which shows that no differentially private mechanism can release the answers to a number of median queries with a certain amount of accuracy when the data lies on the real line. This result is similar in spirit to our Theorem 5, but applies to a much harder problem, namely data release. Other relevant work includes Hardt and Talwar (2010), which uses a packing argument similar to ours to lower bound the amount of noise any differentially private mechanism needs to add to the answers to k linear queries on a database of n people.

There has also been a significant amount of prior work on privacy-preserving data mining (Agrawal and Srikant, 2000; Evfimievski et al., 2003; Sweeney, 2002; Machanavajjhala et al., 2006), which spans several communities and uses privacy models other than differential privacy. Many of the models used have been shown to be susceptible to various attacks, such as composition attacks, where the adversary has some amount of prior knowledge (Ganta et al., 2008a). An alternative line of privacy work is in the Secure Multiparty Computation setting due to Yao (1982), where the sensitive data is split across several adversarial databases, and the goal is to compute a function on the union of these databases. This is in contrast with our setting, where a single centralized algorithm can access the entire dataset.

2. Preliminaries

2.1. Privacy model

We use the differential privacy model of Dwork et al. (2006). In this model, a private database DB consists of m sensitive entries from some domain; each entry in DB is a record about an individual (e.g., their medical history) that one wishes to keep private.

The database DB is accessed by users through a sanitizer M. The sanitizer, a randomized algorithm, is said to preserve differential privacy if the value of any one individual in the database does not significantly alter the output distribution of M.

Definition 1

A randomized mechanism M guarantees α-differential privacy if, for all databases DB1 and DB2 that differ by the value of at most one individual, and for every set G of possible outputs of M,

Pr_M[M(DB_1) ∈ G] ≤ Pr_M[M(DB_2) ∈ G] · e^α.

We emphasize that the probability in the definition above is only with respect to the internal randomization of the algorithm; it is independent of all other random sources, including any that may have generated the values of the input database.

Differential privacy is a strong notion of privacy (Dwork et al., 2006; Kasiviswanathan and Smith, 2008; Ganta et al., 2008b). In particular, if a sanitizer M ensures α-differential privacy, then, an adversary who knows the private values of all the individuals in the database except for one and has arbitrary prior knowledge about the value of the last individual, cannot gain additional confidence about the private value of the last individual by observing the output of a differentially private sanitizer. The level of privacy is controlled by α, where a lower value of α implies a stronger guarantee of privacy.
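To make Definition 1 concrete, the following is a minimal sketch (ours, not from the paper) of randomized response on a single binary attribute: each record's bit is reported truthfully with probability e^α/(1 + e^α) and flipped otherwise, so changing one individual's value changes the probability of any output by a factor of at most e^α. The function name and parameters are illustrative assumptions.

    import math
    import random

    def randomized_response(bits, alpha, seed=0):
        """Report each private bit truthfully with probability e^alpha / (1 + e^alpha).

        Changing one individual's bit changes the probability of any output
        vector by a factor of at most e^alpha, so this mechanism satisfies
        alpha-differential privacy in the sense of Definition 1.
        """
        rng = random.Random(seed)
        p_truth = math.exp(alpha) / (1.0 + math.exp(alpha))
        return [b if rng.random() < p_truth else 1 - b for b in bits]

    # Example: a small "database" of sensitive bits, released with alpha = 0.5.
    private_db = [1, 0, 0, 1, 1]
    print(randomized_response(private_db, alpha=0.5))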

2.2. Learning model

We consider a standard probabilistic learning model for binary classification. Let 𝒫 be a distribution over 𝒳 × {±1}, where 𝒳 is the data domain and {±1} are the possible labels. We use 𝒟 to denote the marginal of 𝒫 over the data domain 𝒳. The classification error of a hypothesis h: 𝒳 → {±1} with respect to the distribution 𝒫 is

Pr_{(x,y)~𝒫}[h(x) ≠ y].

We denote by S ~ 𝒫^m an i.i.d. draw of m labeled examples S = {(x_1, y_1), …, (x_m, y_m)} ⊆ 𝒳 × {±1} from the distribution 𝒫. This process can equivalently be seen as drawing an unlabeled sample X := {x_1, …, x_m} from the marginal 𝒟, and then, for each x ∈ X, drawing the corresponding label y from the induced conditional distribution.

A learning algorithm is given as input a set of m labeled examples S ~ 𝒫^m, a target accuracy parameter ε ∈ (0, 1), and a target confidence parameter δ ∈ (0, 1). Its goal is to return a hypothesis h: 𝒳 → {±1} such that its excess generalization error with respect to a specified hypothesis class ℋ,

Pr_{(x,y)~𝒫}[h(x) ≠ y] − inf_{h′∈ℋ} Pr_{(x,y)~𝒫}[h′(x) ≠ y],

is at most ε with probability at least 1 − δ over the random choice of the sample S ~ 𝒫^m, as well as any internal randomness of the algorithm.

We also occasionally adopt the realizable assumption (with respect to ℋ). The realizable assumption states that there exists some h* ∈ ℋ such that Pr_{(x,y)~𝒫}[h*(x) ≠ y] = 0. In this case, the excess generalization error of a hypothesis h is simply its classification error. Without the realizable assumption, there may be no classifier in the hypothesis class ℋ with zero classification error, and we refer to this as the non-realizable case.

2.3. Privacy-preserving classification

In privacy-preserving classification, we assume that the database is a training dataset drawn i.i.d. from some data distribution 𝒫, and that the sanitization mechanism is a learning algorithm that outputs a classifier based on the training data. In this paper, we consider two possible privacy requirements on our learning algorithms.

Definition 2

A randomized learning algorithm 𝒜 guarantees α-label privacy (𝒜 is α-label private) if, for any two datasets S_1 = {(x_1, y_1), …, (x_{m−1}, y_{m−1}), (x_m, y_m)} and S_2 = {(x_1, y_1), …, (x_{m−1}, y_{m−1}), (x_m, y_m′)} differing in at most one label y_m′, and any set of outputs G of 𝒜,

Pr_𝒜[𝒜(S_1) ∈ G] ≤ Pr_𝒜[𝒜(S_2) ∈ G] · e^α.

Definition 3

A randomized learning algorithm 𝒜 guarantees α-privacy (𝒜 is α-private) if, for any two datasets S_1 = {(x_1, y_1), …, (x_{m−1}, y_{m−1}), (x_m, y_m)} and S_2 = {(x_1, y_1), …, (x_{m−1}, y_{m−1}), (x_m′, y_m′)} differing in at most one example (x_m′, y_m′), and any set of outputs G of 𝒜,

Pr_𝒜[𝒜(S_1) ∈ G] ≤ Pr_𝒜[𝒜(S_2) ∈ G] · e^α.

Note that if the input dataset S is a random variable, then for any value S′ ⊆ 𝒳 × {±1} in the range of S, the conditional probability distribution of 𝒜(S) | S = S′ is determined only by the algorithm 𝒜 and the value S′; it is independent of the distribution of the random variable S. Therefore, for instance,

Pr_{S,𝒜}[𝒜(S) ∈ G | S = S′] = Pr_𝒜[𝒜(S′) ∈ G]

for any S′ ⊆ 𝒳 × {±1} and any set of outputs G.

The difference between the two notions of privacy is that for α-label privacy, the two databases can differ only in the label of one example, whereas for α-privacy, the two databases can differ in a complete example (both its label and its unlabeled part). Thus, α-label privacy only ensures the privacy of the label component of each example; it makes no guarantees about the unlabeled part. If a classification algorithm guarantees α-privacy, then it also guarantees α-label privacy. Thus α-label privacy is a weaker notion of privacy than α-privacy.

The notion of label privacy was also considered by Chaudhuri et al. (2006), who provided an algorithm for learning with label privacy. For strict privacy, one would require the learning algorithm to guarantee α-privacy; however, label privacy may also be a useful notion. For example, if the data x represents public demographic information (e.g., age, zip code, education) while the label y represents income level, an individual may consider the label to be private but may not mind if others can infer her demographic information (which could be relatively public already) from her inclusion in the database.

Thus, the goal of an α-private (resp. α-label private) learning algorithm is as follows. Given a dataset S of size m, a privacy parameter α, a target accuracy ε, and a target confidence parameter δ:

  1. guarantee α-privacy (resp. α-label privacy) of the training dataset S;

  2. with probability at least 1 − δ over both the random choice of S ~ 𝒫^m and the internal randomness of the algorithm, return a hypothesis h: 𝒳 → {±1} with excess generalization error
    Pr_{(x,y)~𝒫}[h(x) ≠ y] − inf_{h′∈ℋ} Pr_{(x,y)~𝒫}[h′(x) ≠ y] ≤ ε.
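One simple way to meet the α-label privacy requirement (a sketch of ours, not the algorithm analyzed in this paper) is to apply randomized response to the labels only and then run any non-private learner on the noisy labels; by post-processing, the output is α-label private. The learner interface and names below are assumptions.

    import math
    import random

    def label_private_wrapper(S, learner, alpha, seed=0):
        """Flip each label independently with probability 1/(1 + e^alpha), then run
        an arbitrary (non-private) learner on the noisy labels. Changing one label
        changes the probability of any noisy dataset by a factor of at most e^alpha,
        and post-processing preserves that guarantee, so the output is alpha-label
        private. The learner must of course be chosen to tolerate this label noise."""
        rng = random.Random(seed)
        p_flip = 1.0 / (1.0 + math.exp(alpha))
        noisy_S = [(x, -y if rng.random() < p_flip else y) for x, y in S]
        return learner(noisy_S)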

2.4. Additional definitions and notation

We now present some additional essential definitions and notation.

Metric spaces, doubling dimension, covers, and packings

A metric space (𝒵, ρ) is a pair, where 𝒵 is a set of elements and ρ is a distance function from 𝒵 × 𝒵 to {0} ∪ ℝ⁺. Let (𝒵, ρ) be an arbitrary metric space. For any z ∈ 𝒵 and r > 0, let B(z, r) = {z′ ∈ 𝒵 : ρ(z, z′) ≤ r} denote the ball centered at z of radius r.

The diameter of (𝒵, ρ) is sup{ρ(z, z′) : z, z′ ∈ 𝒵}, the longest distance in the space. An ε-cover of (𝒵, ρ) is a set C ⊆ 𝒵 such that for all z ∈ 𝒵, there exists some z′ ∈ C such that ρ(z, z′) ≤ ε. An ε-packing of (𝒵, ρ) is a set P ⊆ 𝒵 such that ρ(z, z′) > ε for all distinct z, z′ ∈ P. Let N_ε(𝒵, ρ) denote the size of the smallest ε-cover of (𝒵, ρ).

We define the doubling dimension of (𝒵, ρ) at scale ε, denoted ddim_ε(𝒵, ρ), as the smallest number d such that each ball B(z, ε) ⊆ 𝒵 of radius ε can be covered by at most ⌊2^d⌋ balls of radius ε/2, i.e., there exist z_1, …, z_{⌊2^d⌋} ∈ 𝒵 such that B(z, ε) ⊆ B(z_1, ε/2) ∪ … ∪ B(z_{⌊2^d⌋}, ε/2). Notice that ddim_ε(𝒵, ρ) may increase or decrease with ε. The doubling dimension of (𝒵, ρ) is sup{ddim_r(𝒵, ρ) : r > 0}.
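As an illustration of these definitions, the following sketch (ours, not from the paper) greedily builds a set that is simultaneously an ε-packing and an ε-cover of a finite metric space, the kind of object guaranteed by Lemma 13 in Appendix A; the function names and the toy metric are assumptions for the example.

    def greedy_packing_cover(points, dist, eps):
        """Greedily select centers so that selected points are pairwise > eps apart
        (an eps-packing) and every point is within eps of some center (an eps-cover)."""
        centers = []
        for p in points:
            if all(dist(p, c) > eps for c in centers):
                centers.append(p)
        return centers

    # Toy example: thresholds on a grid of [0, 1] under the disagreement metric
    # with respect to the uniform distribution, which is just |t - t'|.
    grid = [i / 100.0 for i in range(101)]
    cover = greedy_packing_cover(grid, lambda a, b: abs(a - b), eps=0.1)
    print(cover)  # pairwise distances > 0.1, and every grid point is within 0.1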

Disagreement metrics

The disagreement metric of a hypothesis class ℋ with respect to a data distribution 𝒟 over 𝒳 is the metric (ℋ, ρ_𝒟), where ρ_𝒟 is the following distance function:

ρ_𝒟(h, h′) := Pr_{x~𝒟}[h(x) ≠ h′(x)].

The empirical disagreement metric of a hypothesis class ℋ with respect to an unlabeled dataset X ⊆ 𝒳 is the metric (ℋ, ρ_X), where ρ_X is the following distance function:

ρ_X(h, h′) := (1/|X|) Σ_{x∈X} 𝟙[h(x) ≠ h′(x)].

The disagreement metric (resp. empirical disagreement metric) measures the proportion of unlabeled examples on which h and h′ disagree with respect to 𝒟 (resp. the uniform distribution over X). We use the notation B_𝒟(h, r) to denote the ball centered at h of radius r with respect to ρ_𝒟, and B_X(h, r) to denote the ball centered at h of radius r with respect to ρ_X.

Datasets and empirical error

For an unlabeled dataset X ⊆ 𝒳 and a hypothesis h: 𝒳 → {±1}, we denote by S_{X,h} := {(x, h(x)) : x ∈ X} the labeled dataset induced by labeling X with h. The empirical error of a hypothesis h: 𝒳 → {±1} with respect to a labeled dataset S ⊆ 𝒳 × {±1} is err(h, S) := (1/|S|) Σ_{(x,y)∈S} 𝟙[h(x) ≠ y], the average number of mistakes that h makes on S; note that ρ_X(h, h′) = err(h, S_{X,h′}). Finally, we informally use the Õ(·) notation to hide log(1/δ) factors, as well as factors that are logarithmic in those that do appear.
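The following small sketch (ours, for illustration only) computes the empirical disagreement metric ρ_X and the empirical error err(h, S), and checks the identity ρ_X(h, h′) = err(h, S_{X,h′}) for the threshold class; the helper names are assumptions.

    def rho_X(h1, h2, X):
        """Empirical disagreement metric: fraction of X on which h1 and h2 disagree."""
        return sum(h1(x) != h2(x) for x in X) / len(X)

    def err(h, S):
        """Empirical error of h on a labeled dataset S of (x, y) pairs."""
        return sum(h(x) != y for x, y in S) / len(S)

    def threshold(w):
        return lambda x: 1 if x >= w else -1

    X = [i / 10.0 for i in range(11)]            # unlabeled dataset
    h, h_prime = threshold(0.3), threshold(0.6)
    S_X_hprime = [(x, h_prime(x)) for x in X]    # X labeled by h'
    assert rho_X(h, h_prime, X) == err(h, S_X_hprime)
    print(rho_X(h, h_prime, X))                  # 3 of the 11 points disagree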

3. Lower bounds for learning with α-privacy

In this section, we show a lower bound on the sample requirement of learning with α-privacy. In particular, we give an example illustrating that when the data is drawn from a continuous distribution, for any M, every α-private algorithm that is supplied with at most M examples fails to output a good classifier for at least one unlabeled data distribution.

Our example hypothesis class is the class of thresholds on [0, 1]. This simple class has VC dimension 1, and thus can be learnt non-privately with classification error ε given only Õ(1/ε) examples, regardless of the unlabeled data distribution. However, Theorem 5 shows that even in the realizable case, for every α-private algorithm that is given a bounded number of examples, there is at least one unlabeled data distribution on which the learning algorithm produces a classifier with error at least 1/5, with probability at least 1/2 over its own random coins.

The key intuition behind our example is that if most of the unlabeled data is concentrated in a small region around the best classifier, then, even slightly perturbing the best classifier will result in a large classification error. As the process of ensuring differential privacy necessarily involves some perturbation, unless the algorithm has some prior public knowledge about the data distribution, the number of samples required to learn privately grows with growing concentration of the data around the best classifier. As illustrated by our theorem, this problem is not alleviated if the support of the unlabeled distribution is known; even if the data distribution has large support, a large fraction of the data can still lie in a region close to the best classifier.

Before we describe our example in detail, we first need a definition.

Definition 4

The class of thresholds on the unit interval is the class of functions h_w: [0, 1] → {−1, 1} such that:

h_w(x) := 1 if x ≥ w, and h_w(x) := −1 otherwise.

Theorem 5

Let M > 2 be any number, and let ℋ be the class of thresholds on the unit interval [0, 1]. For any α-private algorithm 𝒜 that outputs a hypothesis in ℋ, there exists a distribution 𝒫 on labeled examples with the following properties:

  1. There exists a threshold h* ∈ ℋ with classification error 0 with respect to 𝒫.

  2. For all samples S of size m ≤ M drawn from 𝒫, with probability at least 1/2 over the random coins of 𝒜, the hypothesis output by 𝒜(S) has classification error at least 1/5 with respect to 𝒫.

  3. The marginal 𝒟 of 𝒫 over the unlabeled data has support [0, 1].

Proof

Let η = 1/(6 + 4 exp(αM)), and let 𝒰 denote the uniform distribution over [0, 1]. Let Z = {η, 2η, …, Kη}, where K = ⌊1/η⌋ − 1. We let G_z = [z − η/3, z + η/3] for z ∈ Z, and let ℋ_{G_z} ⊆ ℋ be the subset of thresholds ℋ_{G_z} = {h_τ : τ ∈ G_z}. We note that G_z ⊆ [0, 1] for all z ∈ Z.

For each z ∈ Z, we define a distribution 𝒫_z over labeled examples as follows. First, we describe the marginal 𝒟_z of 𝒫_z over the unlabeled data. A sample x from 𝒟_z is drawn as follows: with probability 1/2, x is drawn from 𝒰; with probability 1/2, it is drawn uniformly from G_z. An unlabeled example x drawn from 𝒟_z is labeled positive if x ≥ z, and negative otherwise. We observe that for every such distribution 𝒫_z, there exists a threshold, namely h_z, that has classification error 0; in addition, the support of 𝒟_z is [0, 1]. Moreover, there are ⌊1/η⌋ − 1 such distributions 𝒫_z in all, and ⌊1/η⌋ − 1 ≥ 5.

We say that an α-private algorithm 𝒜 succeeds on a sample S with respect to a distribution 𝒫_z if, with probability at least 1/2 over the random coins of 𝒜, the hypothesis output by 𝒜(S) has classification error less than 1/5 over 𝒫_z.

Suppose for the sake of contradiction that there exists an α-private algorithm 𝒜* such that for every distribution 𝒫_z, there is at least one sample S of size ≤ M drawn from 𝒫_z on which 𝒜* succeeds with respect to 𝒫_z. Then, for every z ∈ Z, there exists a sample S_z of size m ≤ M drawn from 𝒫_z such that 𝒜* succeeds on S_z with respect to 𝒫_z.

By construction, the G_z's are disjoint, so

Pr_{𝒜*}[𝒜*(S_z) ∉ ℋ_{G_z}] ≥ Σ_{z′∈Z\{z}} Pr_{𝒜*}[𝒜*(S_z) ∈ ℋ_{G_{z′}}].   (1)

Furthermore, any S_z differs from S_{z′} in at most m labeled examples, so because 𝒜* is α-private, Lemma 22 implies that for any z′,

Pr_{𝒜*}[𝒜*(S_z) ∈ ℋ_{G_{z′}}] ≥ e^{−αm} Pr_{𝒜*}[𝒜*(S_{z′}) ∈ ℋ_{G_{z′}}].   (2)

If 𝒜*(S_{z′}) lies outside ℋ_{G_{z′}}, then 𝒜*(S_{z′}) misclassifies at least a 1/4 fraction of the examples from 𝒟_{z′}, so 𝒜* can succeed on S_{z′} with respect to 𝒫_{z′} only if it outputs a hypothesis in ℋ_{G_{z′}} with probability at least 1/2. Therefore, by the assumption on 𝒜*, for any z′,

Pr_{𝒜*}[𝒜*(S_{z′}) ∈ ℋ_{G_{z′}}] ≥ 1/2.   (3)

Combining Equations (1), (2), and (3) gives the inequality

Pr_{𝒜*}[𝒜*(S_z) ∉ ℋ_{G_z}] ≥ e^{−αm} · Σ_{z′∈Z\{z}} 1/2 ≥ e^{−αm} · (⌊1/η⌋ − 2) · (1/2).

Since m ≤ M, the quantity on the right-hand side of the above inequality is more than 2/3. Therefore 𝒜* does not succeed on S_z with respect to 𝒫_z, which is a contradiction.
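To see the quantities in this construction concretely, here is a small simulation (ours, not part of the paper) of the distribution 𝒟_z: half the mass is uniform on [0, 1] and half is uniform on G_z = [z − η/3, z + η/3]; any threshold outside G_z then misclassifies at least roughly a 1/4 fraction of the data. The parameter values are illustrative.

    import random

    def sample_D_z(z, eta, m, rng):
        """Draw m points from D_z: w.p. 1/2 uniform on [0,1], w.p. 1/2 uniform on G_z."""
        xs = []
        for _ in range(m):
            if rng.random() < 0.5:
                xs.append(rng.uniform(0.0, 1.0))
            else:
                xs.append(rng.uniform(z - eta / 3, z + eta / 3))
        return xs

    def error_of_threshold(w, z, xs):
        """Fraction of points mislabeled by h_w when the true labels come from h_z."""
        return sum((x >= w) != (x >= z) for x in xs) / len(xs)

    rng = random.Random(0)
    z, eta = 0.5, 0.05
    xs = sample_D_z(z, eta, m=100000, rng=rng)
    print(error_of_threshold(z, z, xs))                    # 0.0: the true threshold
    print(error_of_threshold(z - eta / 3 - 0.001, z, xs))  # roughly 0.25 or more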

4. Upper bounds for learning with α-privacy

In this section, we show an upper bound on the sample requirement of learning with α-privacy by presenting a learning algorithm that works on infinite hypothesis classes over continuous data domains, under certain conditions on the hypothesis class and the data distribution. Our algorithm works in the non-realizable case, that is, when there may be no hypothesis in the target hypothesis class with zero classification error.

A natural way to extend the algorithm of Kasiviswanathan et al. (2008) to an infinite hypothesis class ℋ is to compute a suitable finite subset 𝒢 of ℋ that contains a hypothesis with low excess generalization error, and then use the exponential mechanism of McSherry and Talwar (2007) on 𝒢. To ensure that a hypothesis with low error is indeed in 𝒢, we would like 𝒢 to be an ε-cover of the disagreement metric (ℋ, ρ_𝒟). In non-private or label-private learning, we can compute such a 𝒢 directly from the unlabeled training examples; in our setting, the training examples themselves are sensitive, and this approach does not directly apply.

The key idea behind our algorithm is that instead of using the sensitive data to compute 𝒢, we can use a reference distribution 𝒰 that is known independently of the sensitive data. For instance, if the domain of the unlabeled data is bounded, then a reasonable choice for 𝒰 is the uniform distribution over the domain. Our key observation is that if 𝒰 is close to the unlabeled data distribution 𝒟, according to a certain measure of closeness inspired by Dasgupta (2005) and Freund et al. (1997), then a cover of the disagreement metric on ℋ with respect to 𝒰 is a (possibly coarser) cover of the disagreement metric on ℋ with respect to 𝒟. Thus we can set 𝒢 to be a fine cover of (ℋ, ρ_𝒰), and this cover can be computed privately as it is independent of the sensitive data.

Our algorithm works when the doubling dimension of (ℋ, ρ_𝒰) is finite; under this condition, such a finite cover 𝒢 always exists. We note that this is a fairly weak condition that is satisfied by many hypothesis classes and data distributions. For example, any hypothesis class with finite VC dimension satisfies this condition for any reference distribution 𝒰.

Finally, it may be tempting to think that one can further improve the sample requirement of our algorithm by using the sensitive data to privately refine a cover of (ℋ, ρ_𝒰) into a cover of (ℋ, ρ_𝒟). However, our calculations show that naively refining such a cover leads to a much higher sample requirement.

We now define our notion of closeness.

Definition 6

We say that a data distribution 𝒟 is κ-smooth with respect to a distribution 𝒰, for some κ ≥ 1, if for all measurable sets A ⊆ 𝒳,

Pr_{x~𝒟}[x ∈ A] ≤ κ · Pr_{x~𝒰}[x ∈ A].

This notion of smoothness is very similar to, but weaker than, the notions of closeness between distributions used by Dasgupta (2005) and Freund et al. (1997). We note that if 𝒟 has a bounded density with respect to 𝒰 (in particular, 𝒟 assigns zero probability to any set to which 𝒰 assigns zero probability), then 𝒟 is κ-smooth with respect to 𝒰 for some finite κ.
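The following sketch (ours, illustrative only) checks κ-smoothness for distributions with densities: when 𝒟 has density p and 𝒰 has density q, the smallest valid κ is the essential supremum of p/q, which the code approximates on a grid for the density used later in Example 1. The grid-based estimate and names are assumptions.

    def smoothness_kappa(p, q, grid):
        """Approximate the smallest kappa with p(x) <= kappa * q(x) on a grid,
        i.e., an estimate of the density-ratio bound sup_x p(x) / q(x)."""
        return max(p(x) / q(x) for x in grid)

    # Example 1's density: 1/2 + 3/(4*eta) on [z - eta/3, z + eta/3], and 1/2 elsewhere.
    eta, z = 0.05, 0.5

    def p_Dz(x):
        return 0.5 + 3.0 / (4.0 * eta) if abs(x - z) <= eta / 3 else 0.5

    def uniform(x):
        return 1.0  # density of the reference distribution U = Uniform[0, 1]

    grid = [i / 1000.0 for i in range(1001)]
    print(smoothness_kappa(p_Dz, uniform, grid))  # = 1/2 + 3/(4*eta) = 15.5 here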

4.1. Algorithm

Our main algorithm, Algorithm 𝒜_1, is given in Figure 1. The first step of the algorithm calculates the distance scale at which it should construct a cover of (ℋ, ρ_𝒰). This scale is a function of |S|, the size of the input dataset S, and can be computed privately because |S| is not sensitive information. A suitable cover of (ℋ, ρ_𝒰) that is also a suitable packing of (ℋ, ρ_𝒰) is then constructed; such a set always exists by Lemma 13. In the final step, the exponential mechanism (McSherry and Talwar, 2007) is used to select a hypothesis with low empirical error from the cover. As this step of the algorithm is the only one that uses the input data, the algorithm is α-private as long as this last step guarantees α-privacy.

Figure 1: Learning algorithm for α-privacy.
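Figure 1 is not reproduced in this version, so the following is a minimal sketch (ours, under stated assumptions) of the steps described above: build an (ε/(4κ̂))-cover/packing of (ℋ, ρ_𝒰) independently of the data, then select a hypothesis with the exponential mechanism using weights exp(−α|S|·err(g, S)/2), as in the proof of Theorem 8. The cover-construction oracle and the exact scale computation are assumptions, so κ̂ is passed in as a parameter here.

    import math
    import random

    def exponential_mechanism(cover, emp_err, alpha, m, rng):
        """Sample g from the cover with probability proportional to
        exp(-alpha * m * err(g, S) / 2). The score m * err(g, S) has sensitivity 1,
        so this selection step is alpha-differentially private."""
        weights = [math.exp(-alpha * m * emp_err(g) / 2.0) for g in cover]
        total = sum(weights)
        r = rng.uniform(0.0, total)
        acc = 0.0
        for g, w in zip(cover, weights):
            acc += w
            if r <= acc:
                return g
        return cover[-1]

    def private_learn(S, build_cover, alpha, eps, kappa_hat, seed=0):
        """Sketch of Algorithm A_1: cover (H, rho_U) at scale eps/(4*kappa_hat),
        then privately pick a low-empirical-error hypothesis from the cover."""
        rng = random.Random(seed)
        cover = build_cover(eps / (4.0 * kappa_hat))  # independent of the sensitive data S

        def emp_err(g):
            return sum(g(x) != y for x, y in S) / len(S)

        return exponential_mechanism(cover, emp_err, alpha, len(S), rng)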

4.2. Privacy and learning guarantees

Our first theorem states the privacy guarantee of Algorithm 𝒜_1.

Theorem 7

Algorithm 𝒜_1 preserves α-privacy.

Proof

The algorithm accesses the private dataset S only in the final step. Because changing one labeled example in S changes |S| · err(g, S) by at most 1, this step guarantees α-privacy (McSherry and Talwar, 2007).

The next theorem provides an upper bound on the sample requirement of Algorithm 𝒜_1. This bound depends on the doubling dimension d_𝒰 of (ℋ, ρ_𝒰) and the smoothness parameter κ, as well as the privacy and learning parameters α, ε, and δ.

Theorem 8

Let 𝒫 be a distribution over 𝒳 × {±1} whose marginal over 𝒳 is 𝒟. There exists a universal constant C > 0 such that for any α, ε, δ ∈ (0, 1), the following holds. If

  1. the doubling dimension d_𝒰 of (ℋ, ρ_𝒰) is finite,

  2. 𝒟 is κ-smooth with respect to 𝒰,

  3. S ⊆ 𝒳 × {±1} is an i.i.d. random sample from 𝒫 such that
    |S| ≥ C · (1/(αε) + 1/ε²) · (d_𝒰 · log(κ/ε) + log(1/δ)),   (4)
    then with probability at least 1 − δ, the hypothesis h_𝒜 ∈ ℋ returned by 𝒜_1(S, 𝒰, α, ε, δ) satisfies
    Pr_{(x,y)~𝒫}[h_𝒜(x) ≠ y] ≤ inf_{h∈ℋ} Pr_{(x,y)~𝒫}[h(x) ≠ y] + ε.

The proof of Theorem 8 is given in Appendix C. If we have prior knowledge that some hypothesis in ℋ has zero error (the realizability assumption), then the sample requirement can be improved with a slightly modified version of Algorithm 𝒜_1; this modified algorithm is given in Figure 2 in Appendix C.

Figure 3: Learning algorithm for α-label privacy.

Theorem 9

Let 𝒫 be any probability distribution over 𝒳 × {±1} whose marginal over 𝒳 is 𝒟. There exists a universal constant C > 0 such that for any α, ε, δ ∈ (0, 1), the following holds. If

  1. the doubling dimension d_𝒰 of (ℋ, ρ_𝒰) is finite,

  2. 𝒟 is κ-smooth with respect to 𝒰,

  3. S ⊆ 𝒳 × {±1} is an i.i.d. random sample from 𝒫 such that
    |S| ≥ C · (1/(αε)) · (d_𝒰 · log(κ/ε) + log(1/δ)),   (5)
  4. there exists h* ∈ ℋ with Pr_{(x,y)~𝒫}[h*(x) ≠ y] = 0, then with probability at least 1 − δ, the hypothesis h_𝒜 ∈ ℋ returned by the modified algorithm (Figure 2) on input (S, 𝒰, α, ε, δ) satisfies
    Pr_{(x,y)~𝒫}[h_𝒜(x) ≠ y] ≤ ε.

Again, the proof of Theorem 9 is in Appendix C.

4.3. Examples

In this section, we give some examples that illustrate the sample requirement of Algorithm 𝒜_1.

First, we consider the example from the lower bound given in the proof of Theorem 5.

Example 1

The domain of the data is 𝒳 := [0, 1], and the hypothesis class is the class of thresholds ℋ := {h_t : t ∈ [0, 1]} (recall, h_t(x) = 1 if and only if x ≥ t). A natural choice for the reference distribution 𝒰 is the uniform distribution over [0, 1]; the doubling dimension of (ℋ, ρ_𝒰) is 1 because every interval can be covered by two intervals of half the length. Fix some M > 0 and α ∈ (0, 1), and let η := 1/(6 + 4 exp(αM)). For z ∈ [η, 1 − η], let 𝒟_z be the distribution on [0, 1] with density

p_{𝒟_z}(x) := 1/2 + 3/(4η) if z − η/3 ≤ x ≤ z + η/3, and p_{𝒟_z}(x) := 1/2 if 0 ≤ x < z − η/3 or z + η/3 < x ≤ 1.

Clearly, 𝒟_z is κ-smooth with respect to 𝒰 for κ = 1/2 + 3/(4η) = O(exp(αM)). Therefore the sample requirement of Algorithm 𝒜_1 to learn with α-privacy and excess generalization error ε is at most

C · (1/(εα) + 1/ε²) · (αM + log(1/δ)),

which is Õ(M) for constant ε, matching the lower bound from Theorem 5 up to constants.

Next, we consider two examples in which the domain of the unlabeled data is the unit sphere in ℝⁿ:

S^{n−1} := {x ∈ ℝⁿ : ||x|| = 1},

and the target hypothesis class ℋ := ℋ_linear is the class of linear separators that pass through the origin in ℝⁿ:

ℋ_linear := {h_w : w ∈ S^{n−1}}, where h_w(x) = 1 if and only if w · x ≥ 0.

The examples will consider two different unlabeled data distributions over S^{n−1}.

A natural reference distribution in this setting is the uniform distribution over S^{n−1}; this will be our reference distribution 𝒰. It is known that d_𝒰 := sup{ddim_r(ℋ, ρ_𝒰) : r ≥ 0} = O(n) (Bshouty et al., 2009).
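For linear separators under the uniform distribution on the sphere, the disagreement metric has the closed form ρ_𝒰(h_w, h_{w′}) = arccos(w · w′)/π (the angle between the two normals as a fraction of π). The sketch below (ours, illustrative) checks this by Monte Carlo; the estimation routine and parameters are assumptions.

    import math
    import random

    def rho_U_halfspaces(w, w_prime):
        """Closed form: angle between the normals divided by pi."""
        dot = sum(a * b for a, b in zip(w, w_prime))
        return math.acos(max(-1.0, min(1.0, dot))) / math.pi

    def rho_U_monte_carlo(w, w_prime, n, trials=100000, seed=0):
        """Estimate Pr_{x uniform on the sphere}[sign(w.x) != sign(w'.x)]."""
        rng = random.Random(seed)
        disagree = 0
        for _ in range(trials):
            x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Gaussian direction is uniform on the sphere
            s1 = sum(a * b for a, b in zip(w, x)) >= 0
            s2 = sum(a * b for a, b in zip(w_prime, x)) >= 0
            disagree += (s1 != s2)
        return disagree / trials

    w = [1.0, 0.0, 0.0]
    w_prime = [math.cos(0.3), math.sin(0.3), 0.0]  # normals at angle 0.3 radians
    print(rho_U_halfspaces(w, w_prime))            # 0.3 / pi, about 0.0955
    print(rho_U_monte_carlo(w, w_prime, n=3))      # close to the same value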

Example 2

We consider a case where the unlabeled data distribution 𝒟 is concentrated near an equator of S^{n−1}. More formally, for some vector u ∈ S^{n−1} and γ ∈ (0, 1), we let 𝒟 be uniform over W := {x ∈ S^{n−1} : |u · x| ≤ γ}; in other words, the unlabeled data lies in a small band of width γ around the equator.

By Lemma 20 (see Appendix C), 𝒟 is κ-smooth with respect to 𝒰 for κ = 1/(1 − 2 exp(−nγ²/2)). Thus the sample requirement of Algorithm 𝒜_1 to learn with α-privacy and excess generalization error ε is at most

C · (1/(αε) + 1/ε²) · (n · log((1/ε) · 1/(1 − 2 exp(−nγ²/2))) + log(1/δ)).

When n is large and γ ≳ 1/√n, this bound is Õ(n/(αε) + n/ε²), where the Õ notation hides factors logarithmic in 1/δ and 1/ε.

Example 3

Now we consider the case where the unlabeled data lies on two diametrically opposite spherical caps. More formally, for some vector u ∈ S^{n−1} and γ ∈ (0, 1), we now let 𝒟 be uniform over S^{n−1} \ W, where W := {x ∈ S^{n−1} : |u · x| ≤ γ}; in other words, the unlabeled data lies outside a band of width γ around the equator.

By Lemma 21 (see Appendix C), 𝒟 is κ-smooth with respect to 𝒰 for κ = (2/(1 − γ))^{(n−1)/2}. Thus the sample requirement of Algorithm 𝒜_1 to learn with α-privacy and excess generalization error ε is at most

C · (1/(αε) + 1/ε²) · (n² · log(2/(1 − γ)) + n · log(1/ε) + log(1/δ)).

Thus, for large n and constant γ < 1, the sample requirement of Algorithm 𝒜_1 is O(n²/ε² + n²/(αε)). So, even though the smoothness parameter κ is exponential in the dimension n, the sample requirement remains polynomial in n.

5. Lower bounds for learning with α-label privacy

In this section, we provide two lower bounds on the sample complexity of learning with α-label privacy. Our first lower bound holds when α and ε are small (that is, for high privacy and high accuracy) and when the hypothesis class has bounded VC dimension V. If these conditions hold, then we show a lower bound of Ω(d/(εα)), where d is the doubling dimension of the disagreement metric (ℋ, ρ_𝒟) at some scale.

The main idea behind our bound is to show that differentially private learning algorithms necessarily perform poorly when there is a large set of hypotheses such that every pair in the set labels approximately 1/α examples differently. We then show that such large sets can be constructed when the doubling dimension of the disagreement metric (ℋ, ρ_𝒟) is high.

5.1. Main results

Theorem 10

There exists a constant c > 0 such that the following holds. Let ℋ be a hypothesis class with VC dimension V < ∞, 𝒟 be a distribution over 𝒳, X be an i.i.d. sample from 𝒟 of size m, and 𝒜 be a learning algorithm that guarantees α-label privacy and outputs a hypothesis in ℋ. Let d := ddim_{12ε}(ℋ, ρ_𝒟) > 2, and d′ := inf{ddim_{12r}(ℋ, ρ_𝒟) : ε ≤ r < Δ/6} > 2. If

ε < c · Δ/(V(1 + log(1/Δ))),   α ≤ c · d′/(V log(1/ε)),   and   m < c · d/(αε),

where Δ is the diameter of (ℋ, ρ_𝒟), then there exists a hypothesis h* ∈ ℋ such that with probability at least 1/8 over the random choice of X and the internal randomness of 𝒜, the hypothesis h_𝒜 returned by 𝒜(S_{X,h*}) has classification error

Pr_{x~𝒟}[h_𝒜(x) ≠ h*(x)] > ε.

We note that the conditions on α and ε can be relaxed by replacing the VC dimension with other (possibly distribution-dependent) quantities that determine the uniform convergence of ρ_X to ρ_𝒟; we use a distribution-free parameter to simplify the argument. Moreover, the condition on ε can be relaxed to ε < c for some constant c ∈ (0, 1), provided that there is a lower bound of Ω(V/ε) for (non-privately) learning ℋ under the distribution 𝒟.

The proof of Theorem 10, which is in Appendix D, relies on the following lemma (possibly of independent interest) which gives a lower bound on the empirical error of the hypothesis returned by an α-label private learning algorithm.

Lemma 11

Let X ⊆ 𝒳 be an unlabeled dataset of size m, ℋ be a hypothesis class, 𝒜 be a learning algorithm that guarantees α-label privacy, and s > 0. Pick any h_0 ∈ ℋ. If P is an s-packing of B_X(h_0, 4s) ⊆ ℋ, and

m < log(|P|/2 − 1)/(8αs),

then there exists a subset Q ⊆ P such that

  1. |Q| ≥ |P|/2;

  2. for all h ∈ Q, Pr_𝒜[𝒜(S_{X,h}) ∉ B_X(h, s/2)] ≥ 1/2.

The proof of Lemma 11 is in Appendix D. The next theorem shows a lower bound without restrictions on ε and α. Moreover, this bound also applies when the VC dimension of the hypothesis class is unbounded. However, we note that this bound is weaker in that it does not involve a 1/ε factor, where ε is the accuracy parameter.

Theorem 12

Let ℋ be a hypothesis class, 𝒟 be a distribution over 𝒳, X be an i.i.d. sample from 𝒟 of size m, and 𝒜 be a learning algorithm that guarantees α-label privacy and outputs a hypothesis in ℋ. Let d″ := ddim_{4ε}(ℋ, ρ_𝒟) ≥ 1. If ε ≤ Δ/2 and

m ≤ (d″ − 1) log(2)/α,

where Δ is the diameter of (ℋ, ρ_𝒟), then there exists h* ∈ ℋ such that with probability at least 1/2 over the random choice of X and the internal randomness of 𝒜, the hypothesis h_𝒜 returned by 𝒜(S_{X,h*}) has classification error

Pr_{x~𝒟}[h_𝒜(x) ≠ h*(x)] > ε.

In other words, any α-label private algorithm that learns a hypothesis in ℋ with error at most ε ≤ Δ/2 must use at least (d″ − 1) log(2)/α examples. Theorem 12 uses ideas similar to those in Beimel et al. (2010), but the result is stronger in that it applies to α-label privacy and to continuous data domains. A detailed proof is provided in Appendix D.

5.2. Example: linear separators in ℝn

In this section, we give an example that illustrates our label privacy lower bounds. Our example hypothesis class ℋ := ℋ_linear is the class of linear separators over ℝⁿ that pass through the origin, and the unlabeled data distribution 𝒟 is the uniform distribution over the unit sphere S^{n−1}. By Lemma 25 (see Appendix D), the doubling dimension of (ℋ, ρ_𝒟) at any scale r is at least n − 2. Therefore Theorem 10 implies that if α and ε are small enough, any α-label private algorithm 𝒜 that correctly learns all hypotheses h ∈ ℋ with error at most ε requires at least Ω(n/(εα)) examples. (In fact, the condition on ε can be relaxed to ε ≤ c for some constant c ∈ (0, 1), because Ω(n) examples are needed to even non-privately learn in this setting (Long, 1995).) We also observe that this bound is tight (except for a log(1/δ) factor): as the doubling dimension of (ℋ, ρ_𝒟) is O(n), in the realizable case, the α-label private algorithm in Figure 3 with 𝒰 := 𝒟 learns linear separators with α-label privacy given Õ(n/(αε)) examples.

Figure 2: Learning algorithm for α-privacy under the realizable assumption.

Acknowledgments

KC would like to thank NIH U54 HL108460 for research support. DH was partially supported by AFOSR FA9550-09-1-0425, NSF IIS-1016061, and NSF IIS-713540.

Appendix A. Metric spaces

Lemma 13 (Kolmogorov and Tikhomirov, 1961)

For any metric space (𝒵, ρ) with diameter Δ and any ε ∈ (0, Δ), there exists an ε-packing of (𝒵, ρ) that is also an ε-cover.

Lemma 14 (Gupta, Krauthgamer, and Lee, 2003)

For any ε > 0 and r > 0, if a metric space (𝒵, ρ) has doubling dimension d and z ∈ 𝒵, then every ε-packing of (B(z, r), ρ) has cardinality at most (4r/ε)^d.

Lemma 15

Let (𝒵, ρ) be a metric space with diameter Δ, and let r ∈ (0, 2Δ). If ddim_r(𝒵, ρ) ≥ d, then there exists z ∈ 𝒵 such that B(z, r) has an (r/2)-packing of size at least 2^d.

Proof

Fix r ∈ (0, 2Δ) and a metric space (𝒵, ρ) with diameter Δ. Suppose that for every z ∈ 𝒵, every (r/2)-packing of B(z, r) has size less than 2^d. For each z ∈ 𝒵, let P_z be an (r/2)-packing of (B(z, r), ρ) that is also an (r/2)-cover; such a set is guaranteed to exist by Lemma 13. Therefore, for each z ∈ 𝒵, B(z, r) ⊆ ∪_{z′∈P_z} B(z′, r/2) and |P_z| < 2^d. This implies that ddim_r(𝒵, ρ) is less than d.

Appendix B. Uniform convergence

Lemma 16 (Vapnik and Chervonenkis, 1971)

Let ℱ be a family of measurable functions f: 𝒵 → {0, 1} over a space 𝒵 with distribution 𝒟. Denote by Ê_Z[f] the empirical average of f over a subset Z ⊆ 𝒵. Let ε_m := (4/m)(log(S_ℱ(2m)) + log(4/δ)), where S_ℱ(n) is the n-th VC shatter coefficient of ℱ. Let Z be an i.i.d. sample of size m from 𝒟. With probability at least 1 − δ, for all f ∈ ℱ,

𝔼[f] ≥ Ê_Z[f] − min{√(Ê_Z[f]·ε_m), √(𝔼[f]·ε_m) + ε_m}.

Also, with probability at least 1 − δ, for all f ∈ ℱ,

𝔼[f] ≤ Ê_Z[f] + min{√(𝔼[f]·ε_m), √(Ê_Z[f]·ε_m) + ε_m}.

Lemma 17

Let ℋ be a hypothesis class with VC dimension V. Fix any δ ∈ (0, 1), and let X be an i.i.d. sample of size m ≥ V/2 from 𝒟. Let ε_m := (8V log(2em/V) + 4 log(4/δ))/m. With probability at least 1 − δ, for all pairs of hypotheses {h, h′} ⊆ ℋ,

ρ_𝒟(h, h′) ≥ ρ_X(h, h′) − min{√(ρ_X(h, h′)·ε_m), √(ρ_𝒟(h, h′)·ε_m) + ε_m}.

Also, with probability at least 1 − δ, for all pairs of hypotheses {h, h′} ⊆ ℋ,

ρ_X(h, h′) ≥ ρ_𝒟(h, h′) − √(ρ_𝒟(h, h′)·ε_m).

Proof

This is an immediate consequence of Lemma 16 applied to the function class ℱ := {x ↦ 𝟙[h(x) ≠ h′(x)] : h, h′ ∈ ℋ}, whose VC shatter coefficients satisfy S_ℱ(2m) ≤ S_ℋ(2m)² ≤ (2em/V)^{2V} by Sauer's Lemma.

Appendix C. Proofs from Section 4

C.1. Some lemmas

We first give two simple lemmas. The first, Lemma 18, states some basic properties of the exponential mechanism.

Lemma 18 (McSherry and Talwar, 2007)

Let I be a finite set of indices, and let a_i ∈ ℝ for all i ∈ I. Define the probability distribution p := (p_i : i ∈ I), where p_i ∝ exp(−a_i) for all i ∈ I. If j ∈ I is drawn at random according to p, then the following holds for any element i_0 ∈ I and any t ∈ ℝ.

  1. Let i ∈ I. If a_i ≥ t, then Pr_{j~p}[j = i] ≤ exp(−(t − a_{i_0})).

  2. Pr_{j~p}[a_j ≥ a_{i_0} + t] ≤ |I| exp(−t).

Proof

Fix any i_0 ∈ I and t ∈ ℝ. To show the first part of the lemma, note that for any i ∈ I with a_i ≥ t, we have

Pr_{j~p}[j = i] = exp(−a_i) / Σ_{i′∈I} exp(−a_{i′}) ≤ exp(−t) / exp(−a_{i_0}) = exp(−(t − a_{i_0})).

For the second part, we apply the inequality from the first part (with t replaced by a_{i_0} + t) to every i ∈ I such that a_i ≥ a_{i_0} + t, so

Pr_{j~p}[a_j ≥ a_{i_0} + t] = Σ_{i∈I : a_i ≥ a_{i_0}+t} Pr_{j~p}[j = i] ≤ |I| · exp(−((a_{i_0} + t) − a_{i_0})) = |I| exp(−t).
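As a quick sanity check of the second part of Lemma 18 (ours, not from the paper), the following sketch samples j with probability proportional to exp(−a_j) and compares the empirical tail probability Pr[a_j ≥ a_{i_0} + t] against the bound |I| exp(−t); the scores and parameters are arbitrary.

    import math
    import random

    def sample_exp_mechanism(scores, rng):
        """Draw an index j with probability proportional to exp(-scores[j])."""
        weights = [math.exp(-a) for a in scores]
        total = sum(weights)
        r = rng.uniform(0.0, total)
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if r <= acc:
                return j
        return len(scores) - 1

    rng = random.Random(0)
    scores = [0.0, 0.5, 1.0, 2.5, 4.0]   # the a_i values; i_0 = argmin
    a_i0, t = min(scores), 1.5
    trials = 100000
    hits = sum(scores[sample_exp_mechanism(scores, rng)] >= a_i0 + t for _ in range(trials))
    print(hits / trials, "<=", len(scores) * math.exp(-t))  # empirical tail vs. the bound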

The next lemma is a consequence of the smoothness of 𝒟 with respect to 𝒰.

Lemma 19

If 𝒟 is κ-smooth with respect to 𝒰, then for all ε > 0, every ε-cover of (ℋ, ρ_𝒰) is a κε-cover of (ℋ, ρ_𝒟).

Proof

Suppose C is an ε-cover of (ℋ, ρ_𝒰). Then, for any h ∈ ℋ, there exists h′ ∈ C such that ρ_𝒰(h, h′) ≤ ε. Fix such a pair h, h′, and let A := {x ∈ 𝒳 : h(x) ≠ h′(x)} be the subset of 𝒳 on which h and h′ disagree. As 𝒟 is κ-smooth with respect to 𝒰, by the definition of smoothness,

ρ_𝒟(h, h′) = Pr_{x~𝒟}[x ∈ A] ≤ κ · Pr_{x~𝒰}[x ∈ A] = κ · ρ_𝒰(h, h′) ≤ κε,

and thus C is a κε-cover of (ℋ, ρ_𝒟).

C.2. Proof of Theorem 8

First, because of the lower bound on m := |S| from (4), the value κ̂ computed in the first step of the algorithm must satisfy κ̂ ≥ κ. Therefore, 𝒟 is also κ̂-smooth with respect to 𝒰. Combining this with Lemma 19, since 𝒢 is an (ε/(4κ̂))-cover of (ℋ, ρ_𝒰), 𝒢 is an (ε/4)-cover of (ℋ, ρ_𝒟). Moreover, as 𝒢 is also an (ε/(4κ̂))-packing of (ℋ, ρ_𝒰), Lemma 14 implies that the cardinality of 𝒢 is at most |𝒢| ≤ (16κ̂/ε)^{d_𝒰}.

Define err(h) := Pr_{(x,y)~𝒫}[h(x) ≠ y]. Suppose that h* ∈ ℋ minimizes err(h) over h ∈ ℋ. Let g_0 be an element of 𝒢 such that ρ_𝒟(h*, g_0) ≤ ε/4; such a g_0 exists because 𝒢 is an (ε/4)-cover of (ℋ, ρ_𝒟). By the triangle inequality, we have

err(g_0) ≤ err(h*) + ρ_𝒟(h*, g_0) ≤ err(h*) + ε/4.   (6)

Let E be the event that max_{g∈𝒢} |err(g) − err(g, S)| > ε/4, and let Ē be its complement. By Hoeffding's inequality, a union bound, and the lower bound on |S|, we have that for a large enough value of the constant C in Equation (4),

PrS~Pm[E]GmaxgGPrS~Pm[err(g)-err(g,S)>ε4]2Gexp(-Sε232)δ2.

In the event Ē, we have err(h_A) ≥ err(h_A, S) − ε/4 and err(g_0) ≤ err(g_0, S) + ε/4, because both h_A and g_0 are in G. Therefore,

$$\begin{aligned}
\Pr_{S\sim P^m, A_1}[\mathrm{err}(h_A) > \mathrm{err}(h^*) + \varepsilon]
&\leq \Pr_{S\sim P^m, A_1}\Big[\mathrm{err}(h_A) > \mathrm{err}(g_0) + \frac{3\varepsilon}{4}\Big] \\
&\leq \Pr_{S\sim P^m}[E] + \Pr_{A_1}\Big[\mathrm{err}(h_A) > \mathrm{err}(g_0) + \frac{3\varepsilon}{4} \,\Big|\, \bar{E}\Big] \\
&\leq \frac{\delta}{2} + \Pr_{A_1}\Big[\mathrm{err}(h_A, S) > \mathrm{err}(g_0, S) + \frac{\varepsilon}{4} \,\Big|\, \bar{E}\Big] \\
&\leq \frac{\delta}{2} + |G|\exp\Big(-\frac{\alpha|S|\varepsilon}{8}\Big) \\
&\leq \frac{\delta}{2} + \Big(\frac{16\hat\kappa}{\varepsilon}\Big)^{d_U}\exp\Big(-\frac{\alpha|S|\varepsilon}{8}\Big) \;\leq\; \delta.
\end{aligned}$$

Here, the first step follows from (6), and the final three inequalities follow from Lemma 18 (using a_g = α|S| err(g, S)/2 for g ∈ G), the upper bound on |G|, and the lower bound on m in (4).
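The selection step analyzed above can be sketched as follows. This is a minimal illustration, assuming a finite candidate set G and the scores a_g = α|S| err(g, S)/2 used in the proof; the construction of the cover G and the constants of Equation (4) are not reproduced, and the toy data and threshold class are our own choices.

```python
import numpy as np

def exp_mech_select(candidates, S, alpha, rng):
    """Select a hypothesis from a finite candidate set via the exponential
    mechanism with scores a_g = alpha * |S| * err(g, S) / 2, as in the proof.
    `candidates` maps x-arrays to +/-1 predictions; S is a list of (x, y) pairs."""
    X = np.array([x for x, _ in S])
    y = np.array([label for _, label in S])
    emp_err = np.array([np.mean(g(X) != y) for g in candidates])
    a = alpha * len(S) * emp_err / 2.0
    w = np.exp(-(a - a.min()))               # subtract min(a) for numerical stability
    return candidates[rng.choice(len(candidates), p=w / w.sum())]

# Toy usage with threshold classifiers and synthetic data (illustrative only).
rng = np.random.default_rng(2)
xs = rng.uniform(-1.0, 1.0, size=500)
ys = np.sign(xs - 0.3)
S = list(zip(xs, ys))
G = [(lambda t: (lambda x: np.sign(x - t)))(t) for t in np.linspace(-1, 1, 21)]
h_priv = exp_mech_select(G, S, alpha=0.5, rng=rng)
print("empirical error of selected hypothesis:", np.mean(h_priv(xs) != ys))
```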

C.3. Proof of Theorem 9

The proof is very similar to the proof of Theorem 8.

First, because of the lower bound on m := |S| from (5), the computed value of κ̂ in the first step of the algorithm must satisfy κ̂ ≥ κ. Therefore, D is also κ̂-smooth with respect to U. Combining this with Lemma 19, as G is an (ε/4κ̂)-cover of (H, ρ_U), G is an (ε/4)-cover of (H, ρ_D). Moreover, as G is also an (ε/4κ̂)-packing of (H, ρ_U), from Lemma 14, the cardinality of G is at most |G| ≤ (16κ̂/ε)^{d_U}.

Define err(h) := Pr_{(x,y)∼P}[h(x) ≠ y]. Suppose that h* ∈ H minimizes err(h) over h ∈ H. Recall that from the realizability assumption, err(h*) = 0. Let g_0 ∈ G be an element of G such that ρ_D(h*, g_0) ≤ ε/4; g_0 exists as G is an (ε/4)-cover of (H, ρ_D). By the triangle inequality, we have:

$$\mathrm{err}(g_0) \;\leq\; \mathrm{err}(h^*) + \rho_D(h^*, g_0) \;\leq\; \varepsilon/4. \tag{7}$$

We define two events E_1 and E_2. Let G_1 ⊆ G be the set of all g ∈ G for which err(g) ≥ ε. The event E_1 is the event that min_{g∈G_1} err(g, S) > 9ε/10, and let Ē_1 be its complement. Applying the multiplicative Chernoff bound, for a specific g ∈ G_1,

$$\Pr_{S\sim P^m}\Big[\mathrm{err}(g,S) < \frac{9}{10}\,\mathrm{err}(g)\Big] \;\leq\; e^{-|S|\,\mathrm{err}(g)/400} \;\leq\; e^{-|S|\varepsilon/400}.$$

The quantity on the right-hand side is at most δ/(4|G|) ≤ δ/(4|G_1|) for a large enough constant C in Equation (5). Applying a union bound over all g ∈ G_1, we get that

$$\Pr_{S\sim P^m}[\bar{E}_1] \;\leq\; \delta/4. \tag{8}$$

We define E2 as the event that err(g0, S) ≤ 3ε/4, and Ē2 as its complement. From a standard multiplicative Chernoff bound, with probability at least 1 − δ/4,

$$\mathrm{err}(g_0,S) \;\leq\; \mathrm{err}(g_0) + \sqrt{\frac{3\,\mathrm{err}(g_0)\ln(4/\delta)}{|S|}} \;\leq\; \frac{\varepsilon}{4} + \sqrt{\frac{3\varepsilon}{4}\cdot\frac{\ln(4/\delta)}{|S|}} \;\leq\; \frac{\varepsilon}{4} + \sqrt{\frac{3\varepsilon}{4}\cdot\frac{\varepsilon}{3}} \;=\; \frac{3\varepsilon}{4}.$$

Thus, if |S| ≥ (3/ε) log(4/δ), which is the case due to Equation (5),

$$\Pr_{S\sim P^m}[\bar{E}_2] \;=\; \Pr_{S\sim P^m}\Big[\mathrm{err}(g_0,S) > \frac{3\varepsilon}{4}\Big] \;\leq\; \frac{\delta}{4}. \tag{9}$$

Therefore, we have

$$\begin{aligned}
\Pr_{S\sim P^m, A}[\mathrm{err}(h_A) > \varepsilon]
&\leq \Pr_{S\sim P^m, A}[\mathrm{err}(h_A) > \varepsilon \mid E_1 \cap E_2] + \Pr_{S\sim P^m}[\bar{E}_1] + \Pr_{S\sim P^m}[\bar{E}_2] \\
&\leq \Pr_{S\sim P^m, A}\Big[\mathrm{err}(h_A,S) > \mathrm{err}(g_0,S) + \Big(\frac{9}{10}-\frac{3}{4}\Big)\varepsilon \,\Big|\, E_1 \cap E_2\Big] + \delta/4 + \delta/4 \\
&= \Pr_{S\sim P^m, A}\Big[\mathrm{err}(h_A,S) > \mathrm{err}(g_0,S) + \frac{3\varepsilon}{20} \,\Big|\, E_1 \cap E_2\Big] + \delta/2 \\
&\leq |G|\exp\Big(-\frac{3\alpha\varepsilon|S|}{20}\Big) + \delta/2 \\
&\leq \Big(\frac{16\hat\kappa}{\varepsilon}\Big)^{d_U}\exp\Big(-\frac{3\alpha\varepsilon|S|}{20}\Big) + \delta/2 \;\leq\; \frac{\delta}{2} + \frac{\delta}{2} \;=\; \delta.
\end{aligned}$$

Here, the second step follows from the definition of the events E_1 and E_2 and from Equations (8) and (9), the third step follows from simple algebra, the fourth step follows from Lemma 18, the fifth step from the bound on |G|, and the final step from Equation (5).

C.4. Examples

Lemma 20

Let U be the uniform distribution over the unit sphere S^{n−1}, and let D be defined as in Example 2. Then, D is

$$\frac{1}{1 - 2\exp(-n\gamma^2/2)}\text{-smooth}$$

with respect to U.

Proof

From Ball (1997), we know that Pr_{x∼U}[x ∈ W] ≥ 1 − 2 exp(−nγ²/2). Thus, for any set A ⊆ S^{n−1}, we have

$$\Pr_{x\sim D}[x\in A] \;=\; \Pr_{x\sim D}[x\in A\cap W] \;=\; \frac{\Pr_{x\sim U}[x\in A\cap W]}{\Pr_{x\sim U}[x\in W]} \;\leq\; \frac{\Pr_{x\sim U}[x\in A]}{1 - 2\exp(-n\gamma^2/2)}.$$

This means D is κ-smooth with respect to U for

$$\kappa \;=\; \frac{1}{1 - 2\exp(-n\gamma^2/2)}.$$

Lemma 21

Let U be the uniform distribution over the unit sphere S^{n−1}, and let D be defined as in Example 3. Then, D is

$$\Big(\frac{2}{1-\gamma}\Big)^{\frac{n-1}{2}}\text{-smooth}$$

with respect to U.

Proof

From Ball (1997), we know that Pr_{x∼U}[x ∈ S^{n−1} \ W] = Pr_{x∼U}[x ∉ W] ≥ ((1 − γ)/2)^{(n−1)/2}. Therefore, for any A ⊆ S^{n−1}, we have

$$\Pr_{x\sim D}[x\in A] \;=\; \Pr_{x\sim D}[x\in A\setminus W] \;=\; \frac{\Pr_{x\sim U}[x\in A\setminus W]}{\Pr_{x\sim U}[x\in S^{n-1}\setminus W]} \;\leq\; \frac{\Pr_{x\sim U}[x\in A]}{((1-\gamma)/2)^{(n-1)/2}}.$$

This means D is κ-smooth with respect to U for

$$\kappa \;=\; \Big(\frac{2}{1-\gamma}\Big)^{\frac{n-1}{2}}.$$

Appendix D. Proofs from Section 5

D.1. Some lemmas

Lemma 22

Let S := {(x1, y1), …, (xm, ym)} ⊆ χ × {±1} be a labeled dataset of size m, α ∈ (0, 1), and k ≥ 0.

  1. If a learning algorithm A guarantees α-privacy and outputs a hypothesis from H, then for all S′ := {(x′_1, y′_1), …, (x′_m, y′_m)} ⊆ χ × {±1} with (x′_i, y′_i) = (x_i, y_i) for at least |S| − k such examples,

    $$\forall\, G\subseteq H:\quad \Pr_A[A(S)\in G] \;\geq\; \Pr_A[A(S')\in G]\cdot\exp(-k\alpha).$$

  2. If a learning algorithm A guarantees α-label privacy and outputs a hypothesis from H, then for all S′ := {(x_1, y′_1), …, (x_m, y′_m)} ⊆ χ × {±1} with y′_i = y_i for at least |S| − k such labels,

    $$\forall\, G\subseteq H:\quad \Pr_A[A(S)\in G] \;\geq\; \Pr_A[A(S')\in G]\cdot\exp(-k\alpha).$$

Proof

We prove just the first part, as the second part is similar. For a labeled dataset S′ that differs from S in at most k pairs, there exists a sequence of datasets S^{(0)}, …, S^{(ℓ)} with ℓ ≤ k such that S^{(0)} = S′, S^{(ℓ)} = S, and S^{(j)} differs from S^{(j+1)} in exactly one example for 0 ≤ j < ℓ. In this case, if A guarantees α-privacy, then for all G ⊆ H,

$$\Pr_A[A(S^{(0)})\in G] \;\leq\; \Pr_A[A(S^{(1)})\in G]\cdot e^{\alpha} \;\leq\; \Pr_A[A(S^{(2)})\in G]\cdot e^{2\alpha} \;\leq\; \cdots \;\leq\; \Pr_A[A(S^{(\ell)})\in G]\cdot e^{\ell\alpha},$$

and therefore

$$\Pr_A[A(S)\in G] \;\geq\; \Pr_A[A(S')\in G]\cdot e^{-\ell\alpha} \;\geq\; \Pr_A[A(S')\in G]\cdot e^{-k\alpha}.$$
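Lemma 22 can be checked numerically for a concrete α-label-private mechanism. The sketch below (ours) computes the exact output distribution of an exponential-mechanism selector over a finite set of thresholds, flips k labels, and verifies that no output probability changes by more than a factor of exp(kα); the score scaling a_g = α|S| err(g, S)/2 is the one used earlier in this appendix, and the data are synthetic.

```python
import numpy as np

def exp_mech_probs(thresholds, xs, ys, alpha):
    """Exact output distribution of an exponential-mechanism selector over a
    finite set of thresholds, with scores a_g = alpha * m * err(g, S) / 2."""
    errs = np.array([np.mean(np.sign(xs - t) != ys) for t in thresholds])
    a = alpha * len(ys) * errs / 2.0
    w = np.exp(-(a - a.min()))
    return w / w.sum()

# Flip k labels and compare output distributions; by the group-privacy bound of
# Lemma 22, no probability should change by more than a factor of exp(k * alpha).
rng = np.random.default_rng(3)
alpha, k = 0.2, 5
thresholds = np.linspace(-1.0, 1.0, 21)
xs = rng.uniform(-1.0, 1.0, size=200)
ys = np.sign(xs - 0.3)
ys_flipped = ys.copy()
idx = rng.choice(len(ys), size=k, replace=False)
ys_flipped[idx] = -ys_flipped[idx]

p = exp_mech_probs(thresholds, xs, ys, alpha)
q = exp_mech_probs(thresholds, xs, ys_flipped, alpha)
print(f"max ratio p/q = {np.max(p / q):.4f} <= exp(k*alpha) = {np.exp(k * alpha):.4f}")
```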

Lemma 23

There exists a constant C > 1 such that the following holds. Let H be a hypothesis class with VC dimension V, and let D be a distribution over χ. Fix any r ∈ (0, 1), and let X be an i.i.d. sample of size m from D. If

$$m \;\geq\; \frac{CV}{r}\log\frac{C}{r},$$

then the following holds with probability at least 1/2:

  1. every pair of hypotheses {h, h′} ⊆ H for which ρ_X(h, h′) > 2r has ρ_D(h, h′) > r;

  2. for all h_0 ∈ H, every (6r)-packing of (B_D(h_0, 12r), ρ_D) is a (4r)-packing of (B_X(h_0, 16r), ρ_X).

Proof

This is a consequence of Lemma 17. To show the first part, we apply Lemma 17 with ε_m = r/2.

To show the second part, we use two applications of Lemma 17. Let h and h′ be any two hypotheses in any (6r)-packing of (B_D(h_0, 12r), ρ_D); we first use Lemma 17 with ε_m = r/3 to show that for all such h and h′, ρ_X(h, h′) > 4r. Next, we need to show that every h in any (6r)-packing of (B_D(h_0, 12r), ρ_D) satisfies ρ_X(h, h_0) ≤ 16r; we show this through a second application of Lemma 17 with ε_m = r/3.

D.2. Proof of Theorem 12

We prove the contrapositive: that if ε ≤ Δ/2 and Pr_{X∼D^m, A}[A(S_{X,h*}) ∈ B_D(h*, ε)] > 1/2 for all h* ∈ H, then m > log(2^{d″−1})/α. So pick any ε ≤ Δ/2. By Lemma 15, there exist an h_0 ∈ H and P ⊆ H such that P is a (2ε)-packing of (B_D(h_0, 4ε), ρ_D) of size ≥ 2^{d″}. For any h, h′ ∈ P such that h ≠ h′, we have B_D(h, ε) ∩ B_D(h′, ε) = ∅ by the triangle inequality. Therefore for any h ∈ P and any X′ ⊆ χ of size m,

$$\Pr_A[A(S_{X',h})\notin B_D(h,\varepsilon)] \;\geq\; \sum_{h'\in P\setminus\{h\}}\Pr_A[A(S_{X',h})\in B_D(h',\varepsilon)] \;\geq\; \sum_{h'\in P\setminus\{h\}}\Pr_A[A(S_{X',h'})\in B_D(h',\varepsilon)]\cdot e^{-\alpha m},$$

where the second inequality follows by Lemma 22, because S_{X′,h} and S_{X′,h′} can differ in at most (all) m labels. Now integrating both sides with respect to X′ ∼ D^m shows that if Pr_{X∼D^m, A}[A(S_{X,h*}) ∈ B_D(h*, ε)] > 1/2 for all h* ∈ H, then for any h ∈ P,

$$\frac12 \;>\; \Pr_{X\sim D^m, A}[A(S_{X,h})\notin B_D(h,\varepsilon)] \;\geq\; \sum_{h'\in P\setminus\{h\}}\Pr_{X\sim D^m, A}[A(S_{X,h'})\in B_D(h',\varepsilon)]\cdot e^{-\alpha m} \;>\; (|P|-1)\cdot\frac12\cdot e^{-\alpha m},$$

which in turn implies m > log(|P| − 1)/α ≥ log(2^{d″} − 1)/α ≥ log(2^{d″−1})/α, as d″ is always ≥ 1.

D.3. Proof of Lemma 11

Let h_0 ∈ H and let P be an s-packing of B_X(h_0, 4s) ⊆ H. Say the algorithm A is good for h if Pr_A[A(S_{X,h}) ∈ B_X(h, s/2)] ≥ 1/2. Note that A is not good for h ∈ P if and only if Pr_A[A(S_{X,h}) ∉ B_X(h, s/2)] > 1/2. Therefore, it suffices to show that if A is good for at least |P|/2 hypotheses in P, then m ≥ log((|P|/2) − 1)/(8αs).

By the triangle inequality and the fact that P is an s-packing, B_X(h, s/2) ∩ B_X(h′, s/2) = ∅ for all h, h′ ∈ P such that h ≠ h′. Therefore for any h ∈ P,

$$\Pr_A[A(S_{X,h})\notin B_X(h,s/2)] \;\geq\; \sum_{h'\in P\setminus\{h\}}\Pr_A[A(S_{X,h})\in B_X(h',s/2)].$$

Moreover, for all h, h′ ∈ P, we have ρ_X(h, h′) ≤ ρ_X(h_0, h) + ρ_X(h_0, h′) ≤ 8s by the triangle inequality, so S_{X,h} and S_{X,h′} differ in at most 8sm labels. Therefore Lemma 22 implies

$$\Pr_A[A(S_{X,h})\in B_X(h',s/2)] \;\geq\; \Pr_A[A(S_{X,h'})\in B_X(h',s/2)]\cdot e^{-8s\alpha m}$$

for all h, h′ ∈ P. If A is good for at least |P|/2 hypotheses h′ ∈ P, then for any h ∈ P such that A is good for h, we have

$$\frac12 \;\geq\; \Pr_A[A(S_{X,h})\notin B_X(h,s/2)] \;\geq\; \sum_{h'\in P\setminus\{h\}}\Pr_A[A(S_{X,h'})\in B_X(h',s/2)]\cdot e^{-8s\alpha m} \;\geq\; \Big(\frac{|P|}{2}-1\Big)\cdot\frac12\cdot e^{-8s\alpha m},$$

which in turn implies m ≥ log((|P|/2) − 1)/(8αs).

D.4. Proof of Theorem 10

We need the following lemma.

Lemma 24

There exists a constant C > 1 such that the following holds. Let H be a hypothesis class with VC dimension V, let D be a distribution over χ, let X be an i.i.d. sample from D of size m, let A be a learning algorithm that guarantees α-label privacy and outputs a hypothesis in H, and let Δ be the diameter of (H, ρ_D). If r ∈ (0, Δ/6) and

$$\frac{CV}{r}\log\frac{C}{r} \;\leq\; m \;<\; \frac{\log(2^{d-1}-1)}{32\alpha r},$$

where d := ddim_{12r}(H, ρ_D), then there exists a hypothesis h* ∈ H such that

$$\Pr_{X\sim D^m, A}[A(S_{X,h^*})\notin B_D(h^*, r)] \;\geq\; \frac18.$$

Proof

First, assume r and m satisfy the conditions in the lemma statement, where C is the constant from Lemma 23. Also, let h_0 ∈ H and P ⊆ H be such that P is a (6r)-packing of (B_D(h_0, 12r), ρ_D) of size |P| ≥ 2^d; the existence of such an h_0 and P is guaranteed by Lemma 15.

We first define some events in the sample space of X and A. For each h ∈ H and sample X, let E_1(h, X) be the event that A(S_{X,h}) makes more than 2rm mistakes on S_{X,h} (i.e., ρ_X(h, A(S_{X,h})) > 2r).

Given a sample X, let φ(X) be a 0/1 random variable which is 1 when the following conditions hold:

  1. every pair of hypotheses {h, h′} ⊆ H for which ρ_X(h, h′) > 2r has ρ_D(h, h′) > r; and

  2. for all h_0 ∈ H, every (6r)-packing of (B_D(h_0, 12r), ρ_D) is a (4r)-packing of (B_X(h_0, 16r), ρ_X)

(i.e., the conclusion of Lemma 23). Note that conditioned on E_1(h, X) and φ(X) = 1, we have ρ_X(h, A(S_{X,h})) > 2r and thus ρ_D(h, A(S_{X,h})) > r, so Pr_{X∼D^m, A}[E_1(h, X), φ(X) = 1] ≤ Pr_{X∼D^m, A}[ρ_D(h, A(S_{X,h})) > r]. Therefore it suffices to show that there exists h* ∈ H such that Pr_{X∼D^m, A}[E_1(h*, X), φ(X) = 1] ≥ 1/8.

The lower bound on m and Lemma 23 ensure that

$$\Pr_{X\sim D^m}[\varphi(X)=1] \;\geq\; \frac12. \tag{10}$$

Also, if the unlabeled sample X is such that φ(X) = 1 holds, then the set P is a (4r)-packing of (B_X(h_0, 16r), ρ_X). Therefore, the upper bound on m and Lemma 11 (with s = 4r) imply that for all such X, there exists Q ⊆ P of size at least |P|/2 such that Pr_A[E_1(h, X)] ≥ 1/2 for all h ∈ Q. In other words, for all X such that φ(X) = 1,

$$\frac{1}{|P|}\sum_{h\in P}\Pr_A[E_1(h,X)] \;\geq\; \frac12\cdot\frac12 \;=\; \frac14. \tag{11}$$

Combining Equations (10) and (11) gives

$$\begin{aligned}
\frac{1}{|P|}\sum_{h\in P}\Pr_{X\sim D^m, A}[E_1(h,X),\ \varphi(X)=1]
&= \frac{1}{|P|}\sum_{h\in P}\mathbb{E}_{X\sim D^m}\big[\mathbf{1}[\varphi(X)=1]\cdot\Pr_A[E_1(h,X)]\big] \\
&= \mathbb{E}_{X\sim D^m}\Big[\mathbf{1}[\varphi(X)=1]\cdot\frac{1}{|P|}\sum_{h\in P}\Pr_A[E_1(h,X)]\Big] \\
&\geq \frac14\cdot\mathbb{E}_{X\sim D^m}\big[\mathbf{1}[\varphi(X)=1]\big] \\
&= \frac14\cdot\Pr_{X\sim D^m}[\varphi(X)=1] \\
&\geq \frac18.
\end{aligned}$$

Here the first step follows because φ(X) is a 0/1 random variable, the third step follows from Equation (11), and the final step follows from Equation (10).

Therefore there exists some h* ∈ P such that Pr_{X∼D^m, A}[E_1(h*, X), φ(X) = 1] ≥ 1/8.

Proof [Proof of Theorem 10]

Assume

$$\varepsilon \;<\; \frac{\Delta}{24CV\log(6C/\Delta)}, \qquad \alpha \;\leq\; \frac{\log(2^{d'-1}-1)}{32CV\log(C/\varepsilon)}, \qquad\text{and}\qquad m \;<\; \frac{\log(2^{d'-1}-1)}{32\alpha\varepsilon},$$

where C is the constant from Lemma 24. The proof is by case analysis, based on the value of m.

Case 1: m < 1/(4ε)

Since ε < Δ/2, Lemma 15 implies that there exists a pair {h, h′} ⊆ H such that ρ_D(h, h′) > 2ε but ρ_D(h, h′) ≤ 4ε. Using the bound on m and the fact that ε ≤ 1/5, we have

$$\Pr_{X\sim D^m}[\rho_X(h,h')=0] \;\geq\; (1-4\varepsilon)^m \;\geq\; (1-4\varepsilon)^{\frac{1}{4\varepsilon}} \;>\; \frac18.$$

This means that Pr_{X∼D^m}[h_A := A(S_{X,h}) = A(S_{X,h′})] ≥ 1/8. By the triangle inequality, B_D(h, ε) ∩ B_D(h′, ε) = ∅. So if, say, Pr_{X∼D^m, A}[h_A ∈ B_D(h, ε)] ≥ 1/8, then Pr_{X∼D^m, A}[h_A ∉ B_D(h′, ε)] ≥ 1/8. Therefore Pr_{X∼D^m, A}[h_A ∉ B_D(h*, ε)] ≥ 1/8 for at least one h* ∈ {h, h′}.
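The elementary inequality in the display above can be sanity-checked directly; since (1 − 4ε)^{1/(4ε)} is decreasing in ε, it suffices to evaluate it at the worst case ε = 1/5 (a quick check of ours):

```python
# Quick check (ours) of the inequality (1 - 4*eps)**(1/(4*eps)) > 1/8:
# the left-hand side is decreasing in eps, so eps = 1/5 is the worst case.
eps = 1 / 5
lhs = (1 - 4 * eps) ** (1 / (4 * eps))
print(lhs, lhs > 1 / 8)     # ~0.1337, True
```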

Case 2: 1/(4ε) ≤ m < (CV/ε) log(C/ε)

First, let r > 0 be the solution to the equation (CV/r) log(C/r) = m, so r > ε. Moreover, the bounds on m and ε imply

$$m \;\geq\; \frac{1}{4\varepsilon} \;>\; \frac{CV}{\Delta/6}\log\frac{C}{\Delta/6},$$

so r < Δ/6. Finally, using the bound on α, the definition of d′, and the fact that r > ε, we have

$$\alpha \;\leq\; \frac{\log(2^{d'-1}-1)}{32CV\log(C/\varepsilon)} \;<\; \frac{\log(2^{d''-1}-1)}{32CV\log(C/r)},$$

where d″ := ddim_{12r}(H, ρ_D); this implies

$$m \;=\; \frac{CV}{r}\log\frac{C}{r} \;\leq\; \frac{\log(2^{d''-1}-1)}{32\alpha r}.$$

The conditions of Lemma 24 are thus satisfied, which means there exists h* ∈ H such that Pr_{X∼D^m, A}[A(S_{X,h*}) ∉ B_D(h*, r)] ≥ 1/8.

Case 3: (CV/ε) log(C/ε) ≤ m < log(2^{d′−1} − 1)/(32αε)

The conditions of Lemma 24 are satisfied in this case with r := ε < Δ/6, so there exists h* ∈ H such that Pr_{X∼D^m, A}[ρ_D(h*, A(S_{X,h*})) > ε] ≥ 1/8.

D.5. Example

The following lemma shows that if D is the uniform distribution on S^{n−1}, then ddim_r(H, ρ_D) ≥ n − 2 for all scales r > 0.

Lemma 25

Let H be the class of linear separators through the origin in ℝ^n, and let D be the uniform distribution on S^{n−1}. For any u ∈ S^{n−1} and any r > 0, there exists an (r/2)-packing of (B_D(h_u, r), ρ_D) of size at least 2^{n−2}.

Proof

Let μ be the uniform distribution over S^{n−1}; notice that this is also the uniform distribution over H.

We call a pair of hypotheses h_v and h_w in H close if ρ_D(h_v, h_w) ≤ r/2. Observe that if a set of hypotheses has no close pairs, then it is an (r/2)-packing.

Using a technique due to Long (1995), we now construct an (r/2)-packing of B_D(h_u, r) by first randomly choosing hypotheses in B_D(h_u, r), and then removing hypotheses until no close pairs remain. First, we bound the probability p that two hypotheses h_v and h_w, chosen independently and uniformly at random from B_D(h_u, r), are close:

$$\begin{aligned}
p &= \Pr_{(h_v,h_w)\sim\mu^2}\big[\rho_D(h_v,h_w)\leq r/2 \;\big|\; h_v\in B_D(h_u,r),\ h_w\in B_D(h_u,r)\big] \\
&= \frac{\Pr_{(h_v,h_w)\sim\mu^2}\big[\rho_D(h_v,h_w)\leq r/2,\ h_v\in B_D(h_u,r),\ h_w\in B_D(h_u,r)\big]}{\Pr_{h_v\sim\mu}[h_v\in B_D(h_u,r)]\cdot\Pr_{h_w\sim\mu}[h_w\in B_D(h_u,r)]} \\
&\leq \frac{\Pr_{(h_v,h_w)\sim\mu^2}\big[\rho_D(h_v,h_w)\leq r/2,\ h_w\in B_D(h_u,r)\big]}{\Pr_{h_v\sim\mu}[h_v\in B_D(h_u,r)]\cdot\Pr_{h_w\sim\mu}[h_w\in B_D(h_u,r)]} \\
&= \frac{\Pr_{(h_v,h_w)\sim\mu^2}\big[h_v\in B_D(h_w,r/2),\ h_w\in B_D(h_u,r)\big]}{\Pr_{h_v\sim\mu}[h_v\in B_D(h_u,r)]\cdot\Pr_{h_w\sim\mu}[h_w\in B_D(h_u,r)]} \\
&= \frac{\Pr_{h_v\sim\mu}[h_v\in B_D(h_u,r/2)]}{\Pr_{h_v\sim\mu}[h_v\in B_D(h_u,r)]} \\
&= 2^{-(n-1)},
\end{aligned}$$

where the second-to-last equality follows by symmetry, and the last equality follows from the fact that B_D(h_u, r) corresponds to an (n−1)-dimensional spherical cap of S^{n−1}. Now, choose N := 2^{n−1} hypotheses h_{v_1}, …, h_{v_N} independently and uniformly at random from B_D(h_u, r). The expected number of close pairs among these N hypotheses is

$$M \;:=\; \binom{N}{2}\cdot p \;\leq\; \frac{2^{n-1}(2^{n-1}-1)}{2}\cdot 2^{-(n-1)}.$$

Therefore, there exist N hypotheses h_{v_1}, …, h_{v_N} in B_D(h_u, r) among which there are at most M close pairs. Removing one hypothesis from each such close pair leaves a set of at least N − M hypotheses with no close pairs; this is our (r/2)-packing of B_D(h_u, r). Since N = 2^{n−1}, the cardinality of this packing is at least

$$N - M \;\geq\; 2^{n-1} - \frac{2^{n-1}(2^{n-1}-1)}{2}\cdot 2^{-(n-1)} \;>\; 2^{n-2}.$$
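The probabilistic construction in this proof can be simulated directly. The sketch below is our own: it uses a modest dimension and candidate count rather than N = 2^{n−1}, and a greedy variant of the removal step; for linear separators through the origin under the uniform distribution on the sphere, ρ_D(h_v, h_w) = angle(v, w)/π.

```python
import numpy as np

# Illustrative simulation of the packing construction (Long's technique):
# sample unit vectors v with angle(u, v) <= pi*r, i.e. h_v in B_D(h_u, r),
# then greedily discard any vector that is "close" (rho_D <= r/2) to one
# already kept.  The dimension, radius, and candidate count are our choices.
rng = np.random.default_rng(4)
n, r, N = 8, 0.3, 128

def rho(v, w):
    """Disagreement rho_D(h_v, h_w) = angle(v, w) / pi for the uniform sphere."""
    return np.arccos(np.clip(v @ w, -1.0, 1.0)) / np.pi

def random_in_ball(u, radius):
    """Rejection-sample a uniformly random point of {v : rho(u, v) <= radius}."""
    while True:
        v = rng.normal(size=u.shape)
        v /= np.linalg.norm(v)
        if rho(u, v) <= radius:
            return v

u = np.zeros(n)
u[0] = 1.0
candidates = [random_in_ball(u, r) for _ in range(N)]

packing = []
for v in candidates:
    if all(rho(v, w) > r / 2 for w in packing):
        packing.append(v)

print(f"kept {len(packing)} of {N} candidates; all pairwise rho_D > r/2 = {r / 2}")
```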

Appendix E. Upper bounds for learning with α-label privacy

The algorithm for learning with α-label privacy, given in Figure 3, differs from the algorithms for learning with α-privacy in that it is able to use the unlabeled data itself to construct a finite set of candidate hypotheses. The algorithm and its analysis are very similar to work due to Chaudhuri et al. (2006); we give the details for completeness.
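Figure 3 is not reproduced here, but the two-stage structure just described can be sketched as follows: a greedy (ε/4)-cover of a pool of hypotheses under the empirical disagreement metric ρ_X, built from the unlabeled points only, followed by an exponential-mechanism selection over the cover, which is the only step that reads the labels. The function names, the score scaling, and the toy data below are our own illustrative choices and are not the algorithm of Figure 3.

```python
import numpy as np

def greedy_cover(pool, X, eps):
    """Greedy eps-cover (and eps-packing) of a finite pool of hypotheses under
    the empirical disagreement metric rho_X; this step uses no labels."""
    preds = np.array([h(X) for h in pool])               # |pool| x m predictions
    kept = []
    for i in range(len(pool)):
        if all(np.mean(preds[i] != preds[j]) > eps for j in kept):
            kept.append(i)
    return [pool[i] for i in kept]

def label_private_learn(pool, S, alpha, eps, rng):
    """Schematic two-stage learner: build the cover from unlabeled data, then run
    the exponential mechanism over the cover; only this last step reads labels."""
    X = np.array([x for x, _ in S])
    y = np.array([label for _, label in S])
    G = greedy_cover(pool, X, eps / 4.0)
    errs = np.array([np.mean(g(X) != y) for g in G])
    a = alpha * len(S) * errs / 2.0                      # illustrative score scaling
    w = np.exp(-(a - a.min()))
    return G[rng.choice(len(G), p=w / w.sum())]

# Toy usage with thresholds on the line (illustrative only).
rng = np.random.default_rng(5)
xs = rng.uniform(-1.0, 1.0, size=400)
ys = np.sign(xs - 0.1)
pool = [(lambda t: (lambda x: np.sign(x - t)))(t) for t in np.linspace(-1, 1, 201)]
h = label_private_learn(pool, list(zip(xs, ys)), alpha=0.5, eps=0.1, rng=rng)
print("empirical error of returned hypothesis:", np.mean(h(xs) != ys))
```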

Theorem 26

The algorithm of Figure 3 preserves α-label privacy.

Proof

The algorithm only accesses the labels in S in the final step. It follows from standard arguments in (McSherry and Talwar, 2007) that α-label privacy is guaranteed.

Theorem 27

Let P be any probability distribution over χ × {±1} whose marginal over χ is D. There exists a universal constant C > 0 such that for any α, ε, δ ∈ (0, 1), the following holds. If S ⊆ χ × {±1} is an i.i.d. random sample from P of size

$$m \;\geq\; C\cdot\Big(\frac{\eta}{\varepsilon^2}+\frac{1}{\varepsilon}\Big)\cdot\Big(V\log\frac{1}{\varepsilon}+\log\frac{1}{\delta}\Big) \;+\; \frac{C}{\alpha\varepsilon}\cdot\log\frac{\mathbb{E}_{X\sim D^m}[N_{\varepsilon/8}(H,\rho_X)]}{\delta},$$

where η := inf_{h′∈H} Pr_{(x,y)∼P}[h′(x) ≠ y] and V is the VC dimension of H, then with probability at least 1 − δ, the hypothesis h_A ∈ H returned by the algorithm on input (S, α, ε, δ) satisfies

$$\Pr_{(x,y)\sim P}[h_A(x)\neq y] \;\leq\; \eta+\varepsilon.$$

Remark 28

The first term in the sample size requirement (which depends on the VC dimension) can be replaced by distribution-based quantities used for characterizing uniform convergence, such as those based on ℓ1-covering numbers (Pollard, 1984).

Proof

Let err(h) := Pr_{(x,y)∼P}[h(x) ≠ y], and let h* ∈ H minimize err(h) over h ∈ H. Let S := {(x_1, y_1), …, (x_m, y_m)} be the i.i.d. sample drawn from P, and let X := {x_1, …, x_m} be the unlabeled components of S. Let g_0 ∈ G minimize err(g, S) over g ∈ G. Since G is an (ε/4)-cover for (H, ρ_X), we have that err(g_0, S) ≤ inf_{h′∈H} err(h′, S) + ε/4. Since G is also an (ε/4)-packing for (H, ρ_X), we have that |G| ≤ N_{ε/8}(H, ρ_X) (Pollard, 1984). Let F := {f_h : h ∈ H} where f_h(x, y) := 1[h(x) ≠ y]. We have E_P[f_h(x, y)] = err(h) and m^{−1} Σ_{(x,y)∈S} f_h(x, y) = err(h, S). Let E be the event that for all h ∈ H,

$$\mathrm{err}(h,S) \;\leq\; \mathrm{err}(h) + \sqrt{\mathrm{err}(h)\,\varepsilon_m} + \varepsilon_m \qquad\text{and}\qquad \mathrm{err}(h) \;\leq\; \mathrm{err}(h,S) + \sqrt{\mathrm{err}(h)\,\varepsilon_m},$$

where ε_m := (8V log(2em/V) + 4 log(16/δ))/m. By Lemma 16, the fact that S_F(n) = S_H(n), and union bounds, we have Pr_{S∼P^m}[E] ≥ 1 − δ/2. Now let E′ be the event that

$$\mathrm{err}(h_A, S) \;\leq\; \mathrm{err}(g_0, S) + t_m,$$

where t_m := 2 log(2 E_{X∼D^m}[|G|]/δ)/(αm). The probability of E′ can be bounded as

$$\begin{aligned}
\Pr_{S\sim P^m, A}[E'] &= 1 - \Pr_{S\sim P^m, A}[\mathrm{err}(h_A,S) > \mathrm{err}(g_0,S) + t_m] \\
&= 1 - \mathbb{E}_{S\sim P^m}\big[\Pr_A[\mathrm{err}(h_A,S) > \mathrm{err}(g_0,S) + t_m \mid S]\big] \\
&\geq 1 - \mathbb{E}_{S\sim P^m}\Big[|G|\exp\Big(-\frac{\alpha m t_m}{2}\Big)\Big] \\
&= 1 - \mathbb{E}_{X\sim D^m}[|G|]\cdot\exp\Big(-\frac{\alpha m t_m}{2}\Big) \;\geq\; 1-\frac{\delta}{2},
\end{aligned}$$

where the first inequality follows from Lemma 18, and the second inequality follows from the definition of t_m. By the union bound, Pr_{S∼P^m, A}[E ∩ E′] ≥ 1 − δ. In the event E ∩ E′, we have

$$\begin{aligned}
\mathrm{err}(h_A)-\mathrm{err}(h^*) &= \big(\mathrm{err}(h_A)-\mathrm{err}(h_A,S)\big) + \big(\mathrm{err}(h^*,S)-\mathrm{err}(h^*)\big) + \big(\mathrm{err}(h_A,S)-\mathrm{err}(h^*,S)\big) \\
&\leq \sqrt{\mathrm{err}(h_A)\,\varepsilon_m} + \sqrt{\mathrm{err}(h^*)\,\varepsilon_m} + \varepsilon_m + \mathrm{err}(g_0,S)-\mathrm{err}(h^*,S) + t_m \\
&\leq \sqrt{\mathrm{err}(h_A)\,\varepsilon_m} + \sqrt{\mathrm{err}(h^*)\,\varepsilon_m} + \varepsilon_m + \varepsilon/4 + t_m,
\end{aligned}$$

since err(g_0, S) ≤ inf_{h′∈H} err(h′, S) + ε/4 ≤ err(h*, S) + ε/4. By various algebraic manipulations, this in turn implies

$$\mathrm{err}(h_A) \;\leq\; \mathrm{err}(h^*) + C'\cdot\Big(\sqrt{\mathrm{err}(h^*)\,\varepsilon_m} + \varepsilon_m + t_m\Big) + \varepsilon/2$$

for some constant C′ > 0. The lower bound on m now implies the theorem.

Footnotes

1. Here the Õ notation hides factors logarithmic in 1/ε.

Contributor Information

Kamalika Chaudhuri, Email: kamalika@cs.ucsd.edu, University of California, San Diego, 9500 Gilman Drive #0404, La Jolla, CA 92093-0404.

Daniel Hsu, Email: dahsu@microsoft.com, Microsoft Research New England, One Memorial Drive, Cambridge, MA 02142.

References

  1. Agrawal R, Srikant R. Privacy-preserving data mining. SIGMOD Rec. 2000;29(2):439–450. doi: http://doi.acm.org/10.1145/335191.335438.
  2. Backstrom L, Dwork C, Kleinberg JM. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ, editors. WWW. ACM; 2007. pp. 181–190.
  3. Ball K. An elementary introduction to modern convex geometry. In: Levy S, editor. Flavors of Geometry. Vol. 31. 1997.
  4. Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. PODS. 2007:273–282.
  5. Beimel A, Kasiviswanathan SP, Nissim K. Bounds on the sample complexity for private learning and private data release. TCC. 2010:437–454.
  6. Blum A, Dwork C, McSherry F, Nissim K. Practical privacy: the SuLQ framework. PODS. 2005:128–138.
  7. Blum A, Ligett K, Roth A. A learning theory approach to non-interactive database privacy. In: Ladner RE, Dwork C, editors. STOC. ACM; 2008. pp. 609–618.
  8. Bshouty NH, Li Y, Long PM. Using the doubling dimension to analyze the generalization of learning algorithms. J Comput Syst Sci. 2009;75(6):323–335.
  9. Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K. Learning concept classes with privacy. Manuscript; 2006.
  10. Chaudhuri K, Monteleoni C, Sarwate A. Differentially private empirical risk minimization. Journal of Machine Learning Research. 2011.
  11. Chaudhuri K, Mishra N. When random sampling preserves privacy. In: Dwork C, editor. CRYPTO. Vol. 4117. Lecture Notes in Computer Science. Springer; 2006. pp. 198–213.
  12. Dasgupta S. Coarse sample complexity bounds for active learning. NIPS. 2005.
  13. Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. 3rd IACR Theory of Cryptography Conference; 2006. pp. 265–284.
  14. Evfimievski A, Gehrke J, Srikant R. Limiting privacy breaches in privacy preserving data mining. PODS. 2003:211–222.
  15. Freund Y, Seung HS, Shamir E, Tishby N. Selective sampling using the query by committee algorithm. Machine Learning. 1997;28(2–3):133–168.
  16. Ganta SR, Kasiviswanathan SP, Smith A. Composition attacks and auxiliary information in data privacy. KDD. 2008:265–273.
  17. Gupta A, Krauthgamer R, Lee JR. Bounded geometries, fractals, and low-distortion embeddings. FOCS. 2003:534–543.
  18. Gupta A, Hardt M, Roth A, Ullman J. Privately releasing conjunctions and the statistical query barrier. STOC. 2011.
  19. Hardt M, Talwar K. On the geometry of differential privacy. STOC. 2010:705–714.
  20. Jones R, Kumar R, Pang B, Tomkins A. "I know what you did last summer": query logs and user privacy. CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management; New York, NY, USA. ACM; 2007. pp. 909–914. doi: http://doi.acm.org/10.1145/1321440.1321573.
  21. Kasiviswanathan SP, Lee HK, Nissim K, Raskhodnikova S, Smith A. What can we learn privately? Proc. of Foundations of Computer Science. 2008.
  22. Kasiviswanathan SP, Smith A. A note on differential privacy: defining resistance to arbitrary side information. CoRR. 2008. abs/0803.3946.
  23. Kolmogorov A, Tikhomirov V. ε-entropy and ε-capacity of sets in function spaces. Translations of the American Mathematical Society. 1961;17:277–364.
  24. Long PM. On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Transactions on Neural Networks. 1995;6(6):1556–1559. doi: 10.1109/72.471352.
  25. Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. l-diversity: privacy beyond k-anonymity. Proc. of ICDE. 2006.
  26. McAllester DA. Some PAC-Bayesian theorems. COLT. 1998:230–234.
  27. McSherry F, Mironov I. Differentially private recommender systems: building privacy into the Netflix Prize contenders. KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; New York, NY, USA. ACM; 2009. pp. 627–636. doi: http://doi.acm.org/10.1145/1557019.1557090.
  28. McSherry F, Talwar K. Mechanism design via differential privacy. FOCS. 2007:94–103.
  29. Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. IEEE Symposium on Security and Privacy; Oakland, CA, USA. IEEE Computer Society; May 2008. pp. 111–125.
  30. Nissim K, Raskhodnikova S, Smith A. Smooth sensitivity and sampling in private data analysis. In: Johnson DS, Feige U, editors. STOC. ACM; 2007. pp. 75–84.
  31. Pollard D. Convergence of Stochastic Processes. Springer-Verlag; 1984.
  32. Roth A. Differential privacy and the fat-shattering dimension of linear queries. APPROX-RANDOM. 2010:683–695.
  33. Sweeney L. k-anonymity: a model for protecting privacy. Int. J. on Uncertainty, Fuzziness and Knowledge-Based Systems. 2002.
  34. Vapnik VN, Chervonenkis AY. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications. 1971;16(2):264–280.
  35. Wang R, Li YF, Wang X, Tang H, Zhou X. Learning your identity and disease from research papers: information leaks in genome wide association study. ACM Conference on Computer and Communications Security; 2009. pp. 534–544.
  36. Yao AC. Protocols for secure computations (extended abstract). FOCS. 1982:160–164.
  37. Zhou S, Ligett K, Wasserman L. Differential privacy with compression. Proc. of ISIT. 2009.
