Abstract
In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. We consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we use the integral probability metric to measure the difference between two domains. Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results.
1 Introduction
In statistical learning theory, one of the major concerns is to obtain the generalization bound of a learning process, which measures the probability that a function, chosen from a function class by an algorithm, has a sufficiently small error (cf. [1,2]). Generalization bounds have been widely used to study the consistency of the learning process [3], the asymptotic convergence of empirical process [4] and the learnability of learning models [5]. Generally, there are three essential aspects to obtain generalization bounds of a specific learning process: complexity measures of function classes, deviation (or concentration) inequalities and symmetrization inequalities related to the learning process (cf. [3, 4, 6, 7]).
It is noteworthy that the aforementioned results of statistical learning theory are all built under the assumption that training and test data are drawn from the same distribution (briefly, the assumption of same distribution). This assumption may not be valid in many practical applications, such as speech recognition [8] and natural language processing [9], in which training and test data may have different distributions. Domain adaptation has recently been proposed to handle this situation; it aims to apply a learning model, trained on samples drawn from a certain domain (the source domain), to samples drawn from another domain (the target domain) with a different distribution (cf. [10, 11, 12, 13]).
This paper is mainly concerned with two variants of domain adaptation. In the first variant, the learner receives training data from several source domains, known as domain adaptation with multiple sources (cf. [14, 15, 16, 17]). In the second variant, the learner minimizes a convex combination of the empirical source and target risks, termed domain adaptation combining source and target data (cf. [13, 18]).¹
1.1 Overview of Main Results
In this paper, we present a new framework to study generalization bounds of the learning processes for domain adaptation with multiple sources and for domain adaptation combining source and target data, respectively. Based on the resultant bounds, we then study the asymptotic behavior of the learning process for each of the two kinds of domain adaptation. There are three major aspects in the framework: the quantity that measures the difference between two domains, and the deviation inequalities and symmetrization inequalities that are both designed for the situation of domain adaptation.²
Generally, in order to obtain the generalization bounds of a learning process, it is necessary to obtain the corresponding deviation (or concentration) inequalities. For either kind of domain adaptation, we use a martingale method to develop the related Hoeffding-type deviation inequality. Moreover, in the situation of domain adaptation, since the source domain differs from the target domain, the desired symmetrization inequality for domain adaptation should incorporate some quantity to reflect the difference. We then obtain the related symmetrization inequality incorporating the integral probability metric that measures the difference between the distributions of the source and the target domains.
Next, we present the generalization bounds based on the uniform entropy number for both kinds of domain adaptation. Finally, based on the resultant bounds, we give a rigorous theoretic analysis to the asymptotic convergence and the rate of convergence of the learning processes for both types of domain adaptation. Meanwhile, we give a comparison with the related results under the assumption of same distribution. We also present numerical experiments to support our results.
1.2 Organization of the Paper
The rest of this paper is organized as follows. Section 2 introduces the problems studied in this paper. Section 3 introduces the integral probability metric that measures the difference between the distributions of two domains. We introduce the uniform entropy number for the situation of multiple sources in Section 4. In Section 5, we present the generalization bounds for domain adaptation with multiple sources, and then analyze the asymptotic behavior of the learning process for this type of domain adaptation. The last section concludes the paper. In the supplement (part A), we discuss the relationship between the integral probability metric D_F(S, T) and other quantities proposed in existing works, including the ℋ-divergence and the discrepancy distance. Proofs of the main results of this paper are provided in the supplement (part B). We study domain adaptation combining source and target data in the supplement (part C) and then give a comparison with the existing works on the theoretical analysis of domain adaptation in the supplement (part D).
2 Problem Setup
We denote Z^(S_k) := X^(S_k) × Y^(S_k) ⊂ ℝ^I × ℝ^J (1 ≤ k ≤ K) and Z^(T) := X^(T) × Y^(T) ⊂ ℝ^I × ℝ^J as the k-th source domain and the target domain, respectively. Set L = I + J. Let D^(S_k) and D^(T) stand for the distributions of the input spaces X^(S_k) (1 ≤ k ≤ K) and X^(T), respectively. Denote g^(S_k) and g^(T) as the labeling functions of Z^(S_k) (1 ≤ k ≤ K) and Z^(T), respectively. In the situation of domain adaptation with multiple sources, the distributions D^(S_k) (1 ≤ k ≤ K) and D^(T) differ from each other, or the labeling functions g^(S_k) and g^(T) differ from each other, or both cases occur. There are sufficient amounts of i.i.d. samples Z_1^(N_k) = {z_n^(S_k)}_{n=1}^{N_k} = {(x_n^(S_k), y_n^(S_k))}_{n=1}^{N_k} drawn from each source domain Z^(S_k) (1 ≤ k ≤ K), but few or no labeled samples drawn from the target domain Z^(T).
Given w = (w_1, ···, w_K) ∈ [0, 1]^K with Σ_{k=1}^K w_k = 1, let g_w ∈ G be the function that minimizes the empirical risk

E_w^N(ℓ ∘ g) := Σ_{k=1}^K (w_k/N_k) Σ_{n=1}^{N_k} ℓ(g(x_n^(S_k)), y_n^(S_k))   (1)

over G with respect to the sample sets {Z_1^(N_k)}_{k=1}^K, and it is expected that g_w will perform well on the target expected risk

E^(T)(ℓ ∘ g) := E^(T) ℓ(g(x^(T)), y^(T)),   (2)

i.e., that g_w approximates the labeling function g^(T) as precisely as possible.
In the learning process of domain adaptation with multiple sources, we are mainly interested in the following two types of quantities:

(i) E^(T)(ℓ ∘ g_w) − E_w^N(ℓ ∘ g_w), which corresponds to the estimation of the expected risk in the target domain Z^(T) from a weighted combination of the empirical risks in the multiple sources Z^(S_k) (1 ≤ k ≤ K);

(ii) E^(T)(ℓ ∘ g_w) − E^(T)(ℓ ∘ g̃*), which corresponds to the performance of the algorithm for domain adaptation with multiple sources,

where g̃* ∈ G is the function that minimizes the expected risk E^(T)(ℓ ∘ g) over G. Recalling (1) and (2), since g_w minimizes the empirical risk (1) and g̃* minimizes the expected risk (2), we have

0 ≤ E^(T)(ℓ ∘ g_w) − E^(T)(ℓ ∘ g̃*) ≤ 2 sup_{g∈G} |E^(T)(ℓ ∘ g) − E_w^N(ℓ ∘ g)|,   (3)

and thus both quantities are controlled by the same supremum. This shows that the asymptotic behaviors of the aforementioned two quantities, when the sample numbers N_1, ···, N_K go to infinity, can both be described by the supremum

sup_{g∈G} |E^(T)(ℓ ∘ g) − E_w^N(ℓ ∘ g)|,   (4)

which is the so-called generalization bound of the learning process for domain adaptation with multiple sources.
For convenience, we define the loss function class

F := { z ↦ ℓ(g(x), y) : g ∈ G },   (5)

and call F the function class in the rest of this paper. By (1) and (2), given sample sets {Z_1^(N_k)}_{k=1}^K drawn from the multiple sources {Z^(S_k)}_{k=1}^K respectively, we briefly denote, for any f ∈ F,

E^(T) f := E^(T) ℓ(g(x^(T)), y^(T)),   E_w^N f := Σ_{k=1}^K (w_k/N_k) Σ_{n=1}^{N_k} f(z_n^(S_k)).   (6)

Thus, we can equivalently rewrite the generalization bound (4) for domain adaptation with multiple sources as

sup_{f∈F} |E^(T) f − E_w^N f|.   (7)
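The weighted empirical quantity E_w^N f in (6) is simply a convex combination of per-source empirical means. A minimal sketch of this computation (the loss values and weights below are illustrative, not from the paper's data):

```python
import numpy as np

def weighted_empirical_risk(losses_per_source, w):
    """Compute E_w^N f = sum_k w_k * (1/N_k) * sum_n f(z_n^{(S_k)}).

    losses_per_source: list of K 1-D arrays; the k-th array holds the
    loss values f(z_n^{(S_k)}) on the N_k samples of source k.
    w: weight vector in [0, 1]^K summing to 1.
    """
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    per_source_means = np.array([np.mean(l) for l in losses_per_source])
    return float(np.dot(w, per_source_means))

# Example: two sources with average losses 2.0 and 5.0, equal weights.
risk = weighted_empirical_risk([np.array([1.0, 3.0]), np.array([5.0])],
                               w=[0.5, 0.5])  # 0.5*2.0 + 0.5*5.0 = 3.5
```

Note that each source contributes through its own sample mean, so the sources may have different sample sizes N_k.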
3 Integral Probability Metric
As shown in some prior works (e.g., [13, 16, 17, 18, 19, 20]), one of the major challenges in the theoretical analysis of domain adaptation is how to measure the distance between the source domain Z^(S) and the target domain Z^(T). Recall that, if Z^(S) differs from Z^(T), there are three possibilities: D^(S) differs from D^(T), or g^(S) differs from g^(T), or both cases occur. Therefore, it is necessary to consider two kinds of distances: the distance between D^(S) and D^(T), and the distance between g^(S) and g^(T).

In [13, 18], the ℋ-divergence was introduced to derive the generalization bounds based on the VC dimension under the condition of "λ-close". Mansour et al. [20] obtained the generalization bounds based on the Rademacher complexity by using the discrepancy distance. Both quantities are aimed at measuring the difference between the two distributions D^(S) and D^(T). Moreover, Mansour et al. [17] used the Rényi divergence to measure the distance between two distributions. In this paper, we use the following quantity to measure the difference between the distributions of the source and the target domains:
Definition 3.1
Given two domains Z^(S), Z^(T) ⊂ ℝ^L, let z^(S) and z^(T) be the random variables taking values from Z^(S) and Z^(T), respectively. Let F be a class of real-valued functions defined on Z^(S) ∪ Z^(T). We define

D_F(S, T) := sup_{f∈F} |E^(S) f(z^(S)) − E^(T) f(z^(T))|,   (8)

where the expectations E^(S) and E^(T) are taken with respect to the distributions D^(S) and D^(T), respectively.
The quantity D_F(S, T) is termed the integral probability metric, which plays an important role in probability theory for measuring the difference between two probability distributions (cf. [23, 24, 25, 26]). Recently, Sriperumbudur et al. [27] investigated it further and proposed empirical methods to compute the integral probability metric in practice. As mentioned by Müller [25, page 432], the quantity D_F(S, T) is a semimetric, and it is a metric if and only if the function class F separates the set of all signed measures μ with μ(Z) = 0. In particular, according to Definition 3.1, given a non-trivial function class F, the quantity D_F(S, T) is equal to zero if the domains Z^(S) and Z^(T) have the same distribution.

In the supplement (part A), we discuss the relationship between the quantity D_F(S, T) and other quantities proposed in previous works, and then show that D_F(S, T) can be bounded by the sum of the discrepancy distance and another quantity, which measure the difference between the input-space distributions D^(S) and D^(T) and the difference between the labeling functions g^(S) and g^(T), respectively.
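Definition 3.1 can be approximated from samples by replacing the two expectations with empirical means and taking the supremum over a tractable surrogate of F. The sketch below uses a small finite collection of bounded functions as a stand-in for F (an illustrative assumption; Sriperumbudur et al. [27] study principled empirical estimators):

```python
import numpy as np

def empirical_ipm(zs, zt, funcs):
    """Empirical integral probability metric over a finite function class:
    max over f in `funcs` of |mean_n f(zs_n) - mean_m f(zt_m)|."""
    return max(abs(np.mean(f(zs)) - np.mean(f(zt))) for f in funcs)

rng = np.random.default_rng(0)
zs = rng.normal(0.0, 1.0, size=500)   # samples from a "source" domain
zt = rng.normal(0.5, 1.0, size=500)   # samples from a shifted "target"
# A small finite surrogate for the function class F.
funcs = [np.sin, np.cos, np.tanh, lambda z: np.clip(z, -1, 1)]

d_st = empirical_ipm(zs, zt, funcs)   # strictly positive: the domains differ
d_ss = empirical_ipm(zs, zs, funcs)   # zero: identical samples
```

A richer surrogate class yields a tighter lower approximation of the supremum in (8); with identical samples the estimate is exactly zero, consistent with D_F(S, T) vanishing when the two domains share one distribution.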
4 The Uniform Entropy Number
Generally, the generalization bound of a certain learning process is achieved by incorporating a complexity measure of the function class, e.g., the covering number, the VC dimension or the Rademacher complexity. The results of this paper are based on the uniform entropy number, which is derived from the concept of the covering number; we refer to [22] for more details.

The covering number of a function class F is defined as follows:

Definition 4.1
Let F be a function class and let d be a metric on F. For any ξ > 0, the covering number of F at radius ξ with respect to the metric d, denoted by N(F, ξ, d), is the minimum number of balls of radius ξ (under d) needed to cover F.

In some classical results of statistical learning theory, the covering number is applied by letting d be a distribution-dependent metric. For example, as shown in Theorem 2.3 of [22], one can set d as the empirical ℓ1 norm and then derive the generalization bound of the i.i.d. learning process by incorporating the expectation of the covering number. However, in the situation of domain adaptation, we only know the information of the source domain, while this expectation depends on the distributions of both the source and the target domains because z = (x, y). Therefore, the covering number is no longer applicable in our framework for obtaining generalization bounds for domain adaptation. By contrast, the uniform entropy number is distribution-free, and thus we choose it as the complexity measure of the function class to derive the generalization bounds for domain adaptation.
For clarity of presentation, we introduce some notation for the following discussion. For any 1 ≤ k ≤ K, given a sample set Z_1^(N_k) = {z_n^(S_k)}_{n=1}^{N_k} drawn from Z^(S_k), we denote Z′_1^(N_k) = {z′_n^(S_k)}_{n=1}^{N_k} as the ghost-sample set drawn from Z^(S_k) such that the ghost sample z′_n^(S_k) has the same distribution as z_n^(S_k) for any 1 ≤ n ≤ N_k and any 1 ≤ k ≤ K. Denote Z_1^(2N) := {Z_1^(N_k), Z′_1^(N_k)}_{k=1}^K. Moreover, given any f ∈ F and any w = (w_1, ···, w_K) ∈ [0, 1]^K with Σ_{k=1}^K w_k = 1, we introduce a variant of the ℓ1 norm:

‖f‖_{ℓ_w^1(Z_1^(2N))} := Σ_{k=1}^K (w_k/N_k) Σ_{n=1}^{N_k} ( |f(z_n^(S_k))| + |f(z′_n^(S_k))| ).   (9)

It is noteworthy that this variant of the ℓ1 norm is still a norm on the functional space, which can easily be verified from the definition of a norm, so we omit the verification here. In the situation of domain adaptation with multiple sources, setting the metric d as ℓ_w^1(Z_1^(2N)), we then define the uniform entropy number of F with respect to the metric ℓ_w^1(Z_1^(2N)) as

ln N_w^1(F, ξ, 2N) := sup_{Z_1^(2N)} ln N(F, ξ, ℓ_w^1(Z_1^(2N))).   (10)
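To make the covering-number machinery concrete, the following sketch computes a cover of a finite function class under an empirical weighted ℓ1 (pseudo)metric of the kind underlying (9), via the standard greedy construction. The specific class (threshold functions) and samples are illustrative assumptions, and a greedy cover only upper-bounds the minimum cover size of Definition 4.1:

```python
import numpy as np

def weighted_l1_dist(fa, fb, samples, w):
    """Empirical weighted l1 distance between two functions:
    sum_k (w_k / N_k) * sum_n |fa(z) - fb(z)| over source k's samples."""
    return sum(wk * np.mean(np.abs(fa(zk) - fb(zk)))
               for wk, zk in zip(w, samples))

def greedy_cover_size(funcs, samples, w, xi):
    """Greedily pick centers until every f is within xi of some center."""
    centers = []
    for f in funcs:
        if not any(weighted_l1_dist(f, c, samples, w) <= xi for c in centers):
            centers.append(f)
    return len(centers)

rng = np.random.default_rng(1)
samples = [rng.normal(size=50), rng.normal(size=80)]  # K = 2 sources
w = [0.5, 0.5]
# Finite class: threshold functions f_t(z) = 1[z <= t].
funcs = [(lambda t: (lambda z: (z <= t).astype(float)))(t)
         for t in np.linspace(-2, 2, 41)]
n_cover = greedy_cover_size(funcs, samples, w, xi=0.25)
```

Taking the supremum of such cover sizes over all sample sets (and their ghost samples) gives the distribution-free uniform entropy number in (10).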
5 Domain Adaptation with Multiple Sources
In this section, we present the generalization bound for domain adaptation with multiple sources. Based on the resultant bound, we then analyze the asymptotic convergence and the rate of convergence of the learning process for this kind of domain adaptation.
5.1 Generalization Bounds for Domain Adaptation with Multiple Sources
Based on the aforementioned uniform entropy number, the generalization bound for domain adaptation with multiple sources is presented in the following theorem:
Theorem 5.1
Assume that F is a function class consisting of bounded functions with range [a, b]. Let w = (w_1, ···, w_K) ∈ [0, 1]^K with Σ_{k=1}^K w_k = 1. Then, given an arbitrary ξ > D_F^(w)(S, T), we have for any N_1, ···, N_K ∈ ℕ and any ε > 0, with probability at least 1 − ε,

sup_{f∈F} |E^(T) f − E_w^N f| ≤ D_F^(w)(S, T) + ( 32(b−a)² (Σ_{k=1}^K w_k²/N_k) ( ln N_w^1(F, ξ′/8, 2N) − ln(ε/8) ) )^{1/2},   (11)

where ξ′ := ξ − D_F^(w)(S, T) and

D_F^(w)(S, T) := Σ_{k=1}^K w_k D_F(S_k, T).   (12)

In the above theorem, we show that the generalization bound sup_{f∈F} |E^(T) f − E_w^N f| can be bounded by the right-hand side of (11). Compared to the classical result under the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of [22]): with probability at least 1 − ε,

sup_{f∈F} |E f − E_N f| ≤ ( (32(b−a)²/N) ( ln N_1(F, ξ/8, 2N) − ln(ε/8) ) )^{1/2},   (13)

with E_N f being the empirical risk with respect to the sample set Z_1^N, there is an additional discrepancy quantity D_F^(w)(S, T) that is determined by two factors: the choice of w and the quantities D_F(S_k, T) (1 ≤ k ≤ K). The two results coincide if every source domain matches the target domain, i.e., if D_F(S_k, T) = 0 holds for any 1 ≤ k ≤ K.
In order to prove this result, we develop the related Hoeffding-type deviation inequality and the symmetrization inequality for domain adaptation with multiple sources. The detailed proof is provided in the supplement (part B). By using the resultant bound (11), we can analyze the asymptotic behavior of the learning process for domain adaptation with multiple sources.
5.2 Asymptotic Convergence
In statistical learning theory, it is well known that the complexity of the function class is the main factor in the asymptotic convergence of the learning process under the assumption of same distribution (cf. [3, 4, 22]).

Theorem 5.1 directly leads to the following theorem, which establishes the asymptotic convergence of the learning process for domain adaptation with multiple sources:
Theorem 5.2
Assume that F is a function class consisting of bounded functions with range [a, b]. Let w = (w_1, ···, w_K) ∈ [0, 1]^K with Σ_{k=1}^K w_k = 1. If the following condition holds:

lim_{N_1,···,N_K→+∞} ( ln N_w^1(F, ξ′/8, 2N) ) · ( Σ_{k=1}^K w_k²/N_k ) < +∞   (14)

with ξ′ := ξ − D_F^(w)(S, T), then we have for any ξ > D_F^(w)(S, T),

lim_{N_1,···,N_K→+∞} Pr{ sup_{f∈F} |E^(T) f − E_w^N f| > ξ } = 0.   (15)

As shown in Theorem 5.2, if the choice of w ∈ [0, 1]^K and the uniform entropy number ln N_w^1(F, ξ′/8, 2N) satisfy the condition (14) with ξ′ = ξ − D_F^(w)(S, T), the probability of the event "sup_{f∈F} |E^(T) f − E_w^N f| > ξ" will converge to zero for any ξ > D_F^(w)(S, T), when the sample numbers N_1, ···, N_K of the multiple sources go to infinity. This is partially in accordance with the classical result on the asymptotic convergence of the learning process under the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of [22]): the probability of the event "sup_{f∈F} |E f − E_N f| > ξ" will converge to zero for any ξ > 0, if the uniform entropy number ln N_1(F, ξ, N) satisfies the following condition:

lim_{N→+∞} ln N_1(F, ξ, N) / N < +∞.   (16)

Note that in the learning process of domain adaptation with multiple sources, the uniform convergence of the empirical risk on the source domains to the expected risk on the target domain may not hold, because the limit (15) does not hold for any ξ > 0 but only for any ξ > D_F^(w)(S, T). By contrast, the corresponding limit holds for all ξ > 0 in the learning process under the assumption of same distribution, if the condition (16) is satisfied.
By the Cauchy–Schwarz inequality, setting w_k = N_k/N (1 ≤ k ≤ K) minimizes the second term of the right-hand side of (11), and we then arrive at

sup_{f∈F} |E^(T) f − E_w^N f| ≤ D_F^(w)(S, T) + ( (32(b−a)²/N) ( ln N_w^1(F, ξ′/8, 2N) − ln(ε/8) ) )^{1/2},   (17)

which implies that setting w_k = N_k/N results in the fastest rate of convergence; our numerical experiments presented in the next section also support this point (cf. Fig. 1).
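The Cauchy–Schwarz step here is that Σ_k w_k²/N_k ≥ (Σ_k w_k)²/Σ_k N_k = 1/N, with equality exactly when w_k ∝ N_k. A quick numerical check of this fact (the sample sizes below are chosen arbitrarily):

```python
import numpy as np

N = np.array([200.0, 600.0, 1200.0])          # arbitrary N_1, ..., N_K
w_star = N / N.sum()                          # proportional weights

def variance_factor(w):
    """The w-dependent factor sum_k w_k^2 / N_k in the bound."""
    return float(np.sum(np.asarray(w) ** 2 / N))

# Proportional weights attain the lower bound 1/N ...
assert np.isclose(variance_factor(w_star), 1.0 / N.sum())
# ... and beat any other weighting drawn from the simplex.
rng = np.random.default_rng(2)
for _ in range(1000):
    w = rng.dirichlet(np.ones(len(N)))
    assert variance_factor(w) >= variance_factor(w_star) - 1e-12
```

Intuitively, larger sources carry more reliable empirical means, so they should receive proportionally larger weight.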
Figure 1. Domain Adaptation with Multiple Sources.
6 Numerical Experiments
We performed numerical experiments to verify the theoretical analysis of the asymptotic convergence of the learning process for domain adaptation with multiple sources. Without loss of generality, we only consider the case of K = 2, i.e., there are two source domains and one target domain. The experiment data are generated in the following way:
For the target domain Z^(T) = X^(T) × Y^(T) ⊂ ℝ^100 × ℝ, we consider D^(T) as a Gaussian distribution N(0, 1) and draw the inputs {x_n^(T)}_{n=1}^{N_T} from D^(T) randomly and independently. Let β ∈ ℝ^100 be a random vector and let R be a noise term. For any 1 ≤ n ≤ N_T, we randomly draw β and R from N(1, 5) and N(0, 0.01), respectively, and then generate y_n^(T) as follows:

y_n^(T) = ⟨x_n^(T), β⟩ + R.   (18)

The derived samples {(x_n^(T), y_n^(T))}_{n=1}^{N_T} are the samples of the target domain Z^(T) and will be used as the test data.

In a similar way, we derive the sample set {(x_n^(S_1), y_n^(S_1))}_{n=1}^{N_1} of the source domain Z^(S_1) = X^(S_1) × Y^(S_1) ⊂ ℝ^100 × ℝ: for any 1 ≤ n ≤ N_1,

y_n^(S_1) = ⟨x_n^(S_1), β⟩ + R,   (19)

where x_n^(S_1) is drawn from the input distribution D^(S_1), β ~ N(1, 5) and R ~ N(0, 0.5).

For the source domain Z^(S_2) = X^(S_2) × Y^(S_2) ⊂ ℝ^100 × ℝ, the samples {(x_n^(S_2), y_n^(S_2))}_{n=1}^{N_2} are generated in the following way: for any 1 ≤ n ≤ N_2,

y_n^(S_2) = ⟨x_n^(S_2), β⟩ + R,   (20)

where x_n^(S_2) is drawn from the input distribution D^(S_2), β ~ N(1, 5) and R ~ N(0, 0.5).
In this experiment, we use the method of least squares regression to minimize the empirical risk

(w/N_1) Σ_{n=1}^{N_1} (⟨x_n^(S_1), g⟩ − y_n^(S_1))² + ((1−w)/N_2) Σ_{n=1}^{N_2} (⟨x_n^(S_2), g⟩ − y_n^(S_2))²   (21)

for different combination coefficients w ∈ {0.1, 0.3, 0.5, 0.9}, and then compute the discrepancy |E^(T)(ℓ ∘ g_w) − E_w^N(ℓ ∘ g_w)| for each N_1 + N_2. Initially, N_1 and N_2 both equal 200. Each test is repeated 30 times, and the final result is the average of the 30 results. After each test, we increment both N_1 and N_2 by 200 until N_1 = N_2 = 2000. The experiment results are shown in Fig. 1.
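This procedure can be sketched at a small scale as follows. The linear data model y = ⟨x, β⟩ + noise, the dimension d = 20, the input shifts, and the sample sizes are all illustrative assumptions of this sketch, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n1, n2, nt = 20, 200, 200, 1000
beta = rng.normal(1.0, np.sqrt(5.0), size=d)   # fixed ground-truth coefficients

def make_domain(mean, n):
    """Linear model with Gaussian inputs (shifted mean) and additive noise."""
    x = rng.normal(mean, 1.0, size=(n, d))
    y = x @ beta + rng.normal(0.0, 0.5, size=n)
    return x, y

x1, y1 = make_domain(0.3, n1)    # source 1 (inputs shifted one way)
x2, y2 = make_domain(-0.3, n2)   # source 2 (inputs shifted the other way)
xt, yt = make_domain(0.0, nt)    # target (test data)

def fit_and_discrepancy(w):
    """Weighted least squares on the sources; gap between the target risk
    and the w-weighted empirical source risk of the fitted model."""
    # Row-weight each source so source k contributes w_k * (1/N_k) * sum (...)^2.
    sw = np.concatenate([np.full(n1, w / n1), np.full(n2, (1 - w) / n2)])
    X = np.vstack([x1, x2]) * np.sqrt(sw)[:, None]
    Y = np.concatenate([y1, y2]) * np.sqrt(sw)
    g = np.linalg.lstsq(X, Y, rcond=None)[0]
    src = w * np.mean((x1 @ g - y1) ** 2) + (1 - w) * np.mean((x2 @ g - y2) ** 2)
    tgt = np.mean((xt @ g - yt) ** 2)
    return abs(tgt - src)

gaps = {w: fit_and_discrepancy(w) for w in (0.1, 0.3, 0.5, 0.9)}
```

Repeating this over growing N_1, N_2 and averaging over trials produces curves of the discrepancy against N_1 + N_2, one per choice of w.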
From Fig. 1, we can observe that for any choice of w, the discrepancy curve decreases as N_1 + N_2 increases, which is in accordance with the results presented in Theorems 5.1 & 5.2. Moreover, when w = 0.5, the discrepancy has the fastest rate of convergence, and the rate becomes slower as w moves further away from 0.5. In this experiment, we set N_1 = N_2, which implies that N_2/(N_1 + N_2) = 0.5. Recalling (17), we have shown that w = N_2/(N_1 + N_2) provides the fastest rate of convergence, and this proposition is supported by the experiment results shown in Fig. 1.
7 Conclusion
In this paper, we present a new framework to study the generalization bounds of the learning process for domain adaptation. We use the integral probability metric to measure the difference between the distributions of two domains. Then, we use a martingale method to develop the specific deviation inequality and the symmetrization inequality incorporating the integral probability metric. Next, we utilize the resultant deviation inequality and symmetrization inequality to derive the generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation.
We point out that the asymptotic convergence of the learning process is determined by the complexity of the function class F, measured by the uniform entropy number. This is partially in accordance with the classical result on the asymptotic convergence of the learning process under the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of [22]). Moreover, the rate of convergence of this learning process matches that of the learning process under the assumption of same distribution. The numerical experiments support our results. Finally, we give a comparison with the previous works [13, 14, 15, 16, 17, 18, 20] (cf. supplement, part D).
It is noteworthy that by Theorem 2.18 of [22], the generalization bound (11) can lead to the result based on the fat-shattering dimension. According to Theorem 2.6.4 of [4], the bound based on the VC dimension can also be obtained from the result (11). In our future work, we will attempt to find a new distance between two distributions and develop the generalization bounds based on other complexity measures, e.g., Rademacher complexities, and analyze other theoretical properties of domain adaptation.
Acknowledgments
This research is sponsored in part by NSF (IIS-0953662, CCF-1025177), NIH (LM010730), and ONR (N00014-1-1-0108).
Footnotes
1. Due to the page limit, the discussion on domain adaptation combining source and target data is provided in the supplement (part C).

2. Due to the page limit, we only present the generalization bounds for domain adaptation with multiple sources; the discussions of the corresponding deviation inequalities and symmetrization inequalities are provided in the supplement (part B), along with the proofs of the main results.
Contributor Information
Chao Zhang, Email: czhan117@asu.edu.
Lei Zhang, Email: zhanglei.njust@yahoo.com.cn.
Jieping Ye, Email: jieping.ye@asu.edu.
References
- 1. Vapnik VN. An overview of statistical learning theory. IEEE Transactions on Neural Networks. 1999;10(5):988–999. doi: 10.1109/72.788640.
- 2. Bousquet O, Boucheron S, Lugosi G. Introduction to statistical learning theory. In: Bousquet O, et al., editors. Advanced Lectures on Machine Learning. 2004. pp. 169–207.
- 3. Vapnik VN. Statistical Learning Theory. New York: John Wiley and Sons; 1998.
- 4. van der Vaart A, Wellner J. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; 2000.
- 5. Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK. Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM. 1989;36(4):929–965.
- 6. Bartlett PL, Bousquet O, Mendelson S. Local Rademacher complexities. Annals of Statistics. 2005;33:1497–1537.
- 7. Hussain Z, Shawe-Taylor J. Improved loss bounds for multiple kernel learning. Journal of Machine Learning Research - Proceedings Track. 2011;15:370–377.
- 8. Jiang J, Zhai C. Instance weighting for domain adaptation in NLP. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL); 2007. pp. 264–271.
- 9. Blitzer J, Dredze M, Pereira F. Biographies, bollywood, boomboxes and blenders: domain adaptation for sentiment classification. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL); 2007. pp. 440–447.
- 10. Bickel S, Brückner M, Scheffer T. Discriminative learning for differing training and test distributions. Proceedings of the 24th International Conference on Machine Learning (ICML); 2007. pp. 81–88.
- 11. Wu P, Dietterich TG. Improving SVM accuracy by training on auxiliary data sources. Proceedings of the 21st International Conference on Machine Learning (ICML); 2004. pp. 871–878.
- 12. Blitzer J, McDonald R, Pereira F. Domain adaptation with structural correspondence learning. Conference on Empirical Methods in Natural Language Processing (EMNLP); 2006. pp. 120–128.
- 13. Ben-David S, Blitzer J, Crammer K, Kulesza A, Pereira F, Wortman J. A theory of learning from different domains. Machine Learning. 2010;79:151–175.
- 14. Crammer K, Kearns M, Wortman J. Learning from multiple sources. Advances in Neural Information Processing Systems (NIPS); 2006.
- 15. Crammer K, Kearns M, Wortman J. Learning from multiple sources. Journal of Machine Learning Research. 2008;9:1757–1774.
- 16. Mansour Y, Mohri M, Rostamizadeh A. Domain adaptation with multiple sources. Advances in Neural Information Processing Systems (NIPS); 2008. pp. 1041–1048.
- 17. Mansour Y, Mohri M, Rostamizadeh A. Multiple source adaptation and the Rényi divergence. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI); 2009.
- 18. Blitzer J, Crammer K, Kulesza A, Pereira F, Wortman J. Learning bounds for domain adaptation. Advances in Neural Information Processing Systems (NIPS); 2007.
- 19. Ben-David S, Blitzer J, Crammer K, Pereira F. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems (NIPS); 2006. pp. 137–144.
- 20. Mansour Y, Mohri M, Rostamizadeh A. Domain adaptation: learning bounds and algorithms. Conference on Learning Theory (COLT); 2009.
- 21. Hoeffding W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association. 1963;58(301):13–30.
- 22. Mendelson S. A few notes on statistical learning theory. Lecture Notes in Computer Science. 2003;2600:1–40.
- 23. Zolotarev VM. Probability metrics. Theory of Probability and its Applications. 1984;28(1):278–302.
- 24. Rachev ST. Probability Metrics and the Stability of Stochastic Models. John Wiley and Sons; 1991.
- 25. Müller A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability. 1997;29(2):429–443.
- 26. Reid MD, Williamson RC. Information, divergence and risk for binary experiments. Journal of Machine Learning Research. 2011;12:731–817.
- 27. Sriperumbudur BK, Gretton A, Fukumizu K, Lanckriet GRG, Schölkopf B. A note on integral probability metrics and φ-divergences. CoRR. 2009;abs/0901.2698.

