Abstract
In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. We consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we use the integral probability metric to measure the difference between two domains. Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results.
1 Introduction
In statistical learning theory, one of the major concerns is to obtain the generalization bound of a learning process, which measures the probability that a function, chosen from a function class by an algorithm, has a sufficiently small error (cf. [1,2]). Generalization bounds have been widely used to study the consistency of the learning process [3], the asymptotic convergence of empirical process [4] and the learnability of learning models [5]. Generally, there are three essential aspects to obtain generalization bounds of a specific learning process: complexity measures of function classes, deviation (or concentration) inequalities and symmetrization inequalities related to the learning process (cf. [3, 4, 6, 7]).
It is noteworthy that the aforementioned results of statistical learning theory are all built under the assumption that training and test data are drawn from the same distribution (briefly, the assumption of same distribution). This assumption may not be valid in many practical applications, such as speech recognition [8] and natural language processing [9], in which training and test data may have different distributions. Domain adaptation has recently been proposed to handle this situation; it aims to apply a learning model, trained on samples drawn from a certain domain (the source domain), to samples drawn from another domain (the target domain) with a different distribution (cf. [10, 11, 12, 13]).
This paper is mainly concerned with two variants of domain adaptation. In the first variant, the learner receives training data from several source domains, known as domain adaptation with multiple sources (cf. [14, 15, 16, 17]). In the second variant, the learner minimizes a convex combination of the empirical source and target risks, termed domain adaptation combining source and target data (cf. [13, 18]).¹
1.1 Overview of Main Results
In this paper, we present a new framework to study generalization bounds of the learning processes for domain adaptation with multiple sources and for domain adaptation combining source and target data, respectively. Based on the resultant bounds, we then study the asymptotic behavior of the learning process for each of the two kinds of domain adaptation. There are three major aspects in the framework: the quantity that measures the difference between two domains, and the deviation inequalities and symmetrization inequalities that are both designed for the situation of domain adaptation.²
Generally, in order to obtain the generalization bounds of a learning process, it is necessary to obtain the corresponding deviation (or concentration) inequalities. For either kind of domain adaptation, we use a martingale method to develop the related Hoeffding-type deviation inequality. Moreover, in the situation of domain adaptation, since the source domain differs from the target domain, the desired symmetrization inequality for domain adaptation should incorporate some quantity to reflect the difference. We then obtain the related symmetrization inequality incorporating the integral probability metric that measures the difference between the distributions of the source and the target domains.
Next, we present the generalization bounds based on the uniform entropy number for both kinds of domain adaptation. Finally, based on the resultant bounds, we give a rigorous theoretic analysis to the asymptotic convergence and the rate of convergence of the learning processes for both types of domain adaptation. Meanwhile, we give a comparison with the related results under the assumption of same distribution. We also present numerical experiments to support our results.
1.2 Organization of the Paper
The rest of this paper is organized as follows. Section 2 introduces the problems studied in this paper. Section 3 introduces the integral probability metric that measures the difference between the distributions of two domains. We introduce the uniform entropy number for the situation of multiple sources in Section 4. In Section 5, we present the generalization bounds for domain adaptation with multiple sources, and then analyze the asymptotic behavior of the learning process for this type of domain adaptation. The last section concludes the paper. In the supplement (part A), we discuss the relationship between the integral probability metric D_F(S, T) and other quantities proposed in existing works, including the ℋ-divergence and the discrepancy distance. Proofs of the main results of this paper are provided in the supplement (part B). We study domain adaptation combining source and target data in the supplement (part C) and then give a comparison with the existing works on the theoretical analysis of domain adaptation in the supplement (part D).
2 Problem Setup
We denote Z^(S_k) := X^(S_k) × Y^(S_k) ⊂ ℝ^I × ℝ^J (1 ≤ k ≤ K) and Z^(T) := X^(T) × Y^(T) ⊂ ℝ^I × ℝ^J as the k-th source domain and the target domain, respectively. Set L = I + J. Let D^(S_k) and D^(T) stand for the distributions of the input spaces X^(S_k) (1 ≤ k ≤ K) and X^(T), respectively. Denote g^(S_k) and g^(T) as the labeling functions of Z^(S_k) (1 ≤ k ≤ K) and Z^(T), respectively. In the situation of domain adaptation with multiple sources, the distributions D^(S_k) (1 ≤ k ≤ K) and D^(T) differ from each other, or the labeling functions g^(S_k) and g^(T) differ from each other, or both cases occur. There are sufficient amounts of i.i.d. samples Z_1^(N_k) = {z_n^(S_k)}_{n=1}^{N_k} = {(x_n^(S_k), y_n^(S_k))}_{n=1}^{N_k} drawn from each source domain Z^(S_k) (1 ≤ k ≤ K), but few or no labeled samples drawn from the target domain Z^(T).
Given w = (w_1, ···, w_K) ∈ [0, 1]^K with Σ_{k=1}^K w_k = 1, let g_w ∈ G be the function that minimizes the empirical risk

E_w^N(ℓ ∘ g) := Σ_{k=1}^K (w_k/N_k) Σ_{n=1}^{N_k} ℓ(g(x_n^(S_k)), y_n^(S_k))   (1)

over G with respect to the sample sets {Z_1^(N_k)}_{k=1}^K, and it is expected that g_w will perform well on the target expected risk

E^(T)(ℓ ∘ g) := E^(T) ℓ(g(x^(T)), y^(T)),   (2)

i.e., that g_w approximates the labeling function g^(T) as precisely as possible.
In the learning process of domain adaptation with multiple sources, we are mainly interested in the following two types of quantities:

(i) E^(T)(ℓ ∘ g_w) − E_w^N(ℓ ∘ g_w), which corresponds to the estimation of the expected risk in the target domain Z^(T) from a weighted combination of the empirical risks in the multiple sources Z^(S_k) (1 ≤ k ≤ K);

(ii) E^(T)(ℓ ∘ g_w) − E^(T)(ℓ ∘ g̃*), which corresponds to the performance of the algorithm for domain adaptation with multiple sources,

where g̃* ∈ G is the function that minimizes the expected risk E^(T)(ℓ ∘ g) over G. Recalling (1) and (2), since g_w minimizes the empirical risk (1) and g̃* minimizes the expected risk (2), we have

0 ≤ E^(T)(ℓ ∘ g_w) − E^(T)(ℓ ∘ g̃*) ≤ 2 sup_{g∈G} |E^(T)(ℓ ∘ g) − E_w^N(ℓ ∘ g)|,   (3)

and thus both quantities are controlled by the same supremum. This shows that the asymptotic behaviors of the aforementioned two quantities, when the sample numbers N_1, ···, N_K go to infinity, can both be described by the supremum

sup_{g∈G} |E^(T)(ℓ ∘ g) − E_w^N(ℓ ∘ g)|,   (4)

which is the so-called generalization bound of the learning process for domain adaptation with multiple sources.
For convenience, we define the loss function class

F := { z ↦ ℓ(g(x), y) : g ∈ G },   (5)

and call F the function class in the rest of this paper. By (1) and (2), given sample sets {Z_1^(N_k)}_{k=1}^K drawn from the multiple sources {Z^(S_k)}_{k=1}^K respectively, we briefly denote, for any f ∈ F,

E^(T) f := E^(T) ℓ(g(x^(T)), y^(T)),   E_w^N f := Σ_{k=1}^K (w_k/N_k) Σ_{n=1}^{N_k} f(z_n^(S_k)).   (6)

Thus, we can equivalently rewrite the generalization bound (4) for domain adaptation with multiple sources as

sup_{f∈F} |E^(T) f − E_w^N f|.   (7)
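The weighted empirical quantity E_w^N f in (6) is simply a convex combination of per-source empirical means. A minimal sketch of this computation (the loss values and weights below are illustrative, not from the paper's data):

```python
import numpy as np

def weighted_empirical_risk(losses_per_source, w):
    """Compute E_w^N f = sum_k w_k * (1/N_k) * sum_n f(z_n^{(S_k)}).

    losses_per_source: list of K 1-D arrays; the k-th array holds the
    loss values f(z_n^{(S_k)}) on the N_k samples of source k.
    w: weight vector in [0, 1]^K summing to 1.
    """
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    per_source_means = np.array([np.mean(l) for l in losses_per_source])
    return float(np.dot(w, per_source_means))

# Example: two sources with average losses 2.0 and 5.0, equal weights.
risk = weighted_empirical_risk([np.array([1.0, 3.0]), np.array([5.0])],
                               w=[0.5, 0.5])  # 0.5*2.0 + 0.5*5.0 = 3.5
```

Note that each source contributes through its own sample mean, so the sources may have different sample sizes N_k.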
3 Integral Probability Metric
As shown in some prior works (e.g., [13, 16, 17, 18, 19, 20]), one of the major challenges in the theoretical analysis of domain adaptation is how to measure the distance between the source domain Z^(S) and the target domain Z^(T). Recall that, if Z^(S) differs from Z^(T), there are three possibilities: D^(S) differs from D^(T), or g^(S) differs from g^(T), or both cases occur. Therefore, it is necessary to consider two kinds of distances: the distance between D^(S) and D^(T), and the distance between g^(S) and g^(T).

In [13, 18], the ℋ-divergence was introduced to derive the generalization bounds based on the VC dimension under the condition of "λ-close". Mansour et al. [20] obtained the generalization bounds based on the Rademacher complexity by using the discrepancy distance. Both quantities are aimed at measuring the difference between the two distributions D^(S) and D^(T). Moreover, Mansour et al. [17] used the Rényi divergence to measure the distance between two distributions. In this paper, we use the following quantity to measure the difference between the distributions of the source and the target domains:
Definition 3.1
Given two domains Z^(S), Z^(T) ⊂ ℝ^L, let z^(S) and z^(T) be the random variables taking values from Z^(S) and Z^(T), respectively. Let F be a class of real-valued functions defined on Z^(S) ∪ Z^(T). We define

D_F(S, T) := sup_{f∈F} |E^(S) f(z^(S)) − E^(T) f(z^(T))|,   (8)

where the expectations E^(S) and E^(T) are taken with respect to the distributions D^(S) and D^(T), respectively.
The quantity D_F(S, T) is termed the integral probability metric, which plays an important role in probability theory for measuring the difference between two probability distributions (cf. [23, 24, 25, 26]). Recently, Sriperumbudur et al. [27] investigated it further and proposed empirical methods to compute the integral probability metric in practice. As mentioned by Müller [25, page 432], the quantity D_F(S, T) is a semimetric, and it is a metric if and only if the function class F separates the set of all signed measures μ with μ(Z) = 0. In particular, according to Definition 3.1, given a non-trivial function class F, the quantity D_F(S, T) is equal to zero if the domains Z^(S) and Z^(T) have the same distribution.

In the supplement (part A), we discuss the relationship between the quantity D_F(S, T) and other quantities proposed in previous works, and then show that D_F(S, T) can be bounded by the sum of the discrepancy distance and another quantity, which measure the difference between the input-space distributions D^(S) and D^(T) and the difference between the labeling functions g^(S) and g^(T), respectively.
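Definition 3.1 can be approximated from samples by replacing the two expectations with empirical means and taking the supremum over a tractable surrogate of F. The sketch below uses a small finite collection of bounded functions as a stand-in for F (an illustrative assumption; Sriperumbudur et al. [27] study principled empirical estimators):

```python
import numpy as np

def empirical_ipm(zs, zt, funcs):
    """Empirical integral probability metric over a finite function class:
    max over f in `funcs` of |mean_n f(zs_n) - mean_m f(zt_m)|."""
    return max(abs(np.mean(f(zs)) - np.mean(f(zt))) for f in funcs)

rng = np.random.default_rng(0)
zs = rng.normal(0.0, 1.0, size=500)   # samples from a "source" domain
zt = rng.normal(0.5, 1.0, size=500)   # samples from a shifted "target"
# A small finite surrogate for the function class F.
funcs = [np.sin, np.cos, np.tanh, lambda z: np.clip(z, -1, 1)]

d_st = empirical_ipm(zs, zt, funcs)   # strictly positive: the domains differ
d_ss = empirical_ipm(zs, zs, funcs)   # zero: identical samples
```

A richer surrogate class yields a tighter lower approximation of the supremum in (8); with identical samples the estimate is exactly zero, consistent with D_F(S, T) vanishing when the two domains share one distribution.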
4 The Uniform Entropy Number
Generally, the generalization bound of a certain learning process is achieved by incorporating a complexity measure of the function class, e.g., the covering number, the VC dimension or the Rademacher complexity. The results of this paper are based on the uniform entropy number, which is derived from the concept of the covering number; we refer to [22] for more details.

The covering number of a function class F is defined as follows:

Definition 4.1
Let F be a function class and let d be a metric on F. For any ξ > 0, the covering number of F at radius ξ with respect to the metric d, denoted by N(F, ξ, d), is the minimum number of balls of radius ξ (under d) needed to cover F.

In some classical results of statistical learning theory, the covering number is applied by letting d be a distribution-dependent metric. For example, as shown in Theorem 2.3 of [22], one can set d as the empirical ℓ1 norm and then derive the generalization bound of the i.i.d. learning process by incorporating the expectation of the covering number. However, in the situation of domain adaptation, we only know the information of the source domain, while this expectation depends on the distributions of both the source and the target domains because z = (x, y). Therefore, the covering number is no longer applicable in our framework for obtaining generalization bounds for domain adaptation. By contrast, the uniform entropy number is distribution-free, and thus we choose it as the complexity measure of the function class to derive the generalization bounds for domain adaptation.
For clarity of presentation, we introduce some notation for the following discussion. For any 1 ≤ k ≤ K, given a sample set Z_1^(N_k) = {z_n^(S_k)}_{n=1}^{N_k} drawn from Z^(S_k), we denote Z′_1^(N_k) = {z′_n^(S_k)}_{n=1}^{N_k} as the ghost-sample set drawn from Z^(S_k) such that the ghost sample z′_n^(S_k) has the same distribution as z_n^(S_k) for any 1 ≤ n ≤ N_k and any 1 ≤ k ≤ K. Denote Z_1^(2N) := {Z_1^(N_k), Z′_1^(N_k)}_{k=1}^K. Moreover, given any f ∈ F and any w = (w_1, ···, w_K) ∈ [0, 1]^K with Σ_{k=1}^K w_k = 1, we introduce a variant of the ℓ1 norm:

‖f‖_{ℓ_w^1(Z_1^(2N))} := Σ_{k=1}^K (w_k/N_k) Σ_{n=1}^{N_k} ( |f(z_n^(S_k))| + |f(z′_n^(S_k))| ).   (9)

It is noteworthy that this variant of the ℓ1 norm is still a norm on the functional space, which can easily be verified from the definition of a norm, so we omit the verification here. In the situation of domain adaptation with multiple sources, setting the metric d as ℓ_w^1(Z_1^(2N)), we then define the uniform entropy number of F with respect to the metric ℓ_w^1(Z_1^(2N)) as

ln N_w^1(F, ξ, 2N) := sup_{Z_1^(2N)} ln N(F, ξ, ℓ_w^1(Z_1^(2N))).   (10)
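To make the covering-number machinery concrete, the following sketch computes a cover of a finite function class under an empirical weighted ℓ1 (pseudo)metric of the kind underlying (9), via the standard greedy construction. The specific class (threshold functions) and samples are illustrative assumptions, and a greedy cover only upper-bounds the minimum cover size of Definition 4.1:

```python
import numpy as np

def weighted_l1_dist(fa, fb, samples, w):
    """Empirical weighted l1 distance between two functions:
    sum_k (w_k / N_k) * sum_n |fa(z) - fb(z)| over source k's samples."""
    return sum(wk * np.mean(np.abs(fa(zk) - fb(zk)))
               for wk, zk in zip(w, samples))

def greedy_cover_size(funcs, samples, w, xi):
    """Greedily pick centers until every f is within xi of some center."""
    centers = []
    for f in funcs:
        if not any(weighted_l1_dist(f, c, samples, w) <= xi for c in centers):
            centers.append(f)
    return len(centers)

rng = np.random.default_rng(1)
samples = [rng.normal(size=50), rng.normal(size=80)]  # K = 2 sources
w = [0.5, 0.5]
# Finite class: threshold functions f_t(z) = 1[z <= t].
funcs = [(lambda t: (lambda z: (z <= t).astype(float)))(t)
         for t in np.linspace(-2, 2, 41)]
n_cover = greedy_cover_size(funcs, samples, w, xi=0.25)
```

Taking the supremum of such cover sizes over all sample sets (and their ghost samples) gives the distribution-free uniform entropy number in (10).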
5 Domain Adaptation with Multiple Sources
In this section, we present the generalization bound for domain adaptation with multiple sources. Based on the resultant bound, we then analyze the asymptotic convergence and the rate of convergence of the learning process for this kind of domain adaptation.
5.1 Generalization Bounds for Domain Adaptation with Multiple Sources
Based on the aforementioned uniform entropy number, the generalization bound for domain adaptation with multiple sources is presented in the following theorem:
Theorem 5.1
Assume that F is a function class consisting of bounded functions with range [a, b]. Let w = (w_1, ···, w_K) ∈ [0, 1]^K with Σ_{k=1}^K w_k = 1. Then, given an arbitrary ξ > D_F^(w)(S, T), we have for any N_1, ···, N_K ∈ ℕ and any ε > 0, with probability at least 1 − ε,

sup_{f∈F} |E^(T) f − E_w^N f| ≤ D_F^(w)(S, T) + ( 32(b−a)² (Σ_{k=1}^K w_k²/N_k) ( ln N_w^1(F, ξ′/8, 2N) − ln(ε/8) ) )^{1/2},   (11)

where ξ′ := ξ − D_F^(w)(S, T) and

D_F^(w)(S, T) := Σ_{k=1}^K w_k D_F(S_k, T).   (12)

In the above theorem, we show that the generalization bound sup_{f∈F} |E^(T) f − E_w^N f| can be bounded by the right-hand side of (11). Compared to the classical result under the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of [22]): with probability at least 1 − ε,

sup_{f∈F} |E f − E_N f| ≤ ( (32(b−a)²/N) ( ln N_1(F, ξ/8, 2N) − ln(ε/8) ) )^{1/2},   (13)

with E_N f being the empirical risk with respect to the sample set Z_1^N, there is an additional discrepancy quantity D_F^(w)(S, T) that is determined by two factors: the choice of w and the quantities D_F(S_k, T) (1 ≤ k ≤ K). The two results coincide if every source domain matches the target domain, i.e., if D_F(S_k, T) = 0 holds for any 1 ≤ k ≤ K.
In order to prove this result, we develop the related Hoeffding-type deviation inequality and the symmetrization inequality for domain adaptation with multiple sources. The detailed proof is provided in the supplement (part B). By using the resultant bound (11), we can analyze the asymptotic behavior of the learning process for domain adaptation with multiple sources.
5.2 Asymptotic Convergence
In statistical learning theory, it is well known that the complexity of the function class is the main factor in the asymptotic convergence of the learning process under the assumption of same distribution (cf. [3, 4, 22]).

Theorem 5.1 directly leads to the following theorem, which establishes the asymptotic convergence of the learning process for domain adaptation with multiple sources:
Theorem 5.2
Assume that F is a function class consisting of bounded functions with range [a, b]. Let w = (w_1, ···, w_K) ∈ [0, 1]^K with Σ_{k=1}^K w_k = 1. If the following condition holds:

lim_{N_1,···,N_K→+∞} ( ln N_w^1(F, ξ′/8, 2N) ) · ( Σ_{k=1}^K w_k²/N_k ) < +∞   (14)

with ξ′ := ξ − D_F^(w)(S, T), then we have for any ξ > D_F^(w)(S, T),

lim_{N_1,···,N_K→+∞} Pr{ sup_{f∈F} |E^(T) f − E_w^N f| > ξ } = 0.   (15)

As shown in Theorem 5.2, if the choice of w ∈ [0, 1]^K and the uniform entropy number ln N_w^1(F, ξ′/8, 2N) satisfy the condition (14) with ξ′ = ξ − D_F^(w)(S, T), the probability of the event "sup_{f∈F} |E^(T) f − E_w^N f| > ξ" will converge to zero for any ξ > D_F^(w)(S, T), when the sample numbers N_1, ···, N_K of the multiple sources go to infinity. This is partially in accordance with the classical result on the asymptotic convergence of the learning process under the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of [22]): the probability of the event "sup_{f∈F} |E f − E_N f| > ξ" will converge to zero for any ξ > 0, if the uniform entropy number ln N_1(F, ξ, N) satisfies the following condition:

lim_{N→+∞} ln N_1(F, ξ, N) / N < +∞.   (16)

Note that in the learning process of domain adaptation with multiple sources, the uniform convergence of the empirical risk on the source domains to the expected risk on the target domain may not hold, because the limit (15) does not hold for any ξ > 0 but only for any ξ > D_F^(w)(S, T). By contrast, the corresponding limit holds for all ξ > 0 in the learning process under the assumption of same distribution, if the condition (16) is satisfied.
By the Cauchy–Schwarz inequality, setting w_k = N_k/N (1 ≤ k ≤ K) minimizes the second term of the right-hand side of (11), and we then arrive at

sup_{f∈F} |E^(T) f − E_w^N f| ≤ D_F^(w)(S, T) + ( (32(b−a)²/N) ( ln N_w^1(F, ξ′/8, 2N) − ln(ε/8) ) )^{1/2},   (17)

which implies that setting w_k = N_k/N results in the fastest rate of convergence; our numerical experiments presented in the next section also support this point (cf. Fig. 1).
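The Cauchy–Schwarz step here is that Σ_k w_k²/N_k ≥ (Σ_k w_k)²/Σ_k N_k = 1/N, with equality exactly when w_k ∝ N_k. A quick numerical check of this fact (the sample sizes below are chosen arbitrarily):

```python
import numpy as np

N = np.array([200.0, 600.0, 1200.0])          # arbitrary N_1, ..., N_K
w_star = N / N.sum()                          # proportional weights

def variance_factor(w):
    """The w-dependent factor sum_k w_k^2 / N_k in the bound."""
    return float(np.sum(np.asarray(w) ** 2 / N))

# Proportional weights attain the lower bound 1/N ...
assert np.isclose(variance_factor(w_star), 1.0 / N.sum())
# ... and beat any other weighting drawn from the simplex.
rng = np.random.default_rng(2)
for _ in range(1000):
    w = rng.dirichlet(np.ones(len(N)))
    assert variance_factor(w) >= variance_factor(w_star) - 1e-12
```

Intuitively, larger sources carry more reliable empirical means, so they should receive proportionally larger weight.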
Figure 1. Domain Adaptation with Multiple Sources.
6 Numerical Experiments
We performed numerical experiments to verify the theoretical analysis of the asymptotic convergence of the learning process for domain adaptation with multiple sources. Without loss of generality, we only consider the case of K = 2, i.e., there are two source domains and one target domain. The experiment data are generated in the following way:
For the target domain Z^(T) = X^(T) × Y^(T) ⊂ ℝ^100 × ℝ, we consider D^(T) as a Gaussian distribution N(0, 1) and draw the inputs {x_n^(T)}_{n=1}^{N_T} from D^(T) randomly and independently. Let β ∈ ℝ^100 be a random vector and let R be a noise term. For any 1 ≤ n ≤ N_T, we randomly draw β and R from N(1, 5) and N(0, 0.01), respectively, and then generate y_n^(T) as follows:

y_n^(T) = ⟨x_n^(T), β⟩ + R.   (18)

The derived samples {(x_n^(T), y_n^(T))}_{n=1}^{N_T} are the samples of the target domain Z^(T) and will be used as the test data.

In a similar way, we derive the sample set {(x_n^(S_1), y_n^(S_1))}_{n=1}^{N_1} of the source domain Z^(S_1) = X^(S_1) × Y^(S_1) ⊂ ℝ^100 × ℝ: for any 1 ≤ n ≤ N_1,

y_n^(S_1) = ⟨x_n^(S_1), β⟩ + R,   (19)

where x_n^(S_1) is drawn from the input distribution D^(S_1), β ~ N(1, 5) and R ~ N(0, 0.5).

For the source domain Z^(S_2) = X^(S_2) × Y^(S_2) ⊂ ℝ^100 × ℝ, the samples {(x_n^(S_2), y_n^(S_2))}_{n=1}^{N_2} are generated in the following way: for any 1 ≤ n ≤ N_2,

y_n^(S_2) = ⟨x_n^(S_2), β⟩ + R,   (20)

where x_n^(S_2) is drawn from the input distribution D^(S_2), β ~ N(1, 5) and R ~ N(0, 0.5).
In this experiment, we use the method of least squares regression to minimize the empirical risk

(w/N_1) Σ_{n=1}^{N_1} (⟨x_n^(S_1), g⟩ − y_n^(S_1))² + ((1−w)/N_2) Σ_{n=1}^{N_2} (⟨x_n^(S_2), g⟩ − y_n^(S_2))²   (21)

for different combination coefficients w ∈ {0.1, 0.3, 0.5, 0.9}, and then compute the discrepancy |E^(T)(ℓ ∘ g_w) − E_w^N(ℓ ∘ g_w)| for each N_1 + N_2. Initially, N_1 and N_2 both equal 200. Each test is repeated 30 times, and the final result is the average of the 30 results. After each test, we increment both N_1 and N_2 by 200 until N_1 = N_2 = 2000. The experiment results are shown in Fig. 1.
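This procedure can be sketched at a small scale as follows. The linear data model y = ⟨x, β⟩ + noise, the dimension d = 20, the input shifts, and the sample sizes are all illustrative assumptions of this sketch, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n1, n2, nt = 20, 200, 200, 1000
beta = rng.normal(1.0, np.sqrt(5.0), size=d)   # fixed ground-truth coefficients

def make_domain(mean, n):
    """Linear model with Gaussian inputs (shifted mean) and additive noise."""
    x = rng.normal(mean, 1.0, size=(n, d))
    y = x @ beta + rng.normal(0.0, 0.5, size=n)
    return x, y

x1, y1 = make_domain(0.3, n1)    # source 1 (inputs shifted one way)
x2, y2 = make_domain(-0.3, n2)   # source 2 (inputs shifted the other way)
xt, yt = make_domain(0.0, nt)    # target (test data)

def fit_and_discrepancy(w):
    """Weighted least squares on the sources; gap between the target risk
    and the w-weighted empirical source risk of the fitted model."""
    # Row-weight each source so source k contributes w_k * (1/N_k) * sum (...)^2.
    sw = np.concatenate([np.full(n1, w / n1), np.full(n2, (1 - w) / n2)])
    X = np.vstack([x1, x2]) * np.sqrt(sw)[:, None]
    Y = np.concatenate([y1, y2]) * np.sqrt(sw)
    g = np.linalg.lstsq(X, Y, rcond=None)[0]
    src = w * np.mean((x1 @ g - y1) ** 2) + (1 - w) * np.mean((x2 @ g - y2) ** 2)
    tgt = np.mean((xt @ g - yt) ** 2)
    return abs(tgt - src)

gaps = {w: fit_and_discrepancy(w) for w in (0.1, 0.3, 0.5, 0.9)}
```

Repeating this over growing N_1, N_2 and averaging over trials produces curves of the discrepancy against N_1 + N_2, one per choice of w.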
From Fig. 1, we can observe that for any choice of w, the discrepancy curve decreases as N_1 + N_2 increases, which is in accordance with the results presented in Theorems 5.1 & 5.2. Moreover, when w = 0.5, the discrepancy has the fastest rate of convergence, and the rate becomes slower as w moves further away from 0.5. In this experiment, we set N_1 = N_2, which implies that N_2/(N_1 + N_2) = 0.5. Recalling (17), we have shown that w = N_2/(N_1 + N_2) provides the fastest rate of convergence, and this proposition is supported by the experiment results shown in Fig. 1.
7 Conclusion
In this paper, we present a new framework to study the generalization bounds of the learning process for domain adaptation. We use the integral probability metric to measure the difference between the distributions of two domains. Then, we use a martingale method to develop the specific deviation inequality and the symmetrization inequality incorporating the integral probability metric. Next, we utilize the resultant deviation inequality and symmetrization inequality to derive the generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation.
We point out that the asymptotic convergence of the learning process is determined by the complexity of the function class F, measured by the uniform entropy number. This is partially in accordance with the classical result on the asymptotic convergence of the learning process under the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of [22]). Moreover, the rate of convergence of this learning process matches that of the learning process under the assumption of same distribution. The numerical experiments support our results. Finally, we give a comparison with the previous works [13, 14, 15, 16, 17, 18, 20] (cf. supplement, part D).
It is noteworthy that by Theorem 2.18 of [22], the generalization bound (11) can lead to the result based on the fat-shattering dimension. According to Theorem 2.6.4 of [4], the bound based on the VC dimension can also be obtained from the result (11). In our future work, we will attempt to find a new distance between two distributions and develop the generalization bounds based on other complexity measures, e.g., Rademacher complexities, and analyze other theoretical properties of domain adaptation.
Acknowledgments
This research is sponsored in part by NSF (IIS-0953662, CCF-1025177), NIH (LM010730), and ONR (N00014-1-1-0108).
Footnotes
1. Due to the page limit, the discussion on domain adaptation combining source and target data is provided in the supplement (part C).

2. Due to the page limit, we only present the generalization bounds for domain adaptation with multiple sources; the discussions of the corresponding deviation inequalities and symmetrization inequalities are provided in the supplement (part B), along with the proofs of the main results.
Contributor Information
Chao Zhang, Email: czhan117@asu.edu.
Lei Zhang, Email: zhanglei.njust@yahoo.com.cn.
Jieping Ye, Email: jieping.ye@asu.edu.
References
- 1. Vapnik VN. An overview of statistical learning theory. IEEE Transactions on Neural Networks. 1999;10(5):988–999. doi: 10.1109/72.788640.
- 2. Bousquet O, Boucheron S, Lugosi G. Introduction to statistical learning theory. In: Bousquet O, et al., editors. Advanced Lectures on Machine Learning. 2004. pp. 169–207.
- 3. Vapnik VN. Statistical Learning Theory. New York: John Wiley and Sons; 1998.
- 4. van der Vaart A, Wellner J. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; 2000.
- 5. Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK. Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM. 1989;36(4):929–965.
- 6. Bartlett PL, Bousquet O, Mendelson S. Local Rademacher complexities. Annals of Statistics. 2005;33:1497–1537.
- 7. Hussain Z, Shawe-Taylor J. Improved loss bounds for multiple kernel learning. Journal of Machine Learning Research - Proceedings Track. 2011;15:370–377.
- 8. Jiang J, Zhai C. Instance weighting for domain adaptation in NLP. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL); 2007. pp. 264–271.
- 9. Blitzer J, Dredze M, Pereira F. Biographies, bollywood, boomboxes and blenders: domain adaptation for sentiment classification. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL); 2007. pp. 440–447.
- 10. Bickel S, Brückner M, Scheffer T. Discriminative learning for differing training and test distributions. Proceedings of the 24th International Conference on Machine Learning (ICML); 2007. pp. 81–88.
- 11. Wu P, Dietterich TG. Improving SVM accuracy by training on auxiliary data sources. Proceedings of the 21st International Conference on Machine Learning (ICML); 2004. pp. 871–878.
- 12. Blitzer J, McDonald R, Pereira F. Domain adaptation with structural correspondence learning. Conference on Empirical Methods in Natural Language Processing (EMNLP); 2006. pp. 120–128.
- 13. Ben-David S, Blitzer J, Crammer K, Kulesza A, Pereira F, Wortman J. A theory of learning from different domains. Machine Learning. 2010;79:151–175.
- 14. Crammer K, Kearns M, Wortman J. Learning from multiple sources. Advances in Neural Information Processing Systems (NIPS); 2006.
- 15. Crammer K, Kearns M, Wortman J. Learning from multiple sources. Journal of Machine Learning Research. 2008;9:1757–1774.
- 16. Mansour Y, Mohri M, Rostamizadeh A. Domain adaptation with multiple sources. Advances in Neural Information Processing Systems (NIPS); 2008. pp. 1041–1048.
- 17. Mansour Y, Mohri M, Rostamizadeh A. Multiple source adaptation and the Rényi divergence. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI); 2009.
- 18. Blitzer J, Crammer K, Kulesza A, Pereira F, Wortman J. Learning bounds for domain adaptation. Advances in Neural Information Processing Systems (NIPS); 2007.
- 19. Ben-David S, Blitzer J, Crammer K, Pereira F. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems (NIPS); 2006. pp. 137–144.
- 20. Mansour Y, Mohri M, Rostamizadeh A. Domain adaptation: learning bounds and algorithms. Conference on Learning Theory (COLT); 2009.
- 21. Hoeffding W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association. 1963;58(301):13–30.
- 22. Mendelson S. A few notes on statistical learning theory. Lecture Notes in Computer Science. 2003;2600:1–40.
- 23. Zolotarev VM. Probability metrics. Theory of Probability and its Applications. 1984;28(1):278–302.
- 24. Rachev ST. Probability Metrics and the Stability of Stochastic Models. John Wiley and Sons; 1991.
- 25. Müller A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability. 1997;29(2):429–443.
- 26. Reid MD, Williamson RC. Information, divergence and risk for binary experiments. Journal of Machine Learning Research. 2011;12:731–817.
- 27. Sriperumbudur BK, Gretton A, Fukumizu K, Lanckriet GRG, Schölkopf B. A note on integral probability metrics and φ-divergences. CoRR. 2009;abs/0901.2698.

