Author manuscript; available in PMC: 2023 Sep 27.

Published in final edited form as: Annu Rev Biomed Data Sci. 2023 Apr 27;6:153–171. doi: 10.1146/annurev-biodatasci-020722-020704

A conceptual framework for elucidating the data distribution discrepancies among subpopulations and their implications for machine learning. We consider a population consisting of two subpopulations, 1 and 2, where $X$ represents the input features for machine learning and $Y$ represents the prediction target variable. From the machine learning perspective, the two subpopulations can be viewed as two domains. Covariate shift is the situation where the marginal distributions $P(X)$ of the two domains differ while the conditional distributions $P(Y \mid X)$ are the same. Concept drift is the situation where the conditional distributions $P(Y \mid X)$ of the two domains differ while the marginal distributions $P(X)$ are the same. Dataset shift is the more general situation where the joint distributions $P(X, Y)$ of the two domains differ because at least one of the conditional and marginal distributions differs. Because the joint distribution is the product of the conditional and marginal distributions, covariate shift and concept drift are two special cases of dataset shift. The dashed curves represent the decision boundaries separating the two classes of samples ($Y = 0$ and $Y = 1$). A decision boundary is determined by the conditional distribution, which represents the causal mechanism (136) that generates $Y$ from $X$.
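
The three situations described in the caption can be written compactly. The following is a sketch in notation that is assumed here rather than taken from the figure: $P_k$ denotes a distribution in domain (subpopulation) $k \in \{1, 2\}$, with the marginal taken over $X$ and the conditional taken as $Y$ given $X$.

$$
\begin{aligned}
\text{Factorization:} \quad & P_k(X, Y) = P_k(Y \mid X)\, P_k(X), \qquad k \in \{1, 2\} \\
\text{Covariate shift:} \quad & P_1(X) \neq P_2(X), \quad P_1(Y \mid X) = P_2(Y \mid X) \\
\text{Concept drift:} \quad & P_1(Y \mid X) \neq P_2(Y \mid X), \quad P_1(X) = P_2(X) \\
\text{Dataset shift:} \quad & P_1(X, Y) \neq P_2(X, Y)
\end{aligned}
$$

Under this reading, the factorization is what makes covariate shift and concept drift special cases: if exactly one factor differs between the domains, the two products cannot be identical, so the joint distributions differ as well.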