Author manuscript; available in PMC: 2025 Dec 30.
Published in final edited form as: Proc Mach Learn Res. 2025 Jul;267:55757–55787.

“Who experiences large model decay and why?” A Hierarchical Framework for Diagnosing Heterogeneous Performance Drift

Harvineet Singh 1, Fan Xia 1, Alexej Gossmann 2, Andrew Chuang 1, Julian C Hong 1, Jean Feng 1
PMCID: PMC12747154  NIHMSID: NIHMS2128097  PMID: 41472673

Abstract

Machine learning (ML) models frequently experience performance degradation when deployed in new contexts. Such degradation is rarely uniform: some subgroups may suffer large performance decay while others may not. Understanding where and how large differences in performance arise is critical for designing targeted corrective actions that mitigate decay for the most affected subgroups while minimizing any unintended effects. Current approaches do not provide such detailed insight, as they either (i) explain how average performance shifts arise or (ii) identify adversely affected subgroups without insight into how this occurred. To this end, we introduce a Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT). SHIFT first asks “Is there any subgroup with unacceptably large performance decay due to covariate/outcome shifts?” (Where?) and, if so, dives deeper to ask “Can we explain this using more detailed variable(subset)-specific shifts?” (How?). In real-world experiments, we find that SHIFT identifies interpretable subgroups affected by performance decay, and suggests targeted actions that effectively mitigate the decay.1

1. Introduction

ML algorithms are known to degrade in performance when applied in different contexts, which has led to extensive work on explaining how differences in an ML algorithm's average performance arise (Cai et al., 2023; Zhang et al., 2023). However, performance differences are rarely uniform in practice: some subgroups may experience severe performance degradation while others may experience negligible differences, if any (Yang et al., 2023). Understanding the subgroups where shifts are most pronounced and providing subgroup-specific explanations is critical from the perspectives of algorithmic fairness (Mitchell et al., 2021) and backwards compatibility (Srivastava et al., 2020). Subgroup-level explanations can also help model developers design targeted corrective actions that only modify the algorithm's behavior in the most affected subgroups and limit any other unintended effects ("Don't fix what ain't broke") (Globus-Harris et al., 2022; Suriyakumar et al., 2023).

For instance, suppose an ML algorithm for predicting unplanned readmission achieves overall accuracy of 85% in hospital A and 83% in hospital B. While the change in overall accuracy may not be clinically significant, the change within some subgroup may be sufficiently large to be deemed harmful. If so, it is natural to ask how this heterogeneity in performance decay arose: was it due to a change in how certain diseases are recorded, which medications are prescribed for certain patients, or something else altogether? If we know the affected subgroup and why it was affected, we can specifically address the root cause, such as by updating the data pre-processing and/or the algorithm within that subgroup.

As such, our goal is to simultaneously understand where an ML algorithm performs substantially worse and how this arose. Numerous methods have been developed to find subgroups where an ML algorithm performs poorly (d'Eon et al., 2022; Eyuboglu et al., 2022; Liu et al., 2023; Subbaswamy et al., 2024), which can in principle be extended to identify subgroups with large model decay. Answering "how" is trickier. We can obtain an approximate high-level answer by decomposing the average performance drop within an identified subgroup into the contribution from a shift in the marginal distribution of the input features X (covariate shift) versus a shift in the conditional distribution of the target Y given X (outcome shift) (Quinonero-Candela et al., 2009; Cai et al., 2023).

However, this is only a partial solution. For one, it misses situations where the subgroup of individuals experiencing severe covariate shifts is not the same as the one experiencing severe outcome shifts, and each subgroup may require different corrective actions. More importantly, we often want to know precisely which subset of input variables was involved, as many real-world shifts involve only a few variables (i.e. are sparse) and can be fixed in a targeted manner (Castro et al., 2020; Finlayson et al., 2021). Existing methods are currently insufficient, as they rely on assumptions that often do not hold in practice, e.g. that the true causal graph is known (Zhang et al., 2023; Quintas-Martinez et al., 2024), that the data follows simple parametric models (Baron & Kenny, 1986), or that unrealistically large datasets are available (Singh et al., 2024).

To overcome these limitations, we present a nonparametric Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT) (Fig 1). Whereas prior works have approached drift diagnosis primarily through the lens of estimation, SHIFT approaches it through hypothesis testing. The advantage is that hypothesis tests answer simple yes/no questions, which is often more feasible in settings with limited data; in fact, we conduct omnibus tests, which require even less data as they do not need to identify the entire subgroup that is adversely affected. Furthermore, hypothesis tests allow us to check the very assumptions that other works have taken at face value. The first stage of SHIFT performs a high-level analysis: decomposing the distribution shift into an "aggregate" covariate shift and an "aggregate" outcome shift, each with respect to all of X, SHIFT tests if either has led to unacceptably worse performance in any meaningfully large subgroup (Where?). If so, the second stage drills down to test if this can be adequately explained by a shift solely with respect to a sparse subset of variables in X (How?). The major contributions of this work are:

  • Introduction of a novel hierarchical hypothesis testing framework that detects subgroups experiencing large performance decay due to aggregate-level covariate/outcome shifts, which are then explained using detailed variable(subset)-specific shifts.

  • SHIFT does not rely on strong assumptions and is suitable for smaller datasets, making it broadly applicable to real-world scenarios.

  • Our simulations demonstrate that SHIFT correctly identifies relevant shifts. Real-world experiments show that SHIFT can guide the design of model/data corrections that strictly improve performance.

Figure 1:

Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT) is a two-stage hypothesis testing procedure that first checks if there is a subgroup with unacceptably large performance decay due to aggregate covariate and outcome shifts with respect to all X variables. If so, it checks if this can be explained by detailed variable(subset)-specific shifts. Red indicates the shift was flagged for further investigation. In this example, covariate shift is flagged because it affected a subgroup and variables X1 and X3 were flagged as potential explanations.

2. Related Work

We briefly discuss the three most related areas below (summarized in Table 1). See Appendix E for more detailed discussion as well as other related areas.

Table 1:

Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT) compared to prior work

| Category (Example methods) | Detects subgroup with large decay? | Valid hypothesis test? | Avoids detailed causal graph? | Detailed explanations for outcome/covariate shifts? |
|---|---|---|---|---|
| Detect any shift (Rabanser et al., 2019; Zhang et al., 2011) | No | Yes | Yes | Outcome only |
| Detect loss shift (Podkopaev & Ramdas, 2022) | No | Yes | Yes | No |
| Decompose average perf decay (Zhang et al., 2023; Cai et al., 2023; Quintas-Martinez et al., 2024) | No | Some methods | No | Covariate only |
| Decompose shift variability (Singh et al., 2024) | No | No | Yes | Outcome & Covariate |
| Decompose ATE (Baron & Kenny, 1986) | No | Parametric only | No | Covariate only |
| Decompose CATE variability (Hines et al., 2023) | No | No | Yes | Outcome only |
| Subgroup discovery (Eyuboglu et al., 2022; d'Eon et al., 2022) | No | Some | Yes | No |
| SHIFT (Proposed) | Yes | Yes | Yes | Outcome & Covariate |

Detecting distribution shifts.

Many methods have been developed to detect any shift in marginal/conditional distributions, such as Kolmogorov-Smirnov (KS) (Rabanser et al., 2019), kernel-based tests (Zhang et al., 2011), and Maximum Mean Discrepancy (MMD) (Gretton et al., 2012a; Luedtke et al., 2018). More recent works focus on detecting only those that are harmful to overall performance (Podkopaev & Ramdas, 2022; Panda et al., 2024). No prior methods have been developed to specifically detect distribution shifts that lead to disproportionate harm in a (sufficiently large) subgroup, which SHIFT aims to address.

Decomposing model performance.

Various methods have been developed to quantify the contribution of each feature subset to the average performance (Budhathoki et al., 2021; Cai et al., 2023; Wu et al., 2021; Zhang et al., 2023; Quintas-Martinez et al., 2024) and, more recently, the variability of performance changes (Singh et al., 2024). Mathematically, these methods rely on techniques similar to those used in mediation analysis for decomposing the average treatment effect into indirect and direct effects (Baron & Kenny, 1986) and variable importance (VI) methods for explaining the variability of the conditional average treatment effect (CATE) (Hines et al., 2023), respectively. Most of these methods either assume a parametric model or knowledge of the causal graph between individual variables. While VI methods that focus on decomposing variability have much weaker assumptions (Hines et al., 2023; Singh et al., 2024), they generally require large datasets and their confidence intervals (CIs) cannot be easily inverted to produce valid hypothesis tests (the influence function is degenerate because the estimand is at the boundary of the parameter space under the null, leading to inflated Type I error rates) (Hudson, 2023). Through carefully framed hypothesis tests, SHIFT provides valid statistical inference without parametric assumptions or knowledge of a detailed causal graph.

Discovering subgroups.

Methods have been developed to identify subgroups with low performance within a single distribution (Eyuboglu et al., 2022; d’Eon et al., 2022; Ali et al., 2022; Feng et al., 2024a; Dong et al., 2024; Rauba et al., 2024; Subbaswamy et al., 2024) and subgroups with large CATE (Athey et al., 2019). However, most methods only provide point estimates and not statistical inference (CIs/hypothesis tests). More importantly, no existing methods can be directly adapted to explain how large performance decay arises across subgroups with respect to variable-specific shifts.

3. Hierarchical Testing Framework

Given a set of features X and an outcome Y, we want to understand the difference in performance of an algorithm f across source and target domains, denoted by d = 0 and d = 1, respectively. We refer to the joint distribution of (X, Y) in each domain by $p_d$ and its corresponding expectation by $\mathbb{E}_d$. Performance is quantified by a loss function $\ell(y, f(x)) \in \mathbb{R}$. The average loss conditional on x in domain d is denoted $Z_d(x) \coloneqq \mathbb{E}_d[\ell(Y, f(X)) \mid X = x]$ for $d \in \{0, 1\}$. Hat notation denotes an estimate.

A shift in the joint distribution of (X, Y) can be decomposed into aggregate covariate and outcome shifts, which are defined by the shifts $p_0(x) \to p_1(x)$ and $p_0(y \mid x) \to p_1(y \mid x)$, respectively. In this way, the shift from source to target can be broken down into a sequence of aggregate-level shifts:

$$p_0(x)\, p_0(y \mid x) \;\to\; p_1(x)\, p_0(y \mid x) \;\to\; p_1(x)\, p_1(y \mid x) \qquad (1)$$

and, correspondingly, the average performance change can be decomposed into:

$$\mathbb{E}_1[\ell] - \mathbb{E}_0[\ell] = \underbrace{\mathbb{E}_1[Z_0] - \mathbb{E}_0[Z_0]}_{\text{covariate shift}} + \underbrace{\mathbb{E}_1[Z_1 - Z_0]}_{\text{outcome shift}} \qquad (2)$$
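As a sanity check on the two-term decomposition in (2), it can be verified numerically on a toy example where the conditional mean losses $Z_0$, $Z_1$ and both covariate distributions are chosen by hand (all functions and values below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conditional mean losses Z_d(x) = E_d[loss | X = x]; the gap Z_1 - Z_0
# encodes an outcome shift, the change in p(x) a covariate shift.
def Z0(x):
    return 0.2 + 0.1 * x

def Z1(x):
    return 0.3 + 0.1 * x  # outcome shift: conditional loss uniformly 0.1 higher

x0 = rng.normal(0.0, 1.0, 100_000)  # source covariates
x1 = rng.normal(0.5, 1.0, 100_000)  # target covariates (covariate shift)

total = Z1(x1).mean() - Z0(x0).mean()           # E_1[Z_1] - E_0[Z_0]
covariate_term = Z0(x1).mean() - Z0(x0).mean()  # E_1[Z_0] - E_0[Z_0]
outcome_term = (Z1(x1) - Z0(x1)).mean()         # E_1[Z_1 - Z_0]

# The two terms reconstruct the total change exactly, as in (2).
assert np.isclose(total, covariate_term + outcome_term)
```

Here the outcome term is exactly 0.1 by construction, while the covariate term is roughly 0.1 times the shift in the mean of X.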

To generate even more detailed explanations of performance shifts, we will consider sparse shifts solely with respect to variable subsets $X_s$ and use $p_s(x)$ and $p_s(y \mid x)$ to denote $X_s$-specific covariate and outcome shifts, respectively. We will present their exact definitions later.

SHIFT is a hierarchical diagnostic framework that performs a more detailed analysis of performance drift than the standard two-way decomposition in (2) by accounting for the heterogeneity of performance shifts. At the first level, SHIFT checks if the aggregate covariate and outcome shifts lead to subgroups with large performance decay. If so, SHIFT searches for a more detailed explanation among candidate variable(subset)-specific shifts.

Throughout, SHIFT focuses only on subgroups of individuals and performance shifts that are deemed large enough to be of practical interest, by requiring the domain expert to select a priori a minimum subgroup size $\epsilon > 0$ and a minimum shift magnitude $\tau \geq 0$. This is critical to ensure the practical usability of these methods, as alarms for negligible shifts lead to alarm fatigue (Cvach, 2012; Feng et al., 2025). The set $\mathcal{A}_\epsilon$ denotes all subgroups whose prevalence in both the source and target domains exceeds $\epsilon$.

The following two sections (Sec 3.1 and 3.2) introduce the aggregate and detailed hypothesis tests in SHIFT and Section 4 describes the actual testing procedures.

3.1. Aggregate tests: Where?

SHIFT first tests if there exists a subgroup with large performance decay due to an aggregate covariate shift and, likewise, a subgroup impacted by an aggregate outcome shift. The impacts of these shifts within a subgroup $A \in \mathcal{A}_\epsilon$ are quantified using a similar decomposition as (2), i.e.

$$\mathbb{E}_1[\ell \mid X \in A] - \mathbb{E}_0[\ell \mid X \in A] = \underbrace{\mathbb{E}_1[Z_1 - Z_0 \mid X \in A]}_{\text{outcome shift}} + \underbrace{\mathbb{E}_1[Z_0 \mid X \in A] - \mathbb{E}_0[Z_0 \mid X \in A]}_{\text{covariate shift}}.$$

This leads to tests for the following null hypotheses:

Hypothesis 3.1 (Agg covariate shift). $H_0^{X}$: For all subgroups $A \in \mathcal{A}_\epsilon$, the performance drift in A due to the aggregate covariate shift is no larger than the tolerance $\tau \geq 0$, i.e. $\mathbb{E}_1[Z_0(X) \mid X \in A] - \mathbb{E}_0[Z_0(X) \mid X \in A] \leq \tau$.

Hypothesis 3.2 (Agg outcome shift). $H_0^{Y \mid X}$: For all subgroups $A \in \mathcal{A}_\epsilon$, the performance drift in A due to the aggregate outcome shift is no larger than the tolerance $\tau \geq 0$, i.e. $\mathbb{E}_1[Z_1(X) - Z_0(X) \mid X \in A] \leq \tau$.

For each shift mechanism, rejection of the null means that there is a subgroup of concern and further investigation is warranted, thereby triggering a second stage of testing. Before diving into the second stage, we discuss connections between these aggregate tests and the existing literature.
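Operationally, the first-stage question can be caricatured as a scan over candidate subgroups for one whose drift exceeds $\tau$ while respecting the minimum prevalence $\epsilon$. The sketch below uses a fixed threshold grid and conditional losses that are known by construction, purely for illustration; SHIFT instead learns the detector and uses debiased estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)  # a single feature, for illustration

# Hypothetical conditional mean losses: the outcome shift only hurts x > 1.
z0 = np.full(n, 0.2)
z1 = 0.2 + 0.15 * (x > 1.0)

tau, eps = 0.05, 0.05
worst_exceedance = -np.inf
for t in np.linspace(-2.0, 2.0, 41):      # candidate subgroups A = {x > t}
    in_A = x > t
    if in_A.mean() < eps:                 # enforce minimum subgroup prevalence
        continue
    drift = (z1[in_A] - z0[in_A]).mean()  # E[Z_1 - Z_0 | X in A]
    worst_exceedance = max(worst_exceedance, drift - tau)

reject_H0 = worst_exceedance > 0          # some subgroup decays beyond tau
```

Note how the prevalence filter discards subgroups that are too small to be of practical interest, exactly as $\mathcal{A}_\epsilon$ does in Hypotheses 3.1 and 3.2.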

Connection to MMD.

The proposed tests assess for distributional differences by comparing the maximum difference in the mean loss along the shift sequence in (1). This shares similarities with MMD, which also measures the distance between two distributions in terms of the maximum difference in expected value over some function class (often referred to as the "critic") (Gretton et al., 2012a). To see the connection more formally, we rewrite the above tests in terms of binary detectors $h_A(X) = 1\{X \in A\}$ for subgroup A. Define the critic function class to be the set of "filtered" loss functions $\{(x, y) \mapsto h_A(x)\, \ell(f(x), y) : A \in \mathcal{A}_\epsilon\}$. For the first two distributions in (1), MMD defines their distance as the maximum average difference of the filtered loss, i.e.

$$\sup_{A \in \mathcal{A}_\epsilon} \mathbb{E}_{1,0}[\ell(f(X), Y)\, h_A(X)] - \mathbb{E}_{0,0}[\ell(f(X), Y)\, h_A(X)],$$

where $\mathbb{E}_{d_1, d_2}$ indicates the expectation with respect to the distribution $p_{d_1}(X)\, p_{d_2}(Y \mid X)$. In contrast, the aggregate covariate shift test can be viewed as measuring the maximum average difference of the conditional loss, i.e.

$$\sup_{A \in \mathcal{A}_\epsilon} \frac{\mathbb{E}_{1,0}[\ell(f(X), Y)\, h_A(X)]}{\mathbb{E}_{1,0}[h_A(X)]} - \frac{\mathbb{E}_{0,0}[\ell(f(X), Y)\, h_A(X)]}{\mathbb{E}_{0,0}[h_A(X)]}.$$

A similar analogy holds for the aggregate outcome shift, which compares the last two distributions in (1). Thus, SHIFT can be viewed as testing the Maximum conditional-Mean Discrepancy (McMD) rather than the MMD. Like MMD, McMD is zero when the compared distributions are equal. Unlike MMD, McMD can be large even when the mean difference is large in only a small subgroup, reflecting the priority it places on algorithmic fairness.

Connection to mediation analysis.

Prior works have highlighted that the decomposition of average performance change into covariate and outcome shifts parallels the decomposition of the average treatment effect into indirect and direct effects, which is commonly analyzed in causal mediation analysis (Castro et al., 2020; Singh et al., 2024). As this work decomposes subgroup-specific performance changes, it parallels recent efforts in the nascent but growing field on analyzing the heterogeneity of causal effect decompositions (Loh et al., 2020; Rubinstein et al., 2023). The omnibus tests developed in this work may thus be useful for testing heterogeneous indirect/direct effects, an area that has not been addressed thus far. We discuss these connections further in Appendix A.

3.2. Detailed tests: How?

For each shift mechanism, rejection of the first-stage test implies that there is a subgroup for which performance change was large. The next step is to find a detailed explanation, by identifying the variables most likely to be responsible.

SHIFT finds explanations by searching over a suite of candidate shifts with respect to individual variables or variable subsets. Because the true causal graph is typically unknown in practice, the set of all possible variable(subset)-specific shifts is exponentially large and a comprehensive search over all such shifts is computationally intractable. As such, SHIFT considers a restricted set of detailed candidate shifts as potential explanations. In this work, given a variable subset $X_s$, we consider the following:

  • Outcome shift: We consider the candidate $p_s(y \mid x) \coloneqq p_1(y \mid x_s, \mu_0(x))$, where $\mu_0(x) = p_0(y = 1 \mid x)$ is the outcome probability in the source domain. This is similar to shifts considered in model recalibration (Steyerberg, 2009), where the shift is defined relative to the outcome's original conditional probability in the source domain.

  • Covariate shift: We consider the candidate $p_s(x) \coloneqq p_1(x_s)\, p_0(x_{-s} \mid x_s)$. Such a shift may occur, for instance, if $X_s$ causally precedes $X_{-s}$, and it is commonly considered in prior works (Wu et al., 2021; Zhang et al., 2023; Singh et al., 2024).

Other candidate shifts are certainly possible (see Sec F) and we leave them to future work. Critically, unlike prior works that offer variable-level explanations of performance decay assuming these candidate shifts are actually true (Wu et al., 2021; Zhang et al., 2023), SHIFT does not assume that these candidate shifts are correctly specified because everything is conducted through the lens of hypothesis testing. Instead, SHIFT tests whether a candidate offers a good explanation.
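To make the covariate-shift candidate concrete: $p_s(x) = p_1(x_s)\, p_0(x_{-s} \mid x_s)$ changes only the marginal of $X_s$ while keeping the source conditional of the remaining variables. One simple way to realize it is to reweight source samples by the marginal density ratio $p_1(x_s)/p_0(x_s)$, which is known in closed form in the synthetic sketch below (the Gaussian choices are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Source: X1 ~ N(1,1) and X2 = X1 + N(0,1). Suppose the target only shifts
# the marginal of X1 to N(0,1); p_0(x2 | x1) is kept fixed.
x1 = rng.normal(1.0, 1.0, n)
x2 = x1 + rng.normal(0.0, 1.0, n)

def normal_pdf(v, m, s):
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Importance weights p_1(x1) / p_0(x1) realize p_s(x) = p_1(x1) p_0(x2 | x1).
w = normal_pdf(x1, 0.0, 1.0) / normal_pdf(x1, 1.0, 1.0)

mean_x1_s = np.average(x1, weights=w)     # marginal of X1 moves to ~0
resid_s = np.average(x2 - x1, weights=w)  # X2 | X1 relationship preserved (~0)
```

The reweighted marginal of $X_1$ matches the target, while the conditional relationship of $X_2$ given $X_1$ stays at its source value, which is exactly what the candidate shift postulates.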

Given candidate shifts, we now quantify how well they explain the heterogeneous performance changes in the data. We say that an aggregate covariate shift is well-explained by an $X_s$-specific covariate shift if the performance change induced by the former is well-approximated by that induced by the latter across all subgroups A, i.e. for all $A \in \mathcal{A}_\epsilon$,

$$\mathbb{E}_1[Z_0 \mid X \in A] - \mathbb{E}_0[Z_0 \mid X \in A] \approx \mathbb{E}_s[Z_0 \mid X \in A] - \mathbb{E}_0[Z_0 \mid X \in A],$$

where $\mathbb{E}_s$ is taken with respect to the $X_s$-specific covariate shift. Similarly, an aggregate outcome shift is well-explained by an $X_s$-specific outcome shift if for all $A \in \mathcal{A}_\epsilon$,

$$\mathbb{E}_1[Z_1 - Z_0 \mid X \in A] \approx \mathbb{E}_1[Z_s - Z_0 \mid X \in A], \qquad (3)$$

where $Z_s(X)$ is the expected loss under the candidate shift. This is formalized in detailed tests of $X_s$-specific shifts with the following null hypotheses:

Hypothesis 3.3 ($X_s$-specific covariate shift). $H_{0,s}^{X}$: For all subgroups $A \in \mathcal{A}_\epsilon$ and tolerance $\tau$, the candidate $X_s$-specific covariate shift explains the performance change, i.e. $\mathbb{E}_1[Z_0(X) \mid X \in A] - \mathbb{E}_s[Z_0(X) \mid X \in A] \leq \tau$.

Hypothesis 3.4 ($X_s$-specific outcome shift). $H_{0,s}^{Y \mid X}$: For all subgroups $A \in \mathcal{A}_\epsilon$ and tolerance $\tau$, the candidate $X_s$-specific outcome shift explains the performance change, i.e. $\mathbb{E}_1[Z_1(X) - Z_s(X) \mid X \in A] \leq \tau$.

If we fail to reject the null for an $X_s$-specific covariate or outcome shift, SHIFT flags it as potentially important. Then for some prespecified $\alpha > 0$, the potentially important variable subsets for covariate and outcome shifts are

$$\hat{\mathcal{S}}_n^{\text{shift}} = \{ s : \text{p-value for } H_{0,s}^{\text{shift}} > \alpha \} \qquad (4)$$

for $\text{shift} = Y \mid X$ and $\text{shift} = X$. A human expert can then verify which variables in $\hat{\mathcal{S}}_n^{\text{shift}}$ are the true root cause(s) and design targeted corrective actions.
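Note that the flagging rule in (4) inverts the usual rejection logic: a candidate shift is kept as a plausible explanation precisely when we fail to reject its null. A minimal sketch with made-up p-values:

```python
alpha = 0.05
# Hypothetical detailed-test p-values for candidate variable subsets.
pvalues = {("x1",): 0.62, ("x2",): 0.01, ("x1", "x3"): 0.40}

# Eq. (4): fail-to-reject => the candidate shift remains a plausible explanation.
S_hat = {s for s, p in pvalues.items() if p > alpha}
```

With these illustrative p-values, the subsets ("x1",) and ("x1", "x3") would be handed to the human expert, while ("x2",) is ruled out as an adequate explanation.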

Comparing the detailed and aggregate-level tests, one may notice that they have nearly the same mathematical structure and yet are interpreted differently to answer differing questions (where? versus how?). To see how this is possible, note that the tests could have been interpreted in the same way: aggregate-level tests check whether aggregate shifts are well-approximated by the zero function, i.e. whether $\mathbb{E}_1[Z_1 - Z_0 \mid X \in A] \approx 0$ and $\mathbb{E}_1[Z_0 \mid X \in A] - \mathbb{E}_0[Z_0 \mid X \in A] \approx 0$, while the detailed tests check if aggregate shifts are well-approximated by candidate $X_s$-specific shifts.

Remark 3.1 (Modified covariate shift tests). When there are features that are independent of the loss, covariate shifts in such features may still be flagged, which is undesirable. This occurs due to collider bias: conditioning on the subgroup indicator $1\{X \in A\}$ induces a correlation between the independent features and the loss. Section B gives more details. As a remedy, we filter out features that are uncorrelated with the loss as a data preprocessing step and then run the covariate shift tests as usual.

3.3. Visualization of SHIFT

Results from SHIFT are visualized in a hierarchical plot (Fig 1), where “red” means “flagged” and “gray” means “not flagged.” At the aggregate level, the covariate/outcome shift is “flagged” if a subgroup was found to have large performance decay due to that shift mechanism (null was rejected). To interpret aggregate-level test results, we summarize the detected subgroup using rule-based decision sets (Lakkaraju et al., 2016), although other ML explainability methods can be used instead. At the detailed level, we flag variable(subset)-specific covariate/outcome shifts that may offer a potential explanation of the heterogeneous performance shifts (null was not rejected). “Flag strength” is one minus the p-value for aggregate-level tests and the p-value for detailed tests. Note that if none of the candidate sparse shifts are adequate explanations, one may need to explore alternative shift explanations (e.g. less sparse).

4. Inference Procedure

We now describe the inference procedure for tests introduced in the previous section. We begin with rewriting each hypothesis test in terms of a simple target of inference. This will illuminate the general approach we would like to take, as well as the technical challenges we will encounter.

To illustrate, note that the aggregate outcome test can be equivalently expressed as testing the null hypothesis

$$H_0^{Y \mid X}: \; \sup_{A \in \mathcal{A}_\epsilon} \underbrace{\mathbb{E}_1\big[(Z_1(X) - Z_0(X) - \tau)\, h_A(X)\big]}_{\text{target of inference}} \leq 0. \qquad (5)$$

The target of inference can be interpreted as follows: each detector $h_A$ is weighted by how much the difference in expected loss exceeds the tolerance $\tau$, so the target is the Maximum Expected Exceedence (MEE) between the last two distributions in the shift sequence in (1). Similarly, the detailed outcome test can be rewritten as

$$H_{0,s}^{Y \mid X}: \; \sup_{A \in \mathcal{A}_\epsilon} \mathbb{E}_1\big[(Z_1(X) - Z_s(X) - \tau)\, h_A(X)\big] \leq 0. \qquad (6)$$

The aggregate and detailed covariate tests can be interpreted similarly, though the scaling term is not as clean:

$$H_0^{X}: \; \sup_{A \in \mathcal{A}_\epsilon} \mathbb{E}_0\big[\big(Z_0(X)(\tilde{\pi}_A(X) - 1) - \tau\big)\, h_A(X)\big] \leq 0 \qquad (7)$$
$$H_{0,s}^{X}: \; \sup_{A \in \mathcal{A}_\epsilon} \mathbb{E}_0\big[\big(Z_0(X)(\tilde{\pi}_A(X) - \tilde{\pi}_{s,A}(X)) - \tau\big)\, h_A(X)\big] \leq 0 \qquad (8)$$

where $\tilde{\pi}_A(x) = \frac{p_1(x)\, \mathbb{E}_0[h_A(X)]}{p_0(x)\, \mathbb{E}_1[h_A(X)]}$ and $\tilde{\pi}_{s,A}(x) = \frac{p_1(x_s)\, \mathbb{E}_0[h_A(X)]}{p_0(x_s)\, \mathbb{E}_s[h_A(X)]}$ are scaled density ratios. Given this rewriting of the MEE, we can now discuss two technical challenges that we can resolve, in part, through sample splitting.

First, estimating a supremum over the infinite number of binary detectors $h_A$ is computationally intractable. Nevertheless, our goal is simply hypothesis testing, not estimation. We can accomplish this by sample splitting, where one portion of the data is used for learning one (or a few) good candidate detector(s) $\hat{h}_A$ and the remaining data is used for testing the expected exceedence for $\hat{h}_A$. This can be viewed as running a restricted version of the original test, where we only test the MEE with respect to the singleton set $\{\hat{h}_A\}$ rather than all of $\mathcal{A}_\epsilon$. While this approach may be conservative, it provides statistical guarantees with fewer assumptions and better finite-sample behavior. The remaining question is how we can find good candidate detectors.

Second, the MEE involves unknown outcome models ($Z_d$) and scaled density ratio models ($\tilde{\pi}_A$ and $\tilde{\pi}_{s,A}$), which we collectively refer to as nuisance parameters. Prior works have shown that plug-in estimators, which use the same data to both train nuisance parameters and estimate targets of inference, are biased. Following results in double-debiased ML and semiparametric theory (Chernozhukov et al., 2018), we use sample splitting to remove some of this bias. The question is then how to remove the remaining bias to achieve the desired Type I error control.

Given the benefits of sample-splitting, the overall testing procedure uses this as the basis: Step 1 estimates candidate detectors and nuisance parameters on a training partition and Step 2 uses the remaining data to conduct a restricted test with respect to the fitted models (Figure 2). For ease of exposition, we describe the procedure for a single sample-split, but it can be easily extended with cross-fitting to improve statistical efficiency (Kennedy, 2024). Here we describe each step broadly and highlight key innovations needed to address the technical challenges. The detailed testing procedure (including hyperparameter selection) is given in Section C of the Appendix.
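The split-then-test recipe can be caricatured end-to-end in a few lines, with a noisy per-sample exceedance proxy standing in for the debiased estimator that SHIFT actually constructs (all names, thresholds, and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, tau = 20_000, 0.02
x = rng.normal(size=n)
# Noisy per-sample proxy for (Z_1 - Z_0 - tau)(x); the true signal is
# concentrated in the subgroup x > 1.
obs = 0.1 * (x > 1.0) - tau + rng.normal(0.0, 0.2, n)

half = n // 2
x_tr, obs_tr = x[:half], obs[:half]  # Step 1: pick a candidate detector
x_te, obs_te = x[half:], obs[half:]  # Step 2: held out for the actual test

# Step 1: threshold detector maximizing the estimated exceedance on the
# training half, subject to the minimum prevalence constraint.
best_t, best_val = None, -np.inf
for t in np.linspace(-2.0, 2.0, 21):
    mask = x_tr > t
    if mask.mean() < 0.05:
        continue
    val = (obs_tr * mask).mean()
    if val > best_val:
        best_t, best_val = t, val

# Step 2: one-sided test of E[(Z_1 - Z_0 - tau) h_A] <= 0 on held-out data.
stat = obs_te * (x_te > best_t)
z = stat.mean() / (stat.std(ddof=1) / np.sqrt(stat.size))
reject = z > 1.645  # alpha = 0.05, one-sided
```

Because the detector is chosen on one half and tested on the other, the test statistic on the held-out half behaves like an ordinary sample mean, which is what makes the simple normal-approximation test valid here.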

Figure 2:

Overview of testing procedure

Step 1. Estimate candidate detectors and nuisance parameters using the training partition.

The nuisance parameters can be estimated using ML following standard recipes (Kennedy, 2024). Estimating candidate detectors for the aggregate and detailed outcome shift tests is also straightforward. For the aggregate version, the estimand in (5) is maximized when the conditional mean $\mathbb{E}_1[(Z_1(X) - Z_0(X) - \tau)\, h_A(X) \mid X]$ is maximized pointwise, so the optimal detector is $h_A(X) = 1\{Z_1(X) - Z_0(X) - \tau > 0\}$. Consequently, we can take a plug-in approach to construct a candidate detector, i.e. $\hat{h}_A(X) = 1\{\hat{Z}_1(X) - \hat{Z}_0(X) - \tau > 0\}$. A similar approach can be taken for the detailed version.
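A minimal plug-in construction of this detector, using a crude binned regression as a stand-in for the GBT nuisance models used in the paper (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30_000
x0, x1 = rng.normal(size=n), rng.normal(size=n)
# 0-1 losses: the target's conditional mean loss is higher only for x > 0.
loss0 = rng.binomial(1, 0.2, n).astype(float)
loss1 = rng.binomial(1, np.where(x1 > 0.0, 0.35, 0.2)).astype(float)

edges = np.linspace(-3.0, 3.0, 13)  # crude binned regression of loss on x

def fit_binned(x, y):
    idx = np.digitize(x, edges)
    means = np.array([y[idx == k].mean() if np.any(idx == k) else y.mean()
                      for k in range(len(edges) + 1)])
    return lambda q: means[np.digitize(q, edges)]

Z0_hat = fit_binned(x0, loss0)  # estimate of Z_0(x)
Z1_hat = fit_binned(x1, loss1)  # estimate of Z_1(x)

tau = 0.05
grid = np.linspace(-2.5, 2.5, 11)
h_hat = (Z1_hat(grid) - Z0_hat(grid) - tau) > 0  # plug-in detector on a grid
```

On this toy problem the fitted detector fires where $x > 0$ (where the estimated loss gap of about 0.15 exceeds $\tau$) and stays off elsewhere.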

Estimating candidate detectors for the covariate shift tests is, however, not immediately obvious. For instance, the MEE in (7) cannot be maximized by individually maximizing its conditional mean, because of the shared ratio term $\mathbb{E}_0[h_A(X)] / \mathbb{E}_1[h_A(X)]$. Instead, we find the optimal detector by solving the dual for a sequence of optimization problems. That is, we can reframe the task as solving

$$\sup_{A} \; \mathbb{E}_0\big[\big(\hat{Z}_0(X)(\hat{\pi}(X)\, \omega - 1) - \tau\big)\, h_A(X)\big] \quad \text{s.t.} \quad \omega = \mathbb{E}_0[h_A(X)] / \mathbb{E}_1[h_A(X)] \qquad (9)$$

for some $\omega > 0$. Using the method of Lagrange multipliers, the solution must have the form $\hat{h}_A^{(\omega, \lambda)}(X) = 1\{(\hat{Z}_0(X) - \lambda)(\hat{\pi}(X)\, \omega - 1) \geq 0\}$ for some $\lambda \geq 0$. Thus we can estimate the optimal candidate detector by sweeping over a grid of $\omega$ and $\lambda$ values. We can estimate detectors for detailed covariate shifts in a similar manner.
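A sketch of the resulting detector family and grid sweep, with simple closed-form stand-ins for the fitted nuisances $\hat{Z}_0$ and $\hat{\pi}$ (the Gaussian density ratio and grid values below are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=10_000)  # source-domain draws

Z0_hat = 0.2 + 0.1 * np.abs(x)  # stand-in fitted conditional loss
pi_hat = np.exp(-x - 0.5)       # density ratio N(-1,1) / N(0,1)

def detector(omega, lam):
    # Dual-form solution: h(x) = 1{(Z0_hat(x) - lam)(pi_hat(x)*omega - 1) >= 0}
    return ((Z0_hat - lam) * (pi_hat * omega - 1.0) >= 0.0).astype(float)

best = None
for omega in (0.5, 1.0, 2.0):
    for lam in (0.0, 0.1, 0.3):
        h = detector(omega, lam)
        if h.mean() < 0.05:  # minimum prevalence constraint
            continue
        # Plug-in covariate-shift exceedance for this candidate (tau = 0 here).
        val = np.mean(Z0_hat * (pi_hat * omega - 1.0) * h)
        if best is None or val > best[0]:
            best = (val, omega, lam)
```

The sweep simply keeps the $(\omega, \lambda)$ pair whose detector yields the largest plug-in exceedance; the held-out test in Step 2 then decides whether that exceedance is statistically significant.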

Step 2. Conduct double-debiased tests on held-out data.

On the remaining data, we construct asymptotically linear estimators for the MEE with respect to the fitted candidate detector(s), using the approach of one-step correction.2 This is relatively straightforward for (5), (7), and (8) by noting the mathematical similarities between the MEE and direct/indirect effects in causal mediation analysis. However, one-step correction for the detailed outcome shift does not follow from standard recipes, which require the target of inference to be pathwise differentiable. The problem is that (6) involves $Z_s(X)$, which is not pathwise differentiable because its definition involves indicator functions. Still, we can sidestep this issue by leveraging the binning trick in Singh et al. (2024). Rather than defining an outcome shift as a function of $\mu_0(x)$, we define a binned outcome shift that replaces all occurrences of $\mu_0$ with a binned version. Assuming that the set of observations falling exactly on the bin edges has measure zero, we can show that the MEE with respect to the binned outcome shift is pathwise differentiable, so as to allow construction of an asymptotically linear estimator.
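The binning trick itself is mechanical: replace $\mu_0(x)$ everywhere with a piecewise-constant (e.g. quantile-binned) version. A sketch, where the bin count and inputs are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
mu0 = rng.uniform(0.01, 0.99, 10_000)  # stand-in source probabilities mu_0(x)

n_bins = 10
# Interior quantile edges define the bins; each mu_0 value is replaced by
# the mean of its bin, giving a piecewise-constant "binned" mu_0.
edges = np.quantile(mu0, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
bin_idx = np.digitize(mu0, edges)
bin_means = np.array([mu0[bin_idx == k].mean() for k in range(n_bins)])
mu0_binned = bin_means[bin_idx]
```

The binned version preserves the overall mean of $\mu_0$ while being constant within each bin, which is what restores pathwise differentiability of the target.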

Theoretical properties.

Under the assumptions described in Appendix D, we can prove that the estimators for the MEE with respect to fitted detectors are asymptotically linear and their respective tests control the Type I error and have power one, asymptotically. Consequently, for outcome and covariate shifts ($\text{shift} = Y \mid X$ and $\text{shift} = X$), if there is a candidate detailed shift with respect to a variable subset $s^{*,\text{shift}}$ that corresponds to the true shift, it will be flagged by SHIFT, i.e. $P(s^{*,\text{shift}} \notin \hat{\mathcal{S}}_n^{\text{shift}}) \leq \alpha$.

5. Results

We now validate SHIFT in simulation studies where the ground truth is known and two real-world case studies. For comprehensive validation, we vary the type and degree of shifts, the ML algorithms under study, and the data sizes. We present a summary of the results here due to space constraints and provide full experiment details in Section G.

SHIFT.

For all experiments, performance is defined in terms of the 0–1 misclassification loss. We fit ML models (e.g. gradient boosting trees (GBT)) for the nuisance parameters and detectors, with hyperparameters chosen through cross validation. The significance level is set to α=0.05.

Baseline methods.

There is no existing comparator that provides testing for all four types of shifts (aggregate/detailed and covariate/outcome) in the exact formulations used by SHIFT. Given these constraints, different comparators are used for different shift types and, when necessary, adapted to be as comparable as possible.

For aggregate shifts, we compare against the kernel-based tests KCI (Zhang et al., 2011) and MMD (Gretton et al., 2012b). For detailed outcome shifts, we compare against (a) TE-VIM (Hines et al., 2023), which quantifies variable importance for explaining the conditional average treatment effect; (b) ParamY, which fits a parametric regression model of the outcome Y on domain D, features X, and interaction terms D·X, and determines variable importance from the coefficients of the interaction terms; (c) ParamLoss, which is the same as ParamY except that it regresses the loss $\ell$; and (d) KCI (Zhang et al., 2011), a kernel conditional independence test of $\ell \perp D \mid X_s$. For detailed covariate shifts, we compare against (a) KS, the classic Kolmogorov-Smirnov test for comparing two univariate distributions; (b) Score (Kulinski et al., 2020), which detects shifts in $X_s \mid X_{-s}$ via the Fisher score; and (c) KCI (Zhang et al., 2011), a kernel conditional independence test of $X_{-s} \perp D \mid X_s$.
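To see why a marginal baseline like KS can over-flag, note that it responds to any shift in a feature's marginal distribution, whether or not that feature affects the model's loss. A hand-rolled two-sample KS statistic (a numpy stand-in for the standard library implementation) makes this concrete:

```python
import numpy as np

def ks_stat(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: sup_t |F_a(t) - F_b(t)|.
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / a.size
    Fb = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return float(np.abs(Fa - Fb).max())

rng = np.random.default_rng(7)
src = rng.normal(0.0, 1.0, 5_000)

# KS flags a mean-shifted feature even if that feature never enters the model.
d_shifted = ks_stat(src, rng.normal(0.5, 1.0, 5_000))
d_same = ks_stat(src, rng.normal(0.0, 1.0, 5_000))
```

The statistic is large for the shifted feature and small otherwise, regardless of whether the shift has any bearing on performance, which is the weakness the detailed covariate test in SHIFT is designed to avoid.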

5.1. Simulations

Here we illustrate how SHIFT is more powerful and identifies only relevant shifts, i.e. those that contribute performance drifts of magnitude at least $\tau$ in some subgroup with prevalence at least $\epsilon$.

Data generating process.

We generate variables X from a multivariate normal distribution with mean $m_d$ and covariance $\Sigma_d$ for domain d, and a binary outcome Y with logit $\phi_d(x)$. The ML algorithm is a logistic regression model fitted to data from the source domain. We take n = 8000 points from both the source and target domains, and split them into halves for training and evaluation.

Setup 1a/b (Compare agg-level outcome/covariate tests): For $X \in \mathbb{R}^{10}$, the shift only occurs in the subgroup $A = \{x : x_1 \in [-3.5, 3.5]\}$. Setup 1a only shifts the outcome logits per $\phi_1(x) = \phi_0(x) - 0.6\, x_1\, 1\{x \in A\}$; Setup 1b only shifts the mean of the first covariate. To make the tests comparable, SHIFT tests for $\tau = 0$, $\epsilon = 0.05$.

Setup 2 (Compare detailed outcome test): $\phi_0(x) = 0.8 x_1 + 0.5 x_2 + x_3 + 0.6 x_4$ and $\phi_1(x) = 0.2 x_1 + 0.4 x_2 + x_3 + 0.6 x_4$. The outcome shifts with respect to both $X_1$ and $X_2$, but the shift in $X_2$ is minimal and below the tolerance $\tau$. Accuracy drops by 5.9%. SHIFT tests for $\tau = 0.05$, $\epsilon = 0.05$.

Setup 3 (Compare detailed covariate test): $m_0 = (1, 0, 0, 1)$, $\Sigma_0 = \mathrm{diag}(2, 2, 2, 2)$ and $m_1 = (0, 0, 0, 0)$, $\Sigma_1 = \mathrm{diag}(1, 2, 2, 2)$. Both $X_1$ and $X_4$ shift, but $X_4$'s shift is very small and below the tolerance $\tau$. Accuracy drops by 5.4%. SHIFT tests for $\tau = 0.02$, $\epsilon = 0.05$.

SHIFT correctly identifies relevant shifts, achieves nominal type-I error rate, and is consistent.

In Setups 1a/b, the aggregate-level tests in SHIFT are considerably more powerful than KCI and MMD, which are both kernel-based methods that tend to do poorly in high dimensions (Table 2). In contrast, SHIFT takes advantage of flexible ML estimators, which allows it to recover the true subgroup A with reasonable accuracy (73.7% and 41.9% in setups 1a and 1b, respectively). In Setups 2 and 3, the aggregate-level tests in SHIFT also correctly flag outcome shifts (Fig 3a) and covariate shifts (Fig 3b), respectively. At the detailed level, SHIFT correctly flags variable X1 as being a good explanation for the large performance shifts; the others are ignored because they either do not contribute or have negligible impacts. In Appendix I, we also show that SHIFT controls the Type-I error rate and is consistent (asymptotically power-one).
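To fix ideas, the subgroup-recovery step can be caricatured as follows: regress the per-example loss on X in each domain with a flexible learner, then look for an ϵ-prevalence subgroup whose estimated loss increase exceeds τ. This is only a loose sketch of the idea — the paper's actual estimator, debiasing, and inference procedure are more involved — and all data below are synthetic illustrations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def scan_for_decayed_subgroup(X0, loss0, X1, loss1, tau=0.0, eps=0.05):
    """Loose sketch: estimate per-point loss decay mu1(x) - mu0(x) on the
    target sample and return an eps-prevalence subgroup exceeding tau,
    or None if no such subgroup is found."""
    mu0 = GradientBoostingRegressor().fit(X0, loss0)
    mu1 = GradientBoostingRegressor().fit(X1, loss1)
    delta = mu1.predict(X1) - mu0.predict(X1)   # estimated decay at each target point
    cutoff = np.quantile(delta, 1 - eps)        # keep the worst eps-fraction
    if cutoff < tau:
        return None, delta
    return delta >= cutoff, delta

# Synthetic check: loss rises by 0.5 only where x_1 > 1
rng = np.random.default_rng(0)
X0 = rng.normal(size=(3000, 3)); X1 = rng.normal(size=(3000, 3))
loss0 = 0.1 + 0.05 * rng.normal(size=3000)
loss1 = 0.1 + 0.5 * (X1[:, 0] > 1) + 0.05 * rng.normal(size=3000)
mask, delta = scan_for_decayed_subgroup(X0, loss0, X1, loss1, tau=0.1, eps=0.05)
```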

Table 2: Aggregate tests.

Power for detecting outcome or covariate shifts in a subgroup. Power is computed as the rejection rate among 25 random draws of the dataset. We observe that SHIFT has the highest power.

Setup | SHIFT | KCI | MMD
1a Outcome | 0.56 (0.42, 0.7) | 0.26 (0.16, 0.4) | 0.06 (0.02, 0.16)
1b Covariate | 0.94 (0.84, 0.98) | 0.0 (0.0, 0.0) | 0.46 (0.32, 0.6)
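Power entries of this kind are just rejection rates over replicates with a binomial confidence interval. The sketch below uses a 95% Wilson score interval; the paper does not state its exact CI construction, so treat that choice as our assumption.

```python
import math

def power_with_ci(rejections, z=1.96):
    """Rejection rate over replicates plus a 95% Wilson score interval."""
    n = len(rejections)
    p_hat = sum(rejections) / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return p_hat, (center - half, center + half)

# e.g. 14 rejections out of 25 simulated datasets
power, (lo, hi) = power_with_ci([1] * 14 + [0] * 11)
print(round(power, 2), round(lo, 2), round(hi, 2))
```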
Figure 3: Hypothesis testing results for variable(subset)-specific shifts.

SHIFT results are shown in outlined boxes; baselines for covariate and outcome shifts are shown on the bottom left and right, respectively. Null hypotheses either state that a shift should be flagged (†), in which case we flag it in red if the p-value > 0.05 and show the p-value in the colored box, or that a shift should not be flagged (‡), in which case we flag it if the p-value ≤ 0.05 and show 1 − p-value. For the synthetic Setups 2 and 3, we report median p-values over 50 randomly sampled datasets.

Comparators do not flag the correct shifts.

For comparators in the detailed outcome test (Setup 2), we find the following. TE-VIM does not find any variable that explains the heterogeneity of the performance drift, because it behaves erratically at the null. KCI can only check whether the "marginal" distribution of the loss can be explained by individual variables, i.e., whether the loss distribution is independent of D given X_j; this is a very specific type of explanation that does not hold here, so KCI fails to find any good explanation. ParamLoss is a misspecified model and consequently flags none of the variables. ParamY is correctly specified, so it flags both X_1 and X_2 as shifting, which is correct but does not respect the specified tolerance. Similar issues arise in the detailed covariate test (Setup 3): KCI, KS, and Score all incorrectly flag both X_1 and X_4, even though X_4 is irrelevant. This is because they check whether X_4 has shifted, but do not account for the fact that X_4 is neither used by the model nor affects Y in any capacity.
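For reference, a ParamY-style baseline can be sketched as a logistic regression with domain–feature interaction terms, flagging variables whose interactions are significant under a likelihood-ratio test. This is our own construction — the paper's exact parametric specification may differ — and for simplicity only X1's coefficient changes across domains here.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
n, d = 4000, 4
X = rng.normal(size=(n, d))
D = rng.binomial(1, 0.5, size=n)                      # domain indicator
# X1's effect on the outcome differs across domains (cf. Setup 2)
logits = np.where(D == 1, 0.2, 0.8) * X[:, 0] + 0.5 * X[:, 1] + X[:, 2]
Y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

def total_nll(cols):
    # Near-unpenalized logistic fit; returns total negative log-likelihood
    m = LogisticRegression(C=1e6, max_iter=2000).fit(cols, Y)
    return len(Y) * log_loss(Y, m.predict_proba(cols)[:, 1])

full = np.column_stack([X, D, X * D[:, None]])        # cols: X (0..3), D (4), D*X (5..8)
nll_full = total_nll(full)
flagged = []
for j in range(d):  # likelihood-ratio test: drop the D*X_j interaction and refit
    reduced = np.delete(full, d + 1 + j, axis=1)
    stat = 2 * (total_nll(reduced) - nll_full)
    if chi2.sf(max(stat, 0.0), df=1) < 0.05:
        flagged.append(f"X{j+1}")
print(flagged)
```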

5.2. Real-world case studies

Health insurance prediction across states.

We study performance drift of an MLP trained to predict public health insurance coverage using census data from Nebraska, which is subsequently applied to Louisiana. The datasets have 3166 and 12000 points, respectively, and 34 features. Accuracy drops by 13.7% on average. At the aggregate level, SHIFT finds that both outcome and covariate shifts affect subgroup-level accuracy (Fig 3c). For example, accuracy for the subgroup detected by the aggregate outcome test, comprising 50.6% of the target data, decays by 19.4%. Grouping the 34 variables into 5 broad categories, we find from SHIFT's detailed tests that shifts with respect to demographics can explain subgroup-level decay due to both shift types. As in the simulations, the KCI baseline struggles to find a good explanation, while TE-VIM flags only employment-related variables. Based on these findings, we compare three ways to fix the model: a standard non-targeted (Non-T) fix that retrains the model for everyone with respect to all variables; a fix that retrains the model for everyone with respect to only the employment-related variables identified by TE-VIM; and a highly targeted fix that updates the model only for the subgroup and the demographic variables detected by SHIFT (Table 3). We find that the targeted fix outperforms the non-targeted fixes, which inadvertently degrade performance in other subgroups.

Table 3: Comparing model updates on insurance study.

We report AUC and 95% CI for performance of the original model and targeted versus non-targeted model updates, as measured with respect to the overall population (left column) and the subgroups where the original model (Org) and non-targeted model updates (Non-T and TE-VIM) experience large performance decays (right three columns). The targeted update based on SHIFT results performs better along all dimensions.

Model | Overall | Org subgroup | Non-T subgroup | TE-VIM subgroup
Original model, Org | 69.2 (67.0, 71.4) | 59.0 (55.3, 62.3) | – | –
Non-targeted update, Non-T | 73.0 (71.0, 75.0) | 67.3 (64.0, 70.4) | 41.8 (24.6, 59.2) | –
Update as per TE-VIM feats. | 73.0 (71.0, 75.1) | 64.9 (61.6, 67.9) | – | 63.7 (56.6, 70.5)
Targeted update as per SHIFT | 74.8 (72.8, 76.8) | 67.7 (64.4, 70.4) | 66.4 (46.8, 83.6) | 65.0 (57.5, 71.5)
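A targeted update of this kind can be pictured as a gated predictor: the original model is kept everywhere except in the flagged subgroup, where a model refit on the flagged variables takes over. The class and toy data below are our own illustration, not the paper's exact update procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class TargetedUpdate:
    """Use `updated` (refit on the flagged variables) inside the flagged
    subgroup; fall back to the original model everywhere else."""
    def __init__(self, original, updated, in_subgroup, flagged_cols):
        self.original = original
        self.updated = updated
        self.in_subgroup = in_subgroup      # callable: X -> boolean mask
        self.flagged_cols = flagged_cols    # e.g. demographic columns

    def predict_proba(self, X):
        p = self.original.predict_proba(X)[:, 1].copy()
        mask = self.in_subgroup(X)
        if mask.any():
            p[mask] = self.updated.predict_proba(X[np.ix_(mask, self.flagged_cols)])[:, 1]
        return p

# Toy demo: the original model relies on column 1; in the target domain, the
# subgroup |x_0| > 1 instead follows column 0, which the update corrects.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(2000, 3)); ys = (Xs[:, 1] > 0).astype(int)
Xt = rng.normal(size=(2000, 3))
yt = np.where(np.abs(Xt[:, 0]) > 1, Xt[:, 0] > 0, Xt[:, 1] > 0).astype(int)
orig = LogisticRegression().fit(Xs, ys)
mask_t = np.abs(Xt[:, 0]) > 1
upd = LogisticRegression().fit(Xt[np.ix_(mask_t, [0])], yt[mask_t])
combo = TargetedUpdate(orig, upd, lambda X: np.abs(X[:, 0]) > 1, [0])
p = combo.predict_proba(Xt)
acc_orig = orig.score(Xt, yt)
acc_combo = float(np.mean((p > 0.5) == yt))
print(f"original: {acc_orig:.3f}, targeted update: {acc_combo:.3f}")
```

Because the gate only fires inside the flagged subgroup, predictions elsewhere are bit-for-bit those of the original model, which is exactly the "don't fix what ain't broke" property the non-targeted fixes lack.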

Readmission prediction across hospitals.

The clinical AI field has developed numerous models to predict whether patients will have an unplanned readmission after discharge from a hospital, which can be used to allocate extra resources to high-risk patients. We study a GBT readmission model trained on data from an academic hospital and transferred to a safety-net hospital. Since the hospitals serve different populations, the goal is to understand which shifts contributed most to the accuracy changes, such as changes in how patient variables are measured or in how care is delivered. Datasets from the academic and safety-net hospitals have 7468 and 6515 points, respectively, and 27 features. Accuracy decays by 6.1% on average when the model is transferred. SHIFT detects significant changes in subgroup-level accuracy due to both aggregate outcome and covariate shifts (Fig 3d). For instance, the subgroup detected by the aggregate covariate test (comprising 41.8% of the target data) has a 15.4% drop in accuracy. The top feature highlighted by SHIFT for both covariate and outcome shifts is num ED encounters. When the same variable is highlighted for both covariate and outcome shifts, it can indicate that the definition of the variable has shifted. Investigating the data extraction procedure further, we indeed find this to be the case: the encounters feature was extracted differently across the hospitals. After correcting the extraction of this feature, covariate shifts no longer lead to a significant subgroup-level accuracy drop (the p-value for the aggregate covariate test is no longer significant). This illustrates how SHIFT can help bridge accuracy gaps.
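A lightweight way to confirm a definition shift of this kind is to compare the feature's marginal distribution across sites before and after harmonizing the extraction logic. The sketch below uses synthetic counts; the Poisson rates and the "extraction window" story are illustrative, not the actual hospital data.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Illustrative: encounter counts extracted over a 12-month window at one
# hospital but a 6-month window at the other (a definition shift)
ed_counts_a = rng.poisson(4.0, size=5000)        # hospital A
ed_counts_b_raw = rng.poisson(2.0, size=5000)    # hospital B, half the window
ed_counts_b_fixed = rng.poisson(4.0, size=5000)  # hospital B, matched window

before = ks_2samp(ed_counts_a, ed_counts_b_raw)
after = ks_2samp(ed_counts_a, ed_counts_b_fixed)
print(f"before fix: p={before.pvalue:.2g}; after fix: p={after.pvalue:.2g}")
```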

Application to unstructured and high-dimensional data.

Although SHIFT is primarily designed for tabular data, its aggregate-level tests are suitable for analyzing unstructured data; its detailed-level tests can also be used if one has prespecified concepts (Koh et al., 2020). As an example, we apply SHIFT to the CivilComments dataset (Koh et al., 2021), which contains comments on online articles that are judged to be toxic or not. Given 768-dimensional embeddings of the comments, SHIFT detects accuracy drops, as described in Section J.

6. Conclusion

We propose hypothesis tests to identify subgroups where an ML model's performance decays due to distribution shift across two contexts. The tests can also explain how the decay arises by checking for variable-subset-specific shifts that account for it. The tests can be configured to detect only meaningfully large performance decay and can be implemented readily using off-the-shelf ML models. Despite using ML estimators, we show that the tests control the false detection rate and have good power asymptotically. Although the experiments here primarily focus on tabular data, SHIFT can be extended to unstructured data such as images and text by featurizing such data into concepts (Koh et al., 2020; Feng et al., 2024b). Our explorations with text data suggest that SHIFT provides a solid foundation on which future work can build.

Supplementary Material


Impact Statement.

The methods in this work can identify subgroups that experience overly large performance decay when an ML algorithm is transferred across domains or used over time. Results from SHIFT can suggest interventions that improve the affected subgroup's performance without substantively impacting other subgroups. We recommend working with domain experts to define what constitutes a meaningfully large subgroup and performance decay, as these determine what the test will aim to detect. These thresholds affect the interpretation of the tests and the subsequent actions taken to close performance gaps.

Acknowledgements

We would like to thank Lucas Zier, Patrick Vossler, Avni Kothari, and Romain Pirracchio for their helpful input and comments on this work. We are especially grateful to Adarsh Subbaswamy, Nicholas Petrick, and Gene Pennello, who provided invaluable feedback on the project from its inception to completion. We thank them for their tireless commitment. This work was funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (ME-2022C125619). The views presented in this work are solely the responsibility of the author(s) and do not necessarily represent the views of the PCORI®, its Board of Governors or Methodology Committee, and the Food and Drug Administration. JCH acknowledges support from the National Cancer Institute of the National Institutes of Health (R01CA277782), which had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Footnotes

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1

Code is available at http://github.com/jjfeng/shift.

2

Appendix C.3 discusses a more statistically efficient but more complex procedure involving the Maximum conditional Expectation of the Exceedence (McEE) rather than the MEE. We discuss testing of the MEE in the main manuscript for ease of exposition.

References

  1. Ali A, Cauchois M, and Duchi JC. The lifecycle of a statistical model: Model failure detection, identification, and refitting, 2022. URL https://arxiv.org/abs/2202.04166.
  2. Athey S, Tibshirani J, and Wager S. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019. doi: 10.1214/18-AOS1709. URL https://doi.org/10.1214/18-AOS1709.
  3. Baron RM and Kenny DA. The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6):1173–1182, 1986. URL https://api.semanticscholar.org/CorpusID:1925599.
  4. Belloni A and Chernozhukov V. l1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130, 2011. doi: 10.1214/10-AOS827. URL https://doi.org/10.1214/10-AOS827.
  5. Budhathoki K, Janzing D, Bloebaum P, and Ng H. Why did the distribution change? In Banerjee A and Fukumizu K (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 1666–1674. PMLR, 13–15 Apr 2021. URL https://proceedings.mlr.press/v130/budhathoki21a.html.
  6. Cai TT, Namkoong H, and Yadlowsky S. Diagnosing model performance under distribution shift. March 2023. URL http://arxiv.org/abs/2303.02011.
  7. Castro DC, Walker I, and Glocker B. Causality matters in medical imaging. Nat. Commun, 11(1):3673, July 2020.
  8. Chernozhukov V, Chetverikov D, and Kato K. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013. doi: 10.1214/13-AOS1161. URL https://doi.org/10.1214/13-AOS1161.
  9. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, and Robins J. Double/debiased machine learning for treatment and structural parameters. Econom. J, 21(1):C1–C68, February 2018.
  10. Cvach M. Monitor alarm fatigue: an integrative review. Biomed. Instrum. Technol, 46(4):268–277, 2012.
  11. d'Eon G, d'Eon J, Wright JR, and Leyton-Brown K. The spotlight: A general method for discovering systematic errors in deep learning models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT '22, pp. 1962–1981, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522. doi: 10.1145/3531146.3533240. URL https://doi.org/10.1145/3531146.3533240.
  12. Ding F, Hardt M, Miller J, and Schmidt L. Retiring adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems, 34, 2021.
  13. Dong S, Wang Q, Sahri S, Palpanas T, and Srivastava D. Efficiently mitigating the impact of data drift on machine learning pipelines. Proc. VLDB Endow, 17(11):3072–3081, August 2024. ISSN 2150-8097. doi: 10.14778/3681954.3681984. URL https://doi.org/10.14778/3681954.3681984.
  14. Eyuboglu S, Varma M, Saab KK, Delbrouck J-B, Lee-Messer C, Dunnmon J, Zou J, and Re C. Domino: Discovering systematic errors with cross-modal embeddings. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=FPCMqjI0jXN.
  15. Feng J, Gossmann A, Pirracchio R, Petrick N, Pennello GA, and Sahiner B. Is this model reliable for everyone? Testing for strong calibration. In Dasgupta S, Mandt S, and Li Y (eds.), Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pp. 181–189. PMLR, 02–04 May 2024a. URL https://proceedings.mlr.press/v238/feng24a.html.
  16. Feng J, Kothari A, Zier L, Singh C, and Tan YS. Bayesian concept bottleneck models with LLM priors. NeurIPS Workshop on Statistical Frontiers in LLMs and Foundation Models, October 2024b.
  17. Feng J, Xia F, Singh K, and Pirracchio R. Not all clinical AI monitoring systems are created equal: Review and recommendations. NEJM AI, 2(2), January 2025.
  18. Finlayson SG, Subbaswamy A, Singh K, Bowers J, Kupke A, Zittrain J, Kohane IS, and Saria S. The clinician and dataset shift in artificial intelligence. New England Journal of Medicine, 385(3):283–286, 2021. doi: 10.1056/NEJMc2104626. URL https://www.nejm.org/doi/full/10.1056/NEJMc2104626.
  19. Ghosh B, Malioutov D, and Meel KS. Efficient learning of interpretable classification rules. Journal of Artificial Intelligence Research, 74:1823–1863, 2022.
  20. Globus-Harris I, Kearns M, and Roth A. An algorithmic framework for bias bounties. In 2022 ACM Conference on Fairness, Accountability, and Transparency, New York, NY, USA, June 2022. ACM.
  21. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, and Smola A. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012a. URL http://jmlr.org/papers/v13/gretton12a.html.
  22. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, and Smola A. A kernel two-sample test. J. Mach. Learn. Res, 13(25):723–773, 2012b.
  23. Hebert-Johnson U, Kim M, Reingold O, and Rothblum G. Multicalibration: Calibration for the (computationally-identifiable) masses. In Dy J and Krause A (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1939–1948. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/hebert-johnson18a.html.
  24. Hindy A, Luo R, Banerjee S, Kuck J, Schmerling E, and Pavone M. Diagnostic runtime monitoring with martingales, 2024. URL https://arxiv.org/abs/2407.21748.
  25. Hines O, Diaz-Ordaz K, and Vansteelandt S. Variable importance measures for heterogeneous causal effects, 2023.
  26. Hsu Y. Consistent tests for conditional treatment effects. The Econometrics Journal, 20(1):1–22, March 2017. ISSN 1368-4221. doi: 10.1111/ectj.12077. URL https://doi.org/10.1111/ectj.12077.
  27. Hudson A. Nonparametric inference on non-negative dissimilarity measures at the boundary of the parameter space, 2023. URL https://arxiv.org/abs/2306.07492.
  28. Kearns M, Neel S, Roth A, and Wu ZS. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Dy J and Krause A (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2564–2572. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/kearns18a.html.
  29. Kennedy EH. Semiparametric doubly robust targeted double machine learning: A review. In Handbook of Statistical Methods for Precision Medicine, pp. 207–236. Chapman and Hall/CRC, Boca Raton, 1st edition, October 2024.
  30. Kim MP, Ghorbani A, and Zou J. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES '19, pp. 247–254, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450363242. doi: 10.1145/3306618.3314287. URL https://doi.org/10.1145/3306618.3314287.
  31. Koh PW, Nguyen T, Tang YS, Mussmann S, Pierson E, Kim B, and Liang P. Concept bottleneck models. In III HD and Singh A (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5338–5348. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/koh20a.html.
  32. Koh PW, Sagawa S, Marklund H, Xie SM, Zhang M, Balsubramani A, Hu W, Yasunaga M, Phillips RL, Gao I, Lee T, David E, Stavness I, Guo W, Earnshaw B, Haque I, Beery SM, Leskovec J, Kundaje A, Pierson E, Levine S, Finn C, and Liang P. WILDS: A benchmark of in-the-wild distribution shifts. In Meila M and Zhang T (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 5637–5664. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/koh21a.html.
  33. Kulinski S, Bagchi S, and Inouye DI. Feature shift detection: Localizing which features have shifted via conditional distribution tests. In Larochelle H, Ranzato M, Hadsell R, Balcan M, and Lin H (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 19523–19533. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/e2d52448d36918c575fa79d88647ba66-Paper.pdf.
  34. Lakkaraju H, Bach SH, and Leskovec J. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 1675–1684, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450342322. doi: 10.1145/2939672.2939874. URL https://doi.org/10.1145/2939672.2939874.
  35. Liu J, Wang T, Cui P, and Namkoong H. On the need for a language describing distribution shifts: Illustrations on tabular datasets. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=PFOlxayYST.
  36. Loh WW, Moerkerke B, Loeys T, and Vansteelandt S. Heterogeneous indirect effects for multiple mediators using interventional effect models. Epidemiol. Method, 9(1), January 2020.
  37. Luedtke A, Carone M, and van der Laan MJ. An omnibus non-parametric test of equality in distribution for unknown functions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 81(1):75–99, November 2018. ISSN 1369-7412. doi: 10.1111/rssb.12299. URL https://doi.org/10.1111/rssb.12299.
  38. Mitchell S, Potash E, Barocas S, D'Amour A, and Lum K. Algorithmic fairness: Choices, assumptions, and definitions. Annu. Rev. Stat. Appl, 8(1):141–163, March 2021.
  39. OpenAI: Hurst A, Lerer A, Goucher AP, Perelman A, Ramesh A, Clark A, et al. GPT-4o system card, 2024. URL https://arxiv.org/abs/2410.21276. Accessed on March 28, 2025.
  40. Panda P, Kancheti SS, Balasubramanian VN, and Sinha G. Interpretable model drift detection. In Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD), CODS-COMAD '24, pp. 1–9, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400716348. doi: 10.1145/3632410.3632434. URL https://doi.org/10.1145/3632410.3632434.
  41. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, and Duchesnay E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  42. Podkopaev A and Ramdas A. Tracking the risk of a deployed model and detecting harmful distribution shifts. In International Conference on Learning Representations, 2022.
  43. Quinonero-Candela J, Sugiyama M, Schwaighofer A, and Lawrence ND. Dataset Shift in Machine Learning. The MIT Press, 2009.
  44. Quintas-Martinez V, Bahadori MT, Santiago E, Mu J, and Heckerman D. Multiply-robust causal change attribution. In Salakhutdinov R, Kolter Z, Heller K, Weller A, Oliver N, Scarlett J, and Berkenkamp F (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 41821–41840. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/quintas-martinez24a.html.
  45. Quinzan F, Soleymani A, Jaillet P, Rojas CR, and Bauer S. DRCFS: Doubly robust causal feature selection. In Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, and Scarlett J (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 28468–28491. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/quinzan23a.html.
  46. Rabanser S, Günnemann S, and Lipton Z. Failing loudly: An empirical study of methods for detecting dataset shift. In Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, and Garnett R (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/846c260d715e5b854ffad5f70a516c88-Paper.pdf.
  47. Rauba P, Seedat N, Luyten MR, and van der Schaar M. Context-aware testing: A new paradigm for model testing with large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=d75qCZb7TX.
  48. Rubinstein M, Branson Z, and Kennedy EH. Heterogeneous interventional effects with multiple mediators: Semiparametric and nonparametric approaches. J. Causal Inference, 11(1), July 2023.
  49. Sanh V, Debut L, Chaumond J, and Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019.
  50. Singh H, Xia F, Subbaswamy A, Gossmann A, and Feng J. A hierarchical decomposition for explaining ML performance discrepancies. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, September 2024.
  51. Srivastava M, Nushi B, Kamar E, Shah S, and Horvitz E. An empirical analysis of backward compatibility in machine learning systems. In KDD, August 2020.
  52. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer, New York, NY, 2009.
  53. Subbaswamy A, Sahiner B, Petrick N, Pai V, Adams R, Diamond MC, and Saria S. A data-driven framework for identifying patient subgroups on which an AI/machine learning model may underperform. NPJ Digit. Med, 7(1):334, November 2024.
  54. Sugiyama M, Nakajima S, Kashima H, Buenau P, and Kawanabe M. Direct importance estimation with model selection and its application to covariate shift adaptation. Adv. Neural Inf. Process. Syst, 2007.
  55. Suriyakumar VM, Ghassemi M, and Ustun B. When personalization harms performance: Reconsidering the use of group attributes in prediction. Proc. Int. Conf. Mach. Learn, 2023.
  56. van der Vaart AW. Asymptotic Statistics. Cambridge University Press, October 1998.
  57. Wager S and Walther G. Adaptive concentration of regression trees, with application to random forests. arXiv preprint arXiv:1503.06388, 2015.
  58. Williamson BD, Gilbert PB, Carone M, and Simon N. Nonparametric variable importance assessment using machine learning techniques. Biometrics, 77(1):9–22, 2021. doi: 10.1111/biom.13392. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13392.
  59. Wu E, Wu K, and Zou J. Explaining medical AI performance disparities across sites with confounder Shapley value analysis. November 2021. URL http://arxiv.org/abs/2111.08168.
  60. Yang Y, Zhang H, Katabi D, and Ghassemi M. Change is hard: A closer look at subpopulation shift. In Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, and Scarlett J (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 39584–39622. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/yang23s.html.
  61. Zahn M. v., Hinz O, and Feuerriegel S. Locating disparities in machine learning. In 2023 IEEE International Conference on Big Data (BigData), pp. 1883–1894, 2023. doi: 10.1109/BigData59044.2023.10386485.
  62. Zhang H, Singh H, Ghassemi M, and Joshi S. "Why did the model fail?": Attributing model performance changes to distribution shifts. In Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, and Scarlett J (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 41550–41578. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/zhang23ai.html.
  63. Zhang K, Peters J, Janzing D, and Schölkopf B. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI'11, pp. 804–813, Arlington, Virginia, USA, 2011. AUAI Press. ISBN 9780974903972.
