All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously

Aaron Fisher; Cynthia Rudin; Francesca Dominici

. Author manuscript; available in PMC: 2021 Jul 30.

Published in final edited form as: J Mach Learn Res. 2019;20:177.

All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously

Aaron Fisher ¹, Cynthia Rudin ², Francesca Dominici ³

PMCID: PMC8323609 NIHMSID: NIHMS1670270 PMID: 34335110

Abstract

Variable importance (VI) tools describe how much covariates contribute to a prediction model’s accuracy. However, important variables for one well-performing model (for example, a linear model f (x) = x^T β with a fixed coefficient vector β) may be unimportant for another model. In this paper, we propose model class reliance (MCR) as the range of VI values across all well-performing model in a prespecified class. Thus, MCR gives a more comprehensive description of importance by accounting for the fact that many prediction models, possibly of different parametric forms, may fit the data well. In the process of deriving MCR, we show several informative results for permutation-based VI estimates, based on the VI measures used in Random Forests. Specifically, we derive connections between permutation importance estimates for a single prediction model, U-statistics, conditional variable importance, conditional causal effects, and linear model coefficients. We then give probabilistic bounds for MCR, using a novel, generalizable technique. We apply MCR to a public data set of Broward County criminal records to study the reliance of recidivism prediction models on sex and race. In this application, MCR can be used to help inform VI for unknown, proprietary models.

Keywords: Rashomon, permutation importance, conditional variable importance, U-statistics, transparency, interpretable models

1. Introduction

Variable importance (VI) tools describe how much a prediction model’s accuracy depends on the information in each covariate. For example, in Random Forests, VI is measured by the decrease in prediction accuracy when a covariate is permuted (Breiman, 2001; Breiman et al., 2001; see also Strobl et al., 2008; Altmann et al., 2010; Zhu et al., 2015; Gregorutti et al., 2015; Datta et al., 2016; Gregorutti et al., 2017). A similar “Perturb” VI measure has been used for neural networks, where noise is added to covariates (Recknagel et al., 1997; Yao et al., 1998; Scardi and Harding, 1999; Gevrey et al., 2003). Such tools can be useful for identifying covariates that must be measured with high precision, for improving the transparency of a “black box” prediction model (see also Rudin, 2019), or for determining what scenarios may cause the model to fail.

However, existing VI measures do not generally account for the fact that many prediction models may fit the data almost equally well. In such cases, the model used by one analyst may rely on entirely different covariate information than the model used by another analyst. This common scenario has been called the “Rashomon” effect of statistics (Breiman et al., 2001; see also Lecué, 2011; Statnikov et al., 2013; Tulabandhula and Rudin, 2014; Nevo and Ritov, 2017; Letham et al., 2016). The term is inspired by the 1950 Kurosawa film of the same name, in which four witnesses offer different descriptions and explanations for the same encounter. Under the Rashomon effect, how should analysts give comprehensive descriptions of the importance of each covariate? How well can one analyst recover the conclusions of another? Will the model that gives the best predictions necessarily give the most accurate interpretation?

To address these concerns, we analyze the set of prediction models that provide near-optimal accuracy, which we refer to as a Rashomon set. This approach stands in contrast to training to select a single prediction model, among a prespecified class of candidate models. Our motivation is that Rashomon sets (defined formally below) summarize the range of effective prediction strategies that an analyst might choose. Additionally, even if the candidate models do not contain the true data generating process, we may hope that some of these models function in similar ways to the data generating process. In particular, we may hope there exist well performing candidate models that place the same importance on a variable of interest as the underlying data generating process does. If so, then studying sets of well-performing models will allow us to deduce information about the data generating process.

Applying this approach to study variable importance, we define model class reliance (MCR) as the highest and lowest degree to which any well-performing model within a given class may rely on a variable of interest for prediction accuracy. Roughly speaking, MCR captures the range of explanations, or mechanisms, associated with well-performing models. Because the resulting range summarizes many prediction models simultaneously, rather a single model, we expect this range to be less affected by the choices that an individual analyst makes during the model-fitting process. Instead of reflecting these choices, MCR aims to reflect the nature of the prediction problem itself.

We make several, specific technical contributions in deriving MCR. First, we review a core measure of how much an individual prediction model relies on covariates of interest for its accuracy, which we call model reliance (MR). This measure is based on permutation importance measures for Random Forests (Breiman et al., 2001; Breiman, 2001), and can be expanded to describe conditional importance (see Section 8, as well as Strobl et al. 2008). We draw a connection between permutation-based importance estimates (MR) and U-statistics, which facilitates later theoretical results. Additionally, we derive connections between MR, conditional causal effects, and coefficients for additive models. Expanding on MR, we propose MCR, which generalizes the definition of MR for a class of models. We derive finite-sample bounds for MCR, which motivate an intuitive estimator of MCR. Finally, we propose computational procedures for this estimator.

The tools we develop to study Rashomon sets are quite general, and can be used to make finite-sample inferences for arbitrary characteristics of well-performing models. For example, beyond describing variable importance, these tools can describe the range of risk predictions that well-fitting models assign to a particular covariate profile, or the variance of predictions made by well-fitting models. In some cases, these novel techniques may provide finite-sample confidence intervals (CIs) where none have previously existed (see Section 5).

MCR and the Rashomon effect become especially relevant in the context of criminal recidivism prediction. Proprietary recidivism risk models trained from criminal records data are increasingly being used in U.S. courtrooms. One concern is that these models may be relying on information that would otherwise be considered unacceptable (for example, race, sex, or proxies for these variables), in order to estimate recidivism risk. The relevant models are often proprietary, and cannot be studied directly. Still, in cases where the predictions made by these models are publicly available, it may be possible to identify alternative prediction models that are sufficiently similar to the proprietary model of interest.

In this paper, we specifically consider the proprietary model COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), developed by the company Northpointe Inc. (subsequently, in 2017, Northpointe Inc.,Courtview Justice Solutions Inc., and Constellation Justice Systems Inc. joined together under the name Equivant). Our goal is to estimate how much COMPAS relies on either race, sex, or proxies for these variables not measured in our data set. To this end, we apply a broad class of flexible, kernel-based prediction models to predict COMPAS score. In this setting, the MCR interval reflects the highest and lowest degree to which any prediction model in our class can rely on race and sex while still predicting COMPAS score relatively accurately. Equipped with MCR, we can relax the common assumption of being able to correctly specify the unknown model of interest (here, COMPAS) up to a parametric form. Instead, rather than assuming that the COMPAS model itself is contained in our class, we assume that our class contains at least one well-performing alternative model that relies on sensitive covariates to the same degree that COMPAS does. Under this assumption, the MCR interval will contain the VI value for COMPAS. Applying our approach, we find that race, sex, and their potential proxy variables, are likely not the dominant predictive factors in the COMPAS score (see analysis and discussion in Section 10).

The remainder of this paper is organized as follows. In Section 2 we introduce notation, and give a high level summary of our approach, illustrated with visualizations. In Sections 3 and 4 we formally present MR and MCR respectively, and derive theoretical properties of each. We also review related variable importance practices in the literature, such as retraining a model after removing one of the covariates. In Section 5, we discuss general applicability of our approach for determining finite-sample CIs for other problems. In Section 6, we present a general procedure for computing MCR. In Section 7, we give specific implementations of this procedure for (regularized) linear models, and linear models in a reproducing kernel Hilbert space. We also show that, for additive models, MR can be expressed in terms of the model’s coefficients. In Section 8 we outline connections between MR, causal inference, and conditional variable importance. In Section 9, we illustrate MR and MCR with a simulated toy example, to aid intuition. We also present simulation studies for the task of estimating MR for an unknown, underlying conditional expectation function, under misspecification. We analyze a well-known public data set on recidivism in Section 10, described above. All proofs are presented in the appendices.

2. Notation & Technical Summary

The label of “variable importance” measure has been broadly used to describe approaches for either inference (van der Laan, 2006; Díaz et al., 2015; Williamson et al., 2017) or prediction. While these two goals are highly related, we primarily focus on how much prediction models rely on covariates to achieve accuracy. We use terms such as “model reliance” rather than “importance” to clarify this context.

In order to evaluate how much prediction models rely on variables, we now introduce notation for random variables, data, classes of prediction models, and loss functions for evaluating predictions. Let $Z = (Y, X_{1}, X_{2}) \in Z$ be a random variable with outcome $Y \in Y$ and covariates $X = (X_{1}, X_{2}) \in X$ , where the covariate subsets $X_{1} \in X_{1}$ and $X_{2} \in X_{2}$ may each be multivariate. We assume that observations of Z are iid, that n ≥ 2, and that solutions to arg min and arg max operations exist whenever optimizing over sets mentioned in this paper (for example, in Theorem 4, below). Our goal is to study how much different prediction models rely on X₁ to predict Y.

We refer to our data set as Z = [ y X ], a matrix composed of a n-length outcome vector y in the first column, and a n × p covariate matrix X = [ X₁ X₂ ] in the remaining columns. In general, for a given vector v, let v_[j] denote its j^th element(s). For a given matrix A, let A′, A_[i,·], A_[·,j] and A_[i,j] respectively denote the transpose of A, the i^th row(s) of A, the j^th column(s) of A, and the element(s) in the i^th row(s) and j^th column(s) of A.

We use the term model class to refer to a prespecified subset $F \subset {f | f : X \to Y}$ of the measurable functions from $X$ to $Y$ . We refer to member functions $f \in F$ as prediction models, or simply as models. Given a model f, we evaluate its performance using a nonnegative loss function $L : (F \times Z) \to R_{\geq 0}$ . For example, L may be the squared error loss L_se(f, (y, x₁, x₂)) = (y − f (x₁, x₂))² for regression, or the hinge loss L_h(f, (y, x₁, x₂)) = (1 − y f (x₁, x₂))₊ for classification. We use the term algorithm to refer to any procedure $A : Z^{n} \to F$ that takes a data set as input and returns a model $f \in F$ as output.

2.1. Summary of Rashomon Sets & Model Class Reliance

Many traditional statistical estimates come from descriptions of a single, fitted prediction model. In contrast, in this section, we summarize our approach for studying a set of near-optimal models. To define this set, we require a prespecified “reference” model, denoted by f_ref, to serve as a benchmark for predictive performance. For example, f_ref may come from a flowchart used to predict injury severity in a hospital’s emergency room, or from another quantitative decision rule that is currently implemented in practice. Given a reference model f_ref, we define a population ϵ-Rashomon set as the subset of models with expected loss no more than ϵ above that of f_ref. We denote this set as $R (ϵ) : = {f \in F : E L (f, Z) \leq E L (f_{ref}, Z) + ϵ}$ , where $E$ denotes expectations with respect to the population distribution. This set can be thought of as representing models that might be arrived at due to differences in data measurement, processing, filtering, model parameterization, covariate selection, or other analysis choices (see Section 4).

Figure 1–A illustrates a hypothetical example of a population ϵ-Rashomon set. Here, the y-axis shows the expected loss of each model $f \in F$ , and the x-axis shows how much each model relies on X₁ for its predictive accuracy. More specifically, given a prediction model f, the x-axis shows the percent increase in f’s expected loss when noise is added to X₁. We refer to this measure as the model reliance (MR) of f on X₁, written informally as

M R (f) : = \frac{Expected loss of f under noise}{Expected loss of f without noise} .

(2.1)

The added noise must satisfy certain properties, namely, it must render X₁ completely uninformative of the outcome Y, without altering the marginal distribution of X₁ (for details, see Section 3, as well as Breiman, 2001; Breiman et al., 2001).

Our central goal is to understand how much, or how little, models may rely on covariates of interest (X₁) while still predicting well. In Figure 1–A, this range of possible MR values is shown by the highlighted interval along the x-axis. We refer to an interval of this type as a population-level model class reliance (MCR) range (see Section 4), formally defined as

[{M C R}_{-} (ϵ), {M C R}_{+} (ϵ)] : = [\min_{f \in R (ϵ)} M R (f), \max_{f \in R (ϵ)} M R (f)] .

(2.2)

To estimate this range, we use empirical analogues of the population ϵ-Rashomon set, and of MR, based on observed data (Figure 1–B). We define an empirical ϵ-Rashomon set as the set of models with in-sample loss no more than ϵ above that of f_ref, and denote this set by $\hat{R} (ϵ)$ . Informally, we define the empirical MR of a model f on X₁ as

\hat{M R} (f) : = \frac{In-sample loss of f under noise}{In-sample loss of f without noise},

(2.3)

that is, the extent to which f appears to rely on X₁ in a given sample (see Section 3 for details). Finally, we define the empirical model class reliance as the range of empirical MR values corresponding to models with strong in-sample performance (see Section 4), formally written as

[{\hat{M C R}}_{-} (ϵ), {\hat{M C R}}_{+} (ϵ)] : = [\min_{f \in \hat{R} (ϵ)} \hat{M R} (f), \max_{f \in \hat{R} (ϵ)} \hat{M R} (f)] .

(2.4)

In Figure 1–B, the above range is shown by the highlighted portion of the x-axis.

We make several technical contributions in the process of developing MCR.

Estimation of MR, and population-level MCR: Given f, we show desirable properties of $\hat{M R} (f)$ as an estimator of MR(f), using results for U-statistics (Section 3.1 and Theorem 5). We also derive finite sample bounds for population-level MCR, some of which require a limit on the complexity of $F$ in the form of a covering number. These bounds demonstrate that, under fairly weak conditions, empirical MCR provides a sensible estimate of population-level MCR (see Section 4 for details).
Computation of empirical MCR: Although empirical MCR is fully determined given a sample, the minimization and maximization in Eq 2.4 require nontrivial computations. To address this, we outline a general optimization procedure for MCR (Section 6). We give detailed implementations of this procedure for cases when the model class $F$ is a set of (regularized) linear regression models, or a set of regression models in a reproducing kernel Hilbert space (Section 7). The output of our proposed procedure is a closed-form, convex envelope containing $F$ , which can be used to approximate empirical MCR for any performance level ϵ (see Figure 2 for an illustration). Still, for complex model classes where standard empirical loss minimization is an open problem (for example, neural networks), computing empirical MCR remains an open problem as well.
Interpretation of MR in terms of model coefficients, and causal effects: We show that MR for an additive model can be written as a function of the model’s coefficients (Proposition 15), and that MR for a binary covariate X₁ can be written as a function of the conditional causal effects of X₁ on Y (Proposition 19).
Extensions to conditional importance: We provide an extension of MR that is analogous to the notion of conditional importance (Strobl et al., 2008). This extension describes how much a model relies on the specific information in X₁ that cannot otherwise be gleaned from X₂ (Section 8.2).
Generalizations for Rashomon sets: Beyond notions of variable importance, we also generalize our finite sample results for MCR to describe arbitrary characterizations of models in a population ϵ-Rashomon set. As we discuss in concurrent work (Coker et al., 2018), this generalization is analogous to the profile likelihood interval, and can, for example, be used to bound the range of risk predictions that well-performing prediction models may assign to a particular set of covariates (Section 5).

Figure 2: — Illustration of output from our empirical MCR computational procedure – Our computation procedure produces a closed-form, convex envelope that contains $F$ (shown above as the solid, purple line), which bounds empirical MCR for any value of ϵ (see Eq 2.4). The procedure works sequentially, tightening these bounds as much as possible near the ϵ value of interest (Section 6). The results from our data analysis (Figure 8) are presented in the same format as the above purple envelope.

We begin in the next section by formally reviewing model reliance.

3. Model Reliance

To formally describe how much the expected accuracy of a fixed prediction model f relies on the random variable X₁, we use the notion of a “switched” loss where X₁ is rendered uninformative. Throughout this section, we will treat f as a pre-specified prediction model of interest (as in Hooker, 2007). Let $Z^{(a)} = (Y^{(a)}, X_{1}^{(a)}, X_{2}^{(a)})$ and $Z^{(b)} = (Y^{(b)}, X_{1}^{(b)}, X_{2}^{(b)})$ be independent random variables, each following the same distribution as Z = (Y, X₁, X₂). We define

e_{s w i t c h} (f) : = E L {f, (Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)})}

as representing the expected loss of model f across pairs of observations $(Z^{(a)}, Z^{(b)})$ in which the values of $X_{1}^{(a)}$ and $X_{1}^{(b)}$ have been switched. To see this interpretation of the above equation, note that we have used the variables $(Y^{(b)}, X_{2}^{(b)})$ from $Z^{(b)}$ , but we have used the variable $X_{1}^{(b)}$ from an independent copy Z^(b). This is why we say that $X_{1}^{(a)}$ and $X_{1}^{(b)}$ have been switched; the values of ( $Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)}$ ) do not relate to each other as they would if they had been chosen together. An alternative interpretation of e_switch(f) is as the expected loss of f when noise is added to X₁ in such a way that X₁ becomes completely uninformative of Y, but that the marginal distribution of X₁ is unchanged.

As a reference point, we compare e_switch(f) against the standard expected loss when none of the variables are switched, $e_{o r i g} (f) : = E L (f, (Y, X_{1}, X_{2})) .$ From these two quantities, we formally define model reliance (MR) as the ratio,

M R (f) : = \frac{e_{switch} (f)}{e_{orig} (f)},

(3.1)

as we alluded to in Eq 2.1. Higher values of MR(f) signify greater reliance of f on X₁. For example, an MR(f) value of 2 means that the model relies heavily on X₁, in the sense that its loss doubles when X₁ is scrambled. An MR(f) value of 1 signifies no reliance on X₁, in the sense that the model’s loss does not change when X₁ is scrambled. Models with reliance values strictly less than 1 are more difficult to interpret, as they rely less on the variable of interest than a random guess. Interestingly, it is possible to have models with reliance less than one. For instance, a model f′ may satisfy MR(f′) < 1 if it treats X₁ and Y as positively correlated when they are in fact negatively correlated. However, in many cases, the existence of a model $f^{'} \in F$ satisfying $M R (f^{'}) < 1$ implies the existence of another, better performing model $f^{″} \in F$ satisfying $M R (f^{″}) < 1$ and e_orig(f″) ≤ e_orig(f′). That is, although models may exist with MR values less than 1, they will typically be suboptimal (see Appendix A.2).

Model reliance could alternatively be defined as a difference rather than a ratio, that is, as ${M R}_{difference} (f) : = e_{switch} (f) - e_{orig} (f) .$ In Appendix A.5, we discuss how many of our results remain similar under either definition.

3.1. Estimating Model Reliance with U-statistics, and Connections to Permutation-based Variable Importance

Given a model f and data set $Z = [y X]$ , we estimate MR(f) by separately estimating the numerator and denominator of Eq 3.1. We estimate e_orig(f) with the standard empirical loss,

{\hat{e}}_{orig} (f) : = \frac{1}{n} \sum_{i = 1}^{n} L {f, (y_{[i]}, X_{1 [i, \cdot]}, X_{2 [i, \cdot]})} .

(3.2)

We estimate e_switch(f) by performing a “switch” operation across all observed pairs, as in

{\hat{e}}_{switch} (f) : = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} \sum_{j \neq i} L {f, (y_{[j]}, X_{1 [i, \cdot]}, X_{2 [j, \cdot]})} .

(3.3)

Above, we have aggregated over all possible combinations of the observed values for (Y, X₂) and for X₁, excluding pairings that are actually observed in the original sample. If the summation over all possible pairs (Eq 3.3) is computationally prohibitive due to sample size, another estimator of e_switch(f) is

{\hat{e}}_{divide} (f) : = \frac{1}{2 ⌊ n / 2 ⌋} \sum_{i = 1}^{⌊ n / 2 ⌋} [L {f, (y_{[i]}, X_{1 [i + ⌊ n / 2 ⌋, \cdot]}, X_{2 [i, \cdot]})}

(3.4)

+ L {f, (y_{[i + ⌊ n / 2 ⌋]}, X_{1 [i, \cdot]}, X_{2 [i + ⌊ n / 2 ⌋, \cdot]})}] .

(3.5)

Here, rather than summing over all pairs, we divide the sample in half. We then match the first half’s values for (Y, X₂) with the second half’s values for X₁ (Line 3.4), and vice versa (Line 3.5). All three of the above estimators (Eqs 3.2, 3.3 & 3.5) are unbiased for their respective estimands, as we discuss in more detail shortly.

Finally, we can estimate MR(f) with the plug-in estimator

\hat{M R} (f) : = \frac{{\hat{e}}_{switch} (f)}{{\hat{e}}_{orig} (f)},

(3.6)

which we define as the empirical model reliance of f on X₁. In this way, we formalize the empirical MR definition in Eq 2.3.

Again, our definition of empirical MR is very similar to the permutation-based variable importance approach of Breiman (2001), where Breiman uses a single random permutation and we consider all possible pairs. To compare these two approaches more precisely, let (π₁, … , π_n!} be a set of n-length vectors, each containing a different permutation of the set (1, …, n}. The approach of Breiman (2001) is analogous to computing the loss $\sum_{i = 1}^{n} L {f, (y_{[i]}, X_{1 [π_{l_{[i]}}, .]}, X_{2 [i, .]})}$ for a randomly chosen permutation vector π_l ∈ {π₁, … , π_n!}. Similarly, our calculation in Eq 3.3 is proportional to the sum of losses over all possible (n!) permutations, excluding the n unique combinations of the rows of X₁ and the rows of $[X_{2} y]$ that appear in the original sample (see Appendix A.3). Excluding these observations is necessary to preserve the (finite-sample) unbiasedness of ${\hat{e}}_{switch} (f)$ .

The estimators ${\hat{e}}_{orig} (f)$ , ${\hat{e}}_{switch} (f)$ and ${\hat{e}}_{divide} (f)$ all belong to the well-studied class of U-statistics. Thus, under fairly minor conditions, these estimators are unbiased, asymptotically normal, and have finite-sample probabilistic bounds (Hoeffding, 1948, 1963; Serfling, 1980; see also DeLong et al., 1988 for an early use of U-statistics in machine learning, as well as caveats in Demler et al., 2012). To our knowledge, connections between permutation-based importance and U-statistics have not been previously established.

While the above results from U-statistics depend on the model f being fixed a priori, we can also leverage these results to create uniform bounds on the MR estimation error for all models in a sufficiently regularized class $F$ . We formally present this bound in Section 4 (Theorem 5), after introducing required conditions on model class complexity. The existence of this uniform bound implies that it is feasible to train a model and to evaluate its importance using the same data. This differs from the classical VI approach of Random Forests (Breiman, 2001), which avoids in-sample importance estimation. There, each tree in the ensemble is fit on a random subset of data, and VI for the tree is estimated using the held-out data. The tree-specific VI estimates are then aggregated to obtain a VI estimate for the overall ensemble. Although sample-splitting approaches such as this are helpful in many cases, the uniform bound for MR suggests that they are not strictly necessary, depending on the sample size and the complexity of $F$ .

3.2. Limitations of Existing Variable Importance Methods

Several common approaches for variable selection, or for describing relationships between variables, do not necessarily capture a variable’s importance. Null hypothesis testing methods may identify a relationship, but do not describe the relationship’s strength. Similarly, checking whether a variable is included by a sparse model-fitting algorithm, such as the Lasso (Hastie et al., 2009), does not describe the extent to which the variable is relied on. Partial dependence plots (Breiman et al., 2001; Hastie et al., 2009) can be difficult to interpret if multiple variables are of interest, or if the prediction model contains interaction effects.

Another common VI procedure is to run a model-fitting algorithm twice, first on all of the data, and then again after removing X₁ from the data set. The losses for the two resulting models are then compared to determine the importance, or “necessity,” of X₁ (Gevrey et al., 2003). Because this measure is a function of two prediction models rather than one, it does not measure how much either individual model relies on X₁. We refer to this approach as measuring empirical Algorithm Reliance (AR) on X₁, as the model-fitting algorithm is the common attribute between the two models. Related procedures were proposed by Breiman et al. (2001); Breiman (2001), which measure the sufficiency of X₁.

As we discuss in Section 3.1, the permutation-based VI measure from RFs (Breiman, 2001; Breiman et al., 2001) forms the inspiration for our definition of MR. This RF VI measure has been the topic of empirical studies (Archer and Kimes, 2008; Calle and Urrea, 2010; Wang et al., 2016), and several variations of the measure have been proposed (Strobl et al., 2007, 2008; Altmann et al., 2010; Hapfelmeier et al., 2014). Mentch and Hooker (2016) use U-statistics to study predictions of ensemble models fit to subsamples, similar to the bootstrap aggregation used in RFs. Procedures related to “Mean Difference Impurity,” another VI measure derived for RFs, have been studied theoretically by Louppe et al. (2013); Kazemitabar et al. (2017). All of this literature focuses on VI measures for RFs, for ensembles, or for individual trees. Our estimator for model reliance differs from the traditional RF VI measure (Breiman, 2001) in that we permute inputs to the overall model, rather than permuting the inputs to each individual ensemble member. Thus, our approach can be used generally, and is not limited to trees or ensemble models.

Outside of the context of RF VI, Zhu et al. (2015) propose an estimand similar to our definition of model reliance, and Gregorutti et al. (2015, 2017) propose an estimand analogous to e_switch(f) − e_orig(f). These recent works focus on the model reliance of f on X₁ specifically when f is equal to the conditional expectation function of Y (that is, $f (x_{1}, x_{2}) = E [Y | X_{1} = x_{1}, X_{2} = x_{2}]$ ). In contrast, we consider model reliance for arbitrary prediction models f. Datta et al. (2016) study the extent to which a model’s predictions are expected to change when a subset of variables is permuted, regardless of whether the permutation affects a loss function L. These VI approaches are specific to a single prediction model, as is MR. In the next section, we consider a more general conception of importance: how much any model in a particular set may rely on the variable of interest.

4. Model Class Reliance

Like many statistical procedures, our MR measure (Section 3) produces a description of a single predictive model. Given a model with high predictive accuracy, MR describes how much the model’s performance hinges on covariates of interest (X₁). However, there will often be many other models that perform similarly well, and that rely on X₁ to different degrees. With this notion in mind, we now study how much any well-performing model from a prespecified class $F$ may rely on covariates of interest.

Recall from Section 2.1 that, in order to define a population ϵ-Rashomon set of near-optimal models, we must choose a “reference” model f_ref to serve as a performance benchmark. In order to discuss this choice, we now introduce more explicit notation for the population ϵ-Rashomon set, written as

R (ϵ, f_{ref}, F) : = {f \in F : e_{orig} (f) \leq e_{orig} (f_{ref}) + ϵ} .

(4.1)

Note that we write $R (ϵ, f_{ref}, F)$ and $R (ϵ)$ interchangeably when f_ref and $F$ are clear from context. Similarly, we occasionally write empirical ϵ-Rashomon sets using the more explicit notation $\hat{R} (ϵ, f_{ref}, F) : = {f \in F : {\hat{e}}_{orig} (f) \leq {\hat{e}}_{orig} (f_{ref}) + ϵ}$ , but typically abbreviate these sets as $\hat{R} (ϵ)$ .

While f_ref could be selected by minimizing the in-sample loss, the theoretical study of $R (ϵ, f_{ref}, F)$ is simplified under the assumption that f_ref is prespecified. For example, f_ref may come from a flowchart used to predict injury severity in a hospital’s emergency room, or from another quantitative decision rule that is currently implemented in practice. The model f_ref can also be selected using sample splitting. In some cases it may be desirable to fix f_ref equal to the best-in-class model $f ⋆ : = \arg \min_{f \in F} e_{o r i g} (f)$ , but this is generally infeasible because f^⋆ is unknown. Still, for any $f_{ref} \in F$ , the Rashomon set $R (ϵ, f_{ref}, F)$ defined using f_ref will always be conservative in the sense that it contains the Rashomon set $R (ϵ, f^{⋆}, F)$ defined using f^⋆.

We can now formalize our definitions of population-level MCR and empirical MCR by simply plugging in our definitions for MR(f) and $\hat{M R} (f)$ (Section 3) into Eqs 2.2 & 2.4 respectively. Studying population-level MCR (Eq 2.2) is the main focus of this paper, as it provides a more comprehensive view of importance than measures from a single model. If MCR₊(ϵ) is low, then no well-performing model in $F$ places high importance on X₁, and X₁ can be discarded at low cost regardless of future modeling decisions. If MCR₋(ϵ) is large, then every well-performing model in $F$ must rely substantially on X₁, and X₁ should be given careful attention during the modeling process. Here, $F$ may itself consist of several parametric model forms (for example, all linear models and all decision tree models with less than 6 single-split nodes). We stress that the range [MCR₋(ϵ), MCR₊(ϵ)| does not depend on the fitting algorithm used to select a model $f \in F$ . The range is valid for any algorithm producing models in $F$ F, and applies for any $f \in F$ .

In the remainder of this section, we derive finite sample bounds for population-level MCR, from which we argue that empirical MCR provides reasonable estimates of population-level MCR (Section 4.1). In Appendix B.7 we consider an alternate formulation of Rashomon sets and MCR where we replace the relative loss threshold in the definition of $R (ϵ)$ with an absolute loss threshold. This alternate formulation can be similar in practice, but still requires the specification of a reference function f_ref to ensure that $R (ϵ)$ and $\hat{R} (ϵ)$ are nonempty.

4.1. Motivating Empirical Estimators of MCR by Deriving Finite-sample Bounds

In this section we derive finite-sample, probabilistic bounds for MCR₊(ϵ) and MCR₋(ϵ). Our results imply that, under minimal assumptions, ${\hat{M C R}}_{+} (ϵ)$ and ${\hat{M C R}}_{-} (ϵ)$ are respectively within a neighborhood of MCR₊ (ϵ) and MCR₋(ϵ) with high probability. However, the weakness of our assumptions (which are typical for statistical-learning-theoretic analysis) renders the width of our resulting CIs to be impractically large, and so we use these results only to show conditions under which ${\hat{M C R}}_{-} (ϵ)$ and ${\hat{M C R}}_{-} (ϵ)$ form sensible point estimates. In Sections 9.1 & 10, below, we apply a bootstrap procedure to account for sampling variability.

To derive these results we introduce three bounded loss assumptions, each of which can be assessed empirically. Let b_orig, B_ind, B_ref, $B_{switch} \in R$ be known constants.

Assumption 1 (Bounded individual loss) For a given model $f \in F$ , assume that $0 \leq L (f, (y, x_{1}, x_{2})) \leq B_{i n d} for any (y, x_{1}, x_{2}) \in (Y \times X_{1} \times X_{2})$ .

Assumption 2 (Bounded relative loss) For a given model $f \in F$ , assume that $| L (f, (y, x_{1}, x_{2})) - L (f_{ref}, (y, x_{1}, x_{2}) | \leq B_{r e f} for any (y, x_{1}, x_{2}) \in Z$ .

Assumption 3 (Bounded aggregate loss) For a given model $f \in F$ , assume that $P {0 < b_{o r i g} \leq {\hat{e}}_{o r i g} (f)} = P {{\hat{e}}_{s w i t c h} (f) \leq B_{s w i t c h}} = 1$ .

Each assumption is a property of a specific model $f \in F$ . The notation B_ind and B_ref refer to bounds for any individual observation, and the notation b_orig and B_switch refer to bounds on the aggregated loss L in a sample. These boundedness assumptions are central to our finite sample guarantees, shown below.

Crucially, loss functions L that are unbounded in general may be used so long as L(f, (y, x₁, x₂)) is bounded on a particular domain. For example, the squared-error loss can be used if $Y$ is contained within a known range, and predictions f (x₁, x₂) are contained within the same range for $(x_{1}, x_{2}) \in X \times X_{2} .$ We give example methods of determining B_ind in Sections 7.3.2 & 7.4.2. For Assumption 3, we can approximate b_orig by training a highly flexible model to the data, and setting b_orig equal to half (or any positive fraction) of the resulting cross-validated loss. To determine B_switch we can simply set B_switch = B_ind, although this may be conservative. For example, in the case of binary classification models for non-separable groups (see Section 9.1), no linear classifier can misclassify all observations, particularly after a covariate is permuted. Thus, it must hold that B_ind > B_switch. Similarly, if f_ref satisfies Assumption 1, then B_ref may be conservatively set equal to B_ind. If model reliance is redefined as a difference rather than a ratio, then a similar form of the results in this section will apply without Assumption 3 (see Appendix A.5).

Based on these assumptions, we can create a finite-sample upper bound for MCR₊(ϵ) and lower bound for MCR₋(ϵ). In other words, we create an “outer” bound that contains the interval [MCR₋(ϵ),MCR₊(ϵ)| with high probability.

Theorem 4 (“Outer” MCR Bounds) Given a constant ϵ ≥ 0, let $f_{+, ϵ} \in \arg \max_{R (ϵ)} M R (f)$ and $f_{-, ϵ} \in \arg \min_{R (ϵ)} M R (f)$ be prediction models that attain the highest and lowest model reliance among models in $R (ϵ)$ . If f_+,ϵ and f_−,ϵ satisfy Assumptions 1, 2 & 3, then

P (M C R_{+} (ϵ) > {\hat{M C R}}_{+} (ϵ_{o u t}) + Q_{o u t}) \leq δ, a n d

(4.2)

P (M C R_{-} (ϵ) < {\hat{M C R}}_{-} (ϵ_{o u t}) - Q_{o u t}) \leq δ,

(4.3)

where $ϵ_{o u t} : = ϵ + 2 B_{r e f} \sqrt{\frac{\log (3 δ^{- 1})}{2 n}}, {and Q}_{out} : = \frac{B_{s w i t c h}}{b_{o r i g}} - \frac{B_{s w i t c h} - B_{i n d} \sqrt{\frac{\log (6 δ^{- 1})}{n}}}{b_{o r i g} + B_{i n d} \sqrt{\frac{\log (6 δ^{- 1})}{2 n}}}$ .

Eq 4.2 states that, with high probability, MCR₊(ϵ) is no higher than ${\hat{M C R}}_{+} (ϵ_{out})$ added to an error term $Q_{out}$ . As n increases, ϵ_out approaches ϵ and $Q_{out}$ approaches zero. One practical implication is that, roughly speaking, if ${\hat{M C R}}_{+} (ϵ) \approx {\hat{M C R}}_{+} (ϵ_{out})$ , then the empirical estimator ${\hat{M C R}}_{+} (ϵ)$ is unlikely to substantially underestimate MCR₊(ϵ). By similar reasoning, we can conclude from Eq 4.3 that if ${\hat{M C R}}_{-} (ϵ) \approx {\hat{M C R}}_{-} (ϵ_{out})$ , then ${\hat{M C R}}_{+} (ϵ)$ is unlikely to substantially overestimate MCR₋(ϵ). By setting ϵ = 0, Theorem 4 can also be used to create a finite-sample bound for the reliance of the unique (unknown) best-in-class model on X₁ (see Corollary 22 in Appendix A.4), although describing individual models is not the main focus of this paper.

We provide a visual illustration of Theorem 4 in Figure 3. A brief sketch of the proof is as follows. First, we enlarge the empirical ϵ-Rashomon set by increasing ϵ to ϵ_out, such that, by Hoeffding’s inequality, $f_{+, ϵ} \in \hat{R} (ϵ_{out})$ with high probability. When $f_{+, ϵ} \in \hat{R} (ϵ_{out})$ , we know that ${\hat{M R}}_{_{(f_{+, ϵ})}} \leq {\hat{M C R}}_{+} (ϵ_{out})$ by the definition of ${\hat{M C R}}_{+} (ϵ_{out})$ . Next, the term $Q_{out}$ leverages finite-sample results for U-statistics to account for estimation error of MR(f_+,ϵ) = MCR₊(ϵ) when using the estimator ${\hat{M R}}_{_{(f_{+, ϵ})}}$ . Thus, we can relate ${\hat{M R}}_{_{(f_{+, ϵ})}}$ to both ${\hat{M C R}}_{+} (ϵ_{out})$ and MCR₊(ϵ) in order to obtain Eq 4.2. Similar steps can be applied to obtain Eq 4.3.

The bounds in Theorem 4 naturally account for potential overfitting without an explicit limit on model class complexity (such as a covering number, Rademacher complexity, or VC dimension). Instead, these bounds depend on being able to fully optimize MR across sets in the form of $\hat{R} (ϵ)$ . If we allow our model class $F$ to become more flexible, then the size of $\hat{R} (ϵ)$ will also increase. Because the bounds in Theorem 4 result from optimizing over $\hat{R} (ϵ)$ , increasing the size of $\hat{R} (ϵ)$ results in wider, more conservative bounds. In this way, Eqs 4.2 and 4.3 implicitly capture model class complexity.

So far, Theorem 4 lets us bound the range of MR values corresponding to models that predict well, but it does not tell us whether these bounds are actually attained. Similarly, we can conclude from Theorem 4 that [MCR₋(ϵ), MCR₊(ϵ)] is unlikely to exceed the estimated range [ ${\hat{M C R}}_{-} (ϵ)$ , ${\hat{M C R}}_{+} (ϵ)$ ] by a substantial margin, but we cannot determine whether this estimated range is unnecessarily wide. For example, consider the models that drive the ${\hat{M C R}}_{+} (ϵ)$ estimator: the models with strong in-sample accuracy, and high empirical reliance on X₁. These models’ in-sample performance could merely be the result of overfitting, in which case they do not tell us direct information about $R (ϵ)$ . Alternatively, even if all of these models truly do perform well on expectation (that is, even if they are contained in $R (ϵ)$ ), the model with the highest empirical reliance on X₁ may merely be the model for which our empirical MR estimate contains the most error. Either of these scenarios can cause ${\hat{M C R}}_{+} (ϵ)$ to be unnecessarily high, relative to MCR₊(ϵ).

Fortunately, both problematic scenarios are solved by requiring a limit on the complexity of $F$ . We propose a complexity measure in the form of a covering number, which allows us control a worst case scenario of either overfitting or MR estimation error. Specifically, we define the set of functions $G_{r}$ as an r-margin-expectation-cover if for any $f \in F$ and any distribution D, there exists $g \in G_{r}$ such that

E_{Z \sim D} | L (f, Z) - L (g, Z) | \leq r .

(4.4)

We define the covering number $N (F, r)$ to be the size of the smallest r-margin-expectation-cover for $F$ . In general, we use $P_{V \sim D}$ and $E_{V \sim D}$ to denote probabilities and expectations with respect to a random variable V following the distribution D. We abbreviate these quantities accordingly when V or D are clear from context, for example, as $P_{D}$ , $P_{V}$ , or simply $P$ . Unless otherwise stated, all expectations and probabilities are taken with respect to the (unknown) population distribution.

We first show that this complexity measure allows us to control the worst case MR estimation error, that is, the covering number $N (F, r)$ provides a uniform bound on the error of $\hat{M R} (f)$ for all $f \in F$ .

Theorem 5 (Uniform bound for $\hat{M R}$ ) Given r > 0, if Assumptions 1 and 3 hold for all $f \in F$ , then

P [\sup_{f \in F} | \hat{M R} (f) - M R (f) | > q (δ, r, n)] \leq δ,

where

q (δ, r, n) : = \frac{B_{s w i t c h}}{b_{o r i g}} - \frac{B_{s w i t c h} - {B_{i n d} \sqrt{\frac{\log (4 δ^{- 1} N (F, r \sqrt{2}))}{n}} + 2 r \sqrt{2}}}{b_{o r i g} + {B_{i n d} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} + 2 r}} .

(4.5)

Theorem 5 states that, with high probability, the largest possible estimation error for MR(f) across all models in $F$ is bounded by q(δ, r, n), which can be made arbitrarily small by increasing n and decreasing r. As we noted in Section 3.1, this means that it is possible to train a model and estimate its reliance on variables without using sample-splitting.

The covering number $N (F, r)$ can also be used to limit the extent of overfitting (see Appendix B.5.1). As a result, it is possible to set an in-sample performance threshold low enough so that it will only be met by models with strong expected performance (that is, by models truly within $R (ϵ)$ ). To implement this idea of a stricter performance threshold, we contract the empirical ϵ-Rashomon set by subtracting a buffer term from ϵ. This requires that we generalize the definition of an empirical ϵ-Rashomon set to $\hat{R} (ϵ, f_{ref}, F) : = {f_{ref}} ⋃ {f \in F : {\hat{e}}_{orig} (f) \leq {\hat{e}}_{orig} (f_{ref}) + ϵ}$ for $ϵ \in R$ , where the explicit inclusion of f_ref now ensures that $\hat{R} (ϵ, f_{ref}, F)$ is nonempty, even for ϵ < 0. As before, we typically omit the notation f_ref and $F$ , writing $\hat{R} (ϵ)$ instead.

We are now prepared to answer the questions of whether the bounds from Theorem 4 are actually attained, and of whether the estimated range $[{\hat{M C R}}_{-} (ϵ), {\hat{M C R}}_{+} (ϵ)]$ is unnecessarily wide. Our answer comes in the form of an upper bound on MCR₋(ϵ), and a lower bound on MCR₊(ϵ).

Theorem 6 (“Inner” MCR Bounds) Given constants ϵ ≥ 0 and r > 0, if Assumptions 1, 2 and 3 hold for all $f \in F$ , and then

\begin{matrix} P ({M C R}_{+} (ϵ) < {\hat{M C R}}_{+} (ϵ_{i n}) - Q_{i n}) \leq δ, & a n d \end{matrix}

(4.6)

P (M C R_{-} (ϵ) > {\hat{M C R}}_{-} (ϵ_{i n}) + Q_{i n}) \leq δ,

(4.7)

where $ϵ_{i n} : = ϵ - 2 B_{r e f} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} - 2 r, and Q_{i n} = q (\frac{δ}{2}, r, n)$ , as defined in Eq 4.5.

Theorem 6 can allow us to infer an “inner” bound that is contained within the interval [MCR₋(ϵ), MCR₊(ϵ)] with high probability. In Figure 3, we illustrate the result of Theorem 6, and give a sketch of the proof. This proof follows a similar structure to that of Theorem 4, but incorporates Theorem 5’s uniform bound on MR estimation error ( $Q_{in}$ term), as well as an additional uniform bound on the probability that any model has in-sample loss too far from its expected loss (ϵ_in term).

A practical implication of Theorem 6 is that, roughly speaking, if ${\hat{M C R}}_{+} (ϵ_{in}) \approx {\hat{M C R}}_{+} (ϵ)$ then it is unlikely for the empirical estimator ${\hat{M C R}}_{+ (ϵ)}$ to substantially underestimate MCR₊(ϵ). Taken together with Theorem 4, we can conclude that, if ${\hat{M C R}}_{+} (ϵ_{in}) \approx {\hat{M C R}}_{+} (ϵ_{out})$ , then the estimator ${\hat{M C R}}_{+} (ϵ)$ is unlikely either to overestimate or to underestimate MCR₊(ϵ) by very much. In large samples, it may be plausible to expect the condition ${\hat{M C R}}_{+} (ϵ_{in}) \approx {\hat{M C R}}_{+} (ϵ_{out})$ to hold, since ϵ_in and ϵ_out both approach ϵ as n increases. In the same way, if ${\hat{M C R}}_{-} (ϵ_{in}) \approx {\hat{M C R}}_{-} (ϵ_{out})$ , we can conclude from Eqs 4.3 & 4.7 that the empirical estimator ${\hat{M C R}}_{-} (ϵ)$ is unlikely to either overestimate or underestimate MCR₋(ϵ) by very much. For this reason, we argue that ${\hat{M C R}}_{-} (ϵ)$ and ${\hat{M C R}}_{+} (ϵ)$ form sensible estimates of population-level MCR – each is contained within a neighborhood of its respective estimand, with high probability. The secondary x-axis of Figure 3 gives an illustration of this argument.

5. Extensions of Rashomon Sets Beyond Variable Importance

In this section we generalize the Rashomon set approach beyond the study of MR. In Section 5.1, we create finite-sample CIs for other summary characterizations of near-optimal, or best-in-class models. The generalization also helps to illustrate a core aspect of the argument underlying Theorem 4: models with near-optimal performance in the population tend to have relatively good performance in random samples.

In Section 5.2, we review existing literature on near-optimal models.

5.1. Finite-sample Confidence Intervals from Rashomon Sets

Rather than describing how much a model relies on X₁, here we assume the analyst is interested in an arbitrary characteristic of a model. We denote this characteristic of interest as $ϕ : F \to R$ . For example, if f_β is the linear model $f_{β} (x) = x^{'} β,$ then ϕ may be defined as the norm of the associated coefficient vector (that is, $(ϕ (f_{β}) = {‖ β ‖}_{2}^{2})$ or the prediction f_β would assign given a specific covariate profile x_new (that is $ϕ (f_{β}) = f_{β} (x_{n e w})$ ).

Given a descriptor ϕ, we now show a general result that allows creation of finite-sample CIs for the best performing models $R (ϵ)$ . The resulting CIs are themselves based on empirical Rashomon sets.

Proposition 7 (Finite sample CIs from Rashomon sets) Let $ϵ^{'} : = ϵ + 2 B_{r e f} \sqrt{\frac{\log (2^{δ - 1})}{2 n}}$ , let ${\hat{ϕ}}_{-} (ϵ^{'}) : = \min_{f \in \hat{R} (ϵ^{'})} ϕ (f)$ , and let ${\hat{ϕ}}_{+} (ϵ^{'}) : = \max_{f \in \hat{R} (ϵ^{'})} ϕ (f)$ .

If Assumption 2 holds for all $f \in R (ϵ)$ , then

P [{ϕ (f) : f \in R (ϵ)} \subseteq [{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'})]] \geq 1 - δ .

Proposition 7 generates a finite-sample CI for the range of values ϕ(f) corresponding to well-performing models, ${ϕ (f) : f \in R (ϵ)}$ . This CI, denoted by $[{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'}),]$ can itself be interpreted as the range of values ϕ(f) corresponding to models f with empirical loss not substantially above that of f_ref. Thus, the interval has both a rigorous coverage rate and a coherent in-sample interpretation. The proof of Proposition 7 uses Hoeffding’s inequality to show that models in $F$ are contained in $\hat{R} (ϵ^{'})$ with high probability, that is, that models with good expected performance tend to perform well in random samples.

An immediate corollary of Proposition 7 is that we can generate finite-sample CIs for all best-in-class models $f^{⋆} \in {arg min}_{f \in F} E L (f, Z)$ by setting ϵ = 0. This corollary can be further strengthened if a single model f^⋆ is assumed to uniquely minimize $E L (f, Z)$ over $f \in F$ (see Appendix B.6).

Note that Proposition 7 implicitly assumes that ϕ(f) can be determined exactly for any model $f \in F$ , in order for the interval $[{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'})]$ to be precisely determined. This assumption does not hold, for example, if ϕ(f) = MR(f), or if ϕ(f) = Var{f(X₁, X₂)}, as these quantities depend on both f and the (unknown) population distribution. In such cases, an additional correction factor must be incorporated to account for estimation error of ϕ(f), such as the $Q_{out}$ term in Theorem 4.

In concurrent work, Coker et al. (2018) show that profile likelihood intervals take the same form as the interval $[{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'})]$ in Proposition 7. This means that a profile likelihood interval can also be expressed by minimizing and maximizing over an empirical Rashomon set. More specifically, consider the case where the loss function L is the negative of the known log likelihood function, and where f_ref is the maximum likelihood estimate of the “true model,” which in this case is f^⋆. If additional minor assumptions are met (see Appendix A.6 for details), then the (1 − δ)-level profile likelihood interval for ϕ(f^⋆) is equal to $[{\hat{ϕ}}_{-} (\frac{X_{1}, 1 - δ}{2 n}), {\hat{ϕ}}_{+} (\frac{X_{1}, 1 - δ}{2 n})]$ , where ( ${\hat{ϕ}}_{-}$ and ${\hat{ϕ}}_{+}$ are defined as in Proposition 7, and $X_{1, 1 - δ}$ is the 1 − δ percentile of a chi-square distribution with 1 degree of freedom.

Relative to a profile likelihood approach, the advantage of Proposition 7 is that it does not require asymptotics, it does not require that the likelihood be known up to a parametric form, and it can be extended to study the set of near-optimal prediction models $R (ϵ)$ , rather than a single, potentially misspecified prediction model f^⋆. This is especially useful when different near-optimal models accurately describe different aspects of the underlying data generating process, but none capture it completely. The disadvantage of Proposition 7 is that the required performance threshold of $ϵ^{'} = ϵ + 2 B_{r e f} \sqrt{\frac{\log (2 δ - 1)}{2 n}}$ decreases more slowly than the performance threshold of $\frac{X_{1}, 1 - δ}{2 n}$ required in a profile likelihood interval. Because our results from Section 4.1 carry a similar disadvantage, we use these results primarily to motivate point estimates describing the Rashomon set $R (ϵ)$ .

Still, it is worth emphasizing the generality of Proposition 7. Through this result, Rashomon sets allow us to reframe a wide set of finite-sample inference problems as in-sample optimization problems. The implied CIs are not necessarily in closed form, but the approach still opens an exciting pathway for deriving non-asymptotic results. For example, they imply that existing methods for profile likelihood intervals might be able to be reapplied to achieve finite-sample results. For highly complex model classes where profile likelihoods are difficult to compute, such as neural networks or random forests, approximate inference is sometimes achieved via approximate optimization procedures (for example, Markov chain Monte Carlo for Bayesian additive regression trees, in Chipman et al., 2010). Proposition 7 shows that similar approximate optimization methods could be repurposed to establish approximate, finite-sample inferences for the same model classes.

5.2. Related Literature on the Rashomon Effect

Breiman et al. (2001) introduced the “Rashomon effect” of statistics as a problem of ambiguity: if many models fit the data well, it is unclear which model we should try to interpret. Breiman suggests that the ensembling many well-performing models together can resolve this ambiguity, as the new ensemble model may perform better than any of its individual members. However, this approach may only push the problem from the member level to the ensemble level, as there may also be many different ensemble models that fit the data well.

The Rashomon effect has also been considered in several subject areas outside of VI, including those in non-statistical academic disciplines (Heider, 1988; Roth and Mehta, 2002). Tulabandhula and Rudin (2014) optimize a decision rule to perform well under the predicted range of outcomes from any well-performing model. Statnikov et al. (2013) propose an algorithm to discover multiple Markov boundaries, that is, minimal sets of covariates such that conditioning on any one set induces independence between the outcome and the remaining covariates. Nevo and Ritov (2017) report interpretations corresponding to a set of well-fitting, sparse linear models. Meinshausen and Bühlmann (2010) estimate structural aspects of an underlying model (such as the variables included in that model) based on how stable those aspects are across a set of well-fitting models. This set of well-fitting models is identified by repeating an estimation procedure in a series of perturbed samples, using varying levels of regularization (see also Azen et al., 2001). Letham et al. (2016) search for a pair of well-fitting dynamical systems models that give maximally different predictions.

6. Calculating Empirical Estimates of Model Class Reliance

In this section, we propose a binary search procedure to bound the values of ${\hat{M C R}}_{-} (ϵ)$ and ${\hat{M C R}}_{+} (ϵ)$ (see Eq 2.4), which respectively serve as estimates of MCR₋(ϵ) and MCR₊(ϵ) (see Section 4.1). Each step of this search consists of minimizing a linear combination of ê_orig(f) and ê_switch(f) across $f \in F$ . Our approach is related to the fractional programming approach of Dinkelbach (1967), but accounts for the fact that the problem is constrained by the value of the denominator, ê_orig(f). We additionally show that, for many model classes, computing ${\hat{M C R}}_{-} (ϵ)$ only requires that we minimize convex combinations of ê_orig(f) and ê_switch(f), which is no more difficult than minimizing the average loss over an expanded and reweighted sample (See Eq 6.2 & Proposition 11).

Computing ${\hat{M C R}}_{+} (ϵ)$ however will require that we are able to minimize arbitrary linear combinations of ê_orig(f) and ê_switch(f). In Section 6.3, we outline how this can be done for convex model classes – classes for which the loss function is convex in the model parameter. Later, in Section 7, we give more specific computational procedures for when $F$ is the class of linear models, regularized linear models, or linear models in a reproducing kernel Hilbert space (RKHS). We summarize the tractability of computing empirical MCR for different model classes in Table 1.

Table 1:

Tractability of empirical MCR computation for different model classes - For each case, we describe the tractability of computing ${\hat{M C R}}_{-}$ and ${\hat{M C R}}_{+}$ using our proposed approaches. Computing empirical MCR can be reduced to a sequence of optimization problems, the form of which are noted in parentheses within the above table.

Model class and loss function ( $F & L$ )	Computing ${\hat{M C R}}_{-}$	Computing ${\hat{M C R}}_{+}$
(L2 Regularized) Linear models, with the squared error loss	Highly tractable (QP1QC, see Sections 7.2 & 7.3)	Highly tractable (QP1QC, see Sections 7.2 & 7.3)
Linear models in a reproducing kernel Hilbert space, with the squared error loss	Moderately tractable (QP1QC, see Section 7.4.1)	Moderately tractable (QP1QC, see Section 7.4.1)
Cases where irrelevant covariates do not improve predictions	Moderately tractable (Convex optimization problems, see Proposition 11)	Potentially intractable
Cases where minimizing the empirical loss is a convex optimization problem	Potentially intractable (DC programs, see Section 6.3)	Potentially intractable (DC programs, see Section 6.3)

Open in a new tab

To simplify notation associated with the reference model f_ref, we present our computational results in terms of bounds on empirical MR subject to performance thresholds on the absolute scale. More specifically, we present bound functions b₋ and b₊ satisfying $b_{-} (ϵ_{abs}) \leq \hat{M R} (f) \leq b_{+} (ϵ_{abs})$ simultaneously for all ${f, ϵ_{abs} : {\hat{e}}_{orig} (f) \leq ϵ_{abs}, f \in F, ϵ_{abs} > 0}$ (Figures 2 & 8 show examples of these bounds). The binary search procedures we propose can be used to tighten these boundaries at a particular value ϵ_abs of interest.

Figure 8: — Empirical MR and MCR for Broward County criminal records data set - For any prediction model f, the y-axis shows empirical loss $({\hat{e}}_{stnd} (f))$ and the x-axis shows empirical reliance $(\hat{M R} (f))$ on each covariate subset. Null reliance (MR equal to 1.0) is marked by the vertical dotted line. Reliances on different covariate subsets are marked by color (“admissible” = blue; “inadmissible” = gray). For example, model reliance values for f_ref are shown by the two circular points, one for “admissible” variables and one for “inadmissible” variables. MCR for different values of ϵ can be represented as boundaries on this coordinate space. To this end, for each covariate subset, we compute conservative boundary functions (shown as solid lines, or “bowls”) guaranteed to contain *all models in the class* (see Section 6). Specifically, all models in $f \in F_{D, r_{k}}$ are guaranteed to have an empirical loss $({\hat{e}}_{stnd} (f))$ and empirical MR value $(\hat{M R} (f))$ for “inadmissible variables” corresponding to a point within the gray bowl. Likewise, all models in $F_{D, r_{k}}$ are guaranteed to have an empirical loss and empirical MR value for “admissible variables” corresponding to a point within the blue bowl. Points shown as “×” represent additional models in $F_{D, r_{k}}$ discovered during our computational procedure, and thus show where the “bowl” boundary is tight. The goal of our computation procedure (see Section 6) is to tighten the boundary as much as possible near the ϵ value of interest, shown by the dashed horizontal line above. This dashed line has a y-intercept equal to the loss of the reference model plus the ϵ value of interest. Bootstrap CIs for *MCR*₋(ϵ) and *MCR*₊(ϵ) are marked by brackets.

We briefly note that as an alternative to the global optimization procedures we discuss below, heuristic optimization procedures such as simulated annealing can also prove useful in bounding empirical MCR. By definition, the empirical MR for any model in $\hat{R} (ϵ)$ forms a lower bound for ${\hat{M C R}}_{+} (ϵ)$ , and an upper bound for ${\hat{M C R}}_{-} (ϵ)$ . Heuristic maximization and minimization of empirical MR can be used to tighten these boundaries.

Throughout this section, we assume that $0 < \min_{f \in F} ê_{orig} (f)$ , to ensure that MR is finite.

6.1. Binary Search for Empirical MR Lower Bound

Before describing our binary search procedure, we introduce additional notation used in this section. Given a constant $γ \in R$ and prediction model $f ϵ F$ , we define the linear combination ${\hat{h}}_{-, γ}$ , and its minimizers (for example, ${\hat{g}}_{-, γ}, F$ ), as

{\hat{h}}_{-, γ} (f) : = γ {\hat{e}}_{orig} (f) + {\hat{e}}_{switch} (f), {and \hat{g}}_{-, γ, F} \in \underset{f \in F}{arg min} {\hat{h}}_{-, γ} (f) .

We do not require that ${\hat{h}}_{-, γ}$ is uniquely minimized, and we frequently use the abbreviated notation ${\hat{g}}_{-, γ}$ when $F$ is clear from context.

Our goal in this section is to derive a lower bound on $\hat{M R}$ for subsets of $F$ in the form of ${f \in F : {\hat{e}}_{orig} (f) \leq ϵ_{a b s}}$ . We achieve this by minimizing a series of linear objective functions in the form of ${\hat{h}}_{-, γ}$ , using a similar method to that of Dinkelbach (1967). Often, minimizing the linear combination ${\hat{h}}_{-, γ} (f)$ is more tractable than minimizing the MR ratio directly.

Almost all of the results shown in this section, and those in Section 6.2, also hold if we replace ${\hat{e}}_{switch}$ with ${\hat{e}}_{divide}$ throughout (see Eq 3.5), including in the definition of $\hat{M R}$ and ${\hat{h}}_{-, γ} (f)$ . The exception is Proposition 11, below, which we may still expect to approximately hold if we replace ${\hat{e}}_{switch}$ with ${\hat{e}}_{divide}$ .

Given an observed sample, we define the following condition for a pair of values ${γ, ϵ_{abs}} \in R \times R_{> 0}$ , and argmin function ${\hat{g}}_{-, γ}$ :

Condition 8 (Criteria to continue search for $\hat{M R}$ lower bound) ${\hat{h}}_{-, γ} ({\hat{g}}_{-, γ}) \geq 0$ and ${\hat{e}}_{o r i g} (\hat{g} -, γ) \leq ϵ_{a b s}$ .

We are now equipped to determine conditions under which we can tractably create a lower bound for empirical MR.

Lemma 9 (Lower bound for $\hat{M R}$ ) If $γ \in R$ satisfies ${\hat{h}}_{-, γ} ({\hat{g}}_{-, γ}) \geq 0$ , then

\frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{ϵ_{abs}} - γ \leq \hat{M R} (f)

(6.1)

for all $f \in F$ satisfying ${\hat{e}}_{o r i g} (f) \leq ϵ_{a b s}$ . It also follows that

- γ \leq \hat{M R} (f) for all f \in F .

Additionally, if $f = {\hat{g}}_{-, γ}$ and at least one of the inequalities in Condition 8 holds with equality, then Eq 6.1 holds with equality.

Lemma 9 reduces the challenge of lower-bounding $\hat{M R} (f)$ to the task of minimizing the linear combination ${\hat{h}}_{-, γ} (f)$ . The result of Lemma 9 is not only a single boundary for a particular value of ϵ_abs, but a boundary function that holds all values of ϵ_abs > 0, with lower values of ϵ_abs leading to more restrictive lower bounds on $\hat{M R} (f)$ .

In addition to the formal proof for Lemma 9, we provide a heuristic illustration of the result in Figure 4, to aid intuition.

It remains to determine which value of γ should be used in Eq 6.1. The following lemma implies that this value can be determined by a binary search, given a particular value of interest for ϵ_abs.

Lemma 10 (Monotonicity for $\hat{M R}$ lower bound binary search) The following monotonicity results hold:

${\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})$ is monotonically increasing in γ.
${\hat{e}}_{orig} ({\hat{g}}_{-, γ})$ is monotonically decreasing in γ.
Given ϵ_abs, the lower bound from Eq 6.1, ${\frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{ϵ_{a b s}} - γ}$ , is monotonically decreasing in γ in the range where ${\hat{e}}_{orig} ({\hat{g}}_{-, γ}) \leq ϵ_{abs}$ , and increasing otherwise.

Given a particular performance level of interest, ϵ_abs, Point 3 of Lemma 10 tells us that the value of γ resulting in the tightest lower bound from Eq 6.1 occurs when γ is as low as possible while still satisfying Condition 8. Points 1 and 2 show that if γ₀ satisfies Condition 8, and one of the equalities in Condition 8 holds with equality, then Condition 8 holds for all γ ≥ γ₀. Together, these results imply that we can use a binary search to determine the value of γ to be used in Lemma 9, reducing this value until Condition 8 is no longer met. In addition to the formal proof for Lemma 10, we provide an illustration of the result in Figure 5 to aid intuition.

Figure 5: — Monotonicity for binary search. Above we show a version of Figure 4–C for two alternative values of γ. This figure is meant, to add intuition for the monotonicity results in Lemma 10, in addition to the formal proof. Increasing γ is equivalent to *decreasing* the slope of the red line in Figure 4–C. We define two values γ₁ < γ₂, where γ₁ corresponds to the solid red line, above, and γ₂ corresponds to the semi-transparent, red line. The y-intercept values of these lines are equal to ${\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{1}})$ and ${\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}})$ respectively (see Figure 4–C caption). The solid and semi-transparent, black dots mark $({\hat{g}}_{-, γ_{1}})$ and $({\hat{g}}_{-, γ_{2}})$ respectively. Plugging γ₁ and γ₂ into Eq 6.1 yields two lower bounds for $\hat{M R}$ , marked by the slopes of the solid and semi-transparent, black lines respectively (see Figure 4–C caption). We see that (1) ${\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{1}}) \leq {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}})$ , that (2) ${\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) \geq {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}}) \geq$ , and that (3) the left-hand side of Eq 6.1 is decreasing in γ when ${\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) \leq ϵ_{abs}$ . These three conclusions are marked by arrows in the above figure, with numbering matching the enumerated list in Lemma 10.

Next we present simple conditions under which the binary search for values of γ can be restricted to the nonnegative real line. This result substantially extends the computational tractability of our approach, as minimizing ${\hat{h}}_{-, γ}$ for γ ≥ 0 is equivalent to minimizing a reweighted empirical loss over an expanded sample of size n²:

{\hat{h}}_{-, γ} (f) = γ {\hat{e}}_{orig} (f) + {\hat{e}}_{switch} (f) = \sum_{i = 1}^{n} \sum_{j = 1}^{n} ω_{γ} (i, j) L {f, (y_{[i]}, X_{1 [j, \cdot]}, X_{2 [i, \cdot]})},

(6.2)

where $ω_{γ} (i, j) = \frac{γ 1 (i = j)}{n} + \frac{1 (i \neq j)}{n (n - 1)} \geq 0$ .

Proposition 11 (Nonnegative weights for $\hat{M R}$ lower bound binary search) Assume that L and $F$ satisfy the following conditions.

(Predictions are sufficient for computing the loss) The loss L{f, (Y, X₁, X₂)} depends on the covariates (X₁, X₂) only via the prediction function f, that is, $L {f, (y, x_{1}^{(a)}, x_{2}^{(a)})} = L {f, (y, x_{1}^{(b)}, x_{2}^{(b)})}$ whenever $f (x_{1}^{(a)}, x_{2}^{(a)}) = f (x_{1}^{(b)}, x_{2}^{(b)})$ .
(Irrelevant information does not improve predictions) For any distribution D satisfying X₁ ⊥_D (X₂, Y), there exists a function f_D satisfying
$E_{D} L {f_{D}, (Y, X_{1}, X_{2})} = \min_{f \in F} E_{D} L {f, (Y, X_{1}, X_{2})},$
and
$f_{D} (x_{1}^{(a)}, x_{2}) = f_{D} (x_{1}^{(b)}, x_{2}) f o r a n y x_{1}^{(a)}, x_{1}^{(b)} \in X_{1} a n d x_{2} \in X_{2} .$ (6.3)

Let γ = 0. Under the above assumptions, it follows that either (i) there exists a function ${\hat{g}}_{-, 0}$ minimizing ${\hat{h}}_{-, 0}$ that does not. satisfy Condition 8, or (ii) ${\hat{e}}_{o r i g} (g -, 0) \leq ϵ_{abs} and {\hat{M R}}_{(g -, 0)} \leq 1$ for any function ${\hat{g}}_{-, 0}$ minimizing ${\hat{h}}_{-, 0}$ .

The implication of Proposition 11 is that, when the conditions of Proposition 11 are met, the search region for γ can be limited to the nonnegative real line, and minimizing ${\hat{h}}_{-, γ}$ will be no harder than minimizing a reweighted empirical loss over an expanded sample (Eq 6.2). To see this, recall that for a fixed value of ϵ_abs we can tighten the boundary in Lemma 9 by conducting a binary search for the smallest, value of γ that satisfies Condition 8. If setting γ equal to 0 does not satisfy Condition 8, and the search for γ can be restricted to the nonnegative real line, where minimizing ${\hat{h}}_{-, 0}$ is more tractable (see Eq 6.2). Alternatively, if ${\hat{e}}_{orig} (g -, 0) \leq ϵ_{abs} and {\hat{M R}}_{(g -, 0)} \leq 1$ , then we have identified a well-performing model g_−,0 with empirical MR no greater than 1. For ϵ_abs = ê_orig(f_ref) + ϵ, this implies that ${\hat{M C R}}_{-} (ϵ) \leq 1$ , which is a sufficiently precise conclusion for most, interpretational purposes (see Appendix A.2).

Because of the fixed pairing structure used in ê_divide, Proposition 11 will not necessarily hold if we replace ê_switch with ê_divide throughout (see Appendix C.3). However, since ê_divide approximates ê_switch, we can expect Proposition 11 to hold approximately. The bound from Eq 6.1 still remains valid if we replace ê_switch with ê_divide and limit γ to the nonnegative reals, although in some cases it may not be as tight.

6.2. Binary Search for Empirical MR Upper Bound

We now briefly present a binary search procedure to upper bound $\hat{M R}$ , which mirrors the procedure from Section 6.1. Given a constant $γ \in R$ and prediction model $f \in F$ , we define the linear combination ${\hat{h}}_{+, γ}$ , and its minimizers (for example, ${\hat{g}}_{+, γ, F}$ ), as

\hat{h} {}_{+, γ}{(f)} : = {\hat{e}}_{orig} (f) + γ {\hat{e}}_{switch} (f), and {\hat{g}}_{+, γ, F} \in \underset{f \in F}{arg min} {\hat{h}}_{+, γ} (f) .

As in Section 6.1, ${\hat{h}}_{+, γ}$ need not be uniquely minimized, and we generally abbreviate ${\hat{g}}_{+, γ, F}$ as ${\hat{g}}_{+, γ}$ when $F$ is clear from context.

Given an observed sample, we define the following condition for a pair of values ${γ, ϵ_{abs}} \in R_{\leq 0} \times R_{> 0}$ , and argmin function ${\hat{g}}_{+, γ}$ :

Condition 12 (Criteria to continue search for $\hat{M R}$ upper bound) ${\hat{h}}_{+, γ} ({\hat{g}}_{+, γ}) \geq 0$ and ${\hat{e}}_{orig} ({\hat{g}}_{+, γ}) \leq ϵ_{abs}$ .

We can now develop a procedure to upper bound $\hat{M R}$ , as shown in the next lemma.

Lemma 13 (Upper bound for $\hat{M R}$ ) If $γ \in R$ satisfies γ ≤ 0 and ${\hat{h}}_{+, γ} ({\hat{g}}_{+, γ}) \geq 0$ , then

\hat{M R} (f) \leq {\frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{a b s}} - 1} γ^{- 1}

(6.4)

for all $f \in F$ satisfying ê_orig(f) ≤ ϵ_abs. It also follows that

\hat{M R} (f) \leq | γ^{- 1} | for all f \in F .

(6.5)

Additionally, if $f = {\hat{g}}_{+, γ}$ and at least one of the inequalities in Condition 12 holds with equality, then Eq 6.4 holds with equality.

As in Section 6.1, it remains to determine the value of γ to use in Lemma 13, given a value of interest for $ϵ_{abs} \geq \min_{f \in F} {\hat{e}}_{orig} (f)$ . The next lemma tells us that the boundary from Lemma 13 is tightest when γ is as low as possible while still satisfying Condition 12.

Lemma 14 (Monotonicity for $\hat{M R}$ upper bound binary search) The following monotonicity results hold:

${\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})$ is monotonically increasing in γ.
${\hat{e}}_{orig} ({\hat{g}}_{+, γ})$ is monotonically decreasing in γ for γ ≤ 0, and Condition 12 holds for γ = 0 and $ϵ_{abs} \geq \min_{f \in F} {\hat{e}}_{orig} (f)$ .
Given ϵ_abs, the upper boundary ${\frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{a b s}} - 1} γ^{- 1}$ is monotonically increasing in γ in the range where ${\hat{e}}_{orig} ({\hat{g}}_{+, γ}) \leq ϵ_{abs}$ and γ < 0, and decreasing in the range where ${\hat{e}}_{orig} ({\hat{g}}_{+, γ}) > ϵ_{abs}$ and γ < 0.

Together, the results from Lemma 14 imply that we can use a binary search across $γ \in R$ to tighten the boundary on $\hat{M R}$ from Lemma 13.

6.3. Convex Models

In this section we show that empirical MCR can be conservatively computed when the loss function is convex in the model parameters – that is, when the models $f_{θ} \in F$ are indexed by a d-dimensional parameter $θ \in Θ \subseteq R^{d}$ , and when the loss function L(f_θ, (y, x₁, x₂)) is convex in θ for all $(x_{1}, x_{2}, y) \in X_{1} \times (X_{2}, Y)$ .

Fortunately, neither Lemma 9 nor Lemma 13 require an exact minimum for ${\hat{h}}_{-, γ}$ or ${\hat{h}}_{+, γ}$ . For Lemma 9, any lower bound on ${\hat{h}}_{-, γ}$ is sufficient to determine a lower bound on MR(f). Likewise, for Lemma 13, any lower bound on ${\hat{h}}_{+, γ}$ is sufficient to determine an upper bound on MR(f).

To find these lower bounds, we note that for “convex” model classes (defined above) the optimization problems in Sections 6.1 & 6.2 can be written either as convex optimization problems, or as difference convex function (DC) programs. A DC program is one that can be written as

\min_{{θ : c_{DC} (θ) \leq k, θ \in Θ}} g_{DC} (θ) - h_{DC} (θ),

where c_DC is a constraint function, $k \in R^{1}$ , and g_DC, h_DC, and c_DC are convex. Although precise solutions to DC problems are not always tractable, lower bounds can be attained by branch-and-bound (B&B) methods (Horst and Thoai, 1999). A simple B&B approach is to partition Θ into a set of simplexes. Within the j^th simplex, a lower bound on g_DC(θ)−h_DC(θ) can be determined by replacing h_DC with the hyperplane function h_j satisfying h_j (υ) = h_DC(υ) at each vertex υ of the j^th simplex. Within this partition, g_DC(θ) − h_DC(θ) is lower bounded by l_j := min_θ g_DC(θ) − h_j(θ), which can be computed as the solution to a convex optimization problem. Any partition for which l_j is found to be too high is disregarded. Once a bound l_j is computed for each partition, the partition with the lowest value l_j is selected to be subdivided further, and additional lower bounds are recomputed for each new, resulting partition. This procedure continues until a sufficiently tight lower bound is attained (for more detailed procedures, see Horst and Thoai, 1999).

This approach allows us to conservatively approximate bounds on $\hat{M R} (f)$ in the form of Eq 6.1 & 6.4 by replacing ${\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})$ and ${\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})$ with lower bounds from the B&B procedure. Although it will always yield valid bounds, the procedure may converge slowly when the dimension of Θ is large, giving highly conservative results. For some special cases of model classes however, even high dimensional DC problems simplify greatly. We discuss these cases in the next section.

7. MR & MCR for Linear Models, Additive Models, and Regression Models in a Reproducing Kernel Hilbert Space

For linear or additive models, many simplifications can be made to our approaches for MR and MCR. To simplify the interpretation of MR, we show below that population-level MR for a linear model can be expressed in terms of the model’s coefficients (Section 7.1). To simplify computation, we show that the cost of computing empirical MR for a linear model grows only linearly in n (Section 7.1), even though the number of terms in the definition of empirical MR grows quadratically (see Eqs 3.3 & 3.6).

Moving on from MR, we show how empirical MCR can be computed for the class of linear models (Section 7.2), for regularized linear models (Section 7.3), and for regression models in a reproducing kernel Hilbert space (RKHS, Section 7.4). To do this, we build on the approach in Section 6 by giving approaches for minimizing arbitrary combinations of ê_switch(f) and ê_orig(f) across $f \in F$ . Even when the associated objective functions are non-convex, we can tractably obtain global minima for these model classes. We also discuss procedures to determine an upper bound B_ind on the loss for any observation when using these model classes (see Assumption 1).

Throughout this section, we assume that $X \in R^{p}$ for $p \in Z^{+}$ , that $Y \subset R^{1}$ , and that L is the squared error loss function L(f, (y, x₁, x₂) = (y − f(x₁, x₂))². As in Section 6, we also assume that $0 < \min_{f \in F} {\hat{e}}_{orig} (f)$ , to ensure that empirical MR is finite.

7.1. Interpreting and Computing MR for Linear or Additive Models

We begin by considering MR for linear models evaluated with the squared error loss. For this setting, we can show both an interpretable definition of MR, as well as a computationally efficient formula for ê_switch(f).

Proposition 15 (Interpreting MR, and computing empirical MR for linear models) For any prediction model f, let e_orig(f), e_switch(f), ê_orig(f), and ê_swith(f) be defined based on the squared error loss L(f, (y, x₁, x₂)) := (y − f (x₁, x₂))² for $y \in R$ , $x_{1} \in R^{p_{1}}$ , and $x_{2} \in R^{p_{2}}$ , where p₁ and p₂ are positive integers. Let β = (β₁, β₂) and f_β satisfy $β_{1} \in R^{p_{1}}$ , $β_{2} \in R^{p_{2}}$ , and $f_{β} (x) = x^{'} β = x_{1}^{'} β_{1} + x_{2}^{'} β_{2}$ . Then

M R (f_{β}) = 1 + \frac{2}{e_{o r i g} (f_{β})} {C o υ (Y, X_{1}) β_{1} - β_{2}^{'} C o υ (X_{2}, X_{1}) β_{1}},

(7.1)

and, for finite samples,

{\hat{e}}_{s w i t c h} (f_{β}) = \frac{1}{n} {y^{'} y - 2 {[\begin{matrix} {X^{'}}_{1} W_{y} \\ {X^{'}}_{2} y \end{matrix}]}^{'} β + β^{'} [\begin{matrix} {X^{'}}_{1} X_{1} & {X^{'}}_{1} W X_{2} \\ {X^{'}}_{2} W X_{1} & {X^{'}}_{2} X_{2} \end{matrix}] β},

(7.2)

where $W : = \frac{1}{n - 1} (1_{n} 1_{n}^{'} - I_{n})$ , 1_n is the n-length vector of ones, and I_n is the n × n identity matrix.

Eq 7.1 shows that model reliance for linear models can be interpreted in terms of the population covariances, the model coefficients, and the model’s accuracy. Gregorutti et al. (2017) show an equivalent formulation of Eq 7.1 under the stronger assumptions that f_β is equal to the conditional expectation function of Y (that is, $f_{β} = E (Y | X = x)$ ), and the covariates X₁ and X₂ are centered.

Eq 7.2 shows that, although the number of terms in the definition of ê_switch grows quadratically in n (see Eq 3.3), the computational complexity of ê_switch(f_β) for a linear model f_β grows only linearly in n. Specifically, the terms $X_{1}^{'} W y$ and $X_{1}^{'} W X_{2}$ in Eq 7.2 can be computed as $\frac{1}{n - 1} {(X_{1}^{'} 1_{n}) (1_{n}^{'} y) - (X_{1}^{'} y)}$ and $\frac{1}{n - 1} {(X_{1}^{'} 1_{n}) (1_{n}^{'} X_{2}) - (X_{1}^{'} X_{2})}$ respectively, where the computational complexity of each term in parentheses grows linearly in n.

As in Gregorutti et al. (2017), both results in Proposition 15 readily generalize to additive models of the form f_g₁,g₂ (X₁,X₂) := g₁ (X₁) + g₂(X₂), since permuting X₁ is equivalent to permuting g₁(X₁).

7.2. Computing Empirical MCR for Linear Models

Building on the computational result from the previous section, we now consider empirical MCR computation for linear model classes of the form

F_{lm} : = {f_{β} : f_{β} (x) = x^{'} β, β \in R^{p}} .

In order to implement the computational procedure from Sections 6.1 and 6.2, we must be able to minimize arbitrary linear combinations of ê_orig(f_β) and ê_switch(f_β). Fortunately, for linear models, this minimization reduces to a quadratic program, as we show in the next remark.

Remark 16 (Tractability of empirical MCR for linear model classes) For any $f_{β} \in F_{lm}$ and any fixed coefficients $ξ_{orig}, ξ_{switch} \in R$ , the linear combination

ξ_{o r i g} {\hat{e}}_{o r i g} (f_{β}) + ξ_{s w i t c h} {\hat{e}}_{s w i t c h} (f_{β})

(7.3)

is proportional in β to the quadratic function − 2q′β + β′Qβ, where

Q : = ξ_{o r i g} X^{'} X + ξ_{s w i t c h} [\begin{matrix} X_{1}^{'} X_{1} & X_{1}^{'} W X_{2} \\ X_{2}^{'} W X_{1} & X_{2}^{'} X_{2} \end{matrix}], q : = {(ξ_{o r i g} y^{'} X + ξ_{s w i t c h} {[\begin{matrix} X_{1}^{'} W y \\ X_{2}^{'} y \end{matrix}]}^{'})}^{'},

and $W : = \frac{1}{n - 1} (1_{n} 1_{n}^{'} - I_{n})$ . Thus, minimizing ξ_origê_orig(f_β) + ξ_switchê_switch(f_β) is equivalent to an unconstrained (possibly non-convex) quadratic program.

Because our empirical MCR computation procedure from Sections 6.1 and 6.2 consists of minimizing a sequence of objective functions in the form of Eq 7.3, Remark 16 shows us that this procedure is tractable for the class of unconstrained linear models.

7.3. Regularized Linear Models

Next, we continue to build on the results from Section 7.2 to calculate boundaries on $\hat{M R}$ for regularized linear models. We consider model classes formed by quadratically constrained subsets of $F_{lm}$ , defined as

F_{lm, r_{lm}} : = {f_{β} : f_{β} (x) = x^{'} β, β ϵ R^{p}, β^{'} M_{l m} β \leq r_{lm}},

(7.4)

where M_lm and r_lm are pre-specified. Again, this class describes linear models with a quadratic constraint on the coefficient vector.

7.3.1. Calculating MCR

As in Section 7.2, calculating bounds on $\hat{M R}$ via Lemmas 9 & 13 requires that are able to minimizing linear combinations ξ_origê_orig(f_β) + ξ_switchê_switch (f_β) across $f_{β} \in F_{lm, r_{lm}}$ for arbitrary $ξ_{orig}, ξ_{switch} \in R$ . Applying Remark 16, we can again equivalently minimize −2q′β + β′Qβ subject to the constraint in Eq 7.4:

\begin{array}{l} minimize - 2 q^{'} β + β^{'} Q β \\ subject to β^{'} M_{lm} β \leq r_{lm} . \end{array}

(7.5)

The resulting optimization problem is a (possibly non-convex) quadratic program with one quadratic constraint (QP1QC). This problem is well-studied, and is related to the trust region problem (Boyd and Vandenberghe, 2004; Pólik and Terlaky, 2007; Park and Boyd, 2017). Thus, the bounds on MCR presented in Sections 6.1 and 6.2 again become computationally tractable for the class of quadratically constrained linear models.

7.3.2. Upper Bounding the Loss

One benefit of constraining the coefficient vector (β′M_lmβ ≤ r_lm) is that it facilitates determining an upper bound B_ind on the loss function L(f_β, (y, x)) = (y − x′β)², which automatically satisfies Assumption 1 for all $f \in F_{lm, r_{lm}}$ . The following lemma gives sufficient conditions to determine B_ind.

Lemma 17 (Loss upper bound for linear models) If M_lm is positive definite, Y is bounded within a known range, and there exists a known constant $r_{X}$ such that $x^{'} M_{l m}^{- 1} x \leq r_{X}$ for all $x \in (X_{1} \times X_{2})$ , then Assumption 1 holds for the model class $F_{l m, r_{l m}}$ , the squared error loss function, and the constant

B_{i n d} = \max [{\min_{y \in Y} (y) - \sqrt{r X r_{l m}}}^{2}, {\max_{y \in Y} (y) - \sqrt{r X r_{l m}}}^{2}] .

In practice, the constant $r_{X}$ can be approximated by the empirical distribution of X and Y. The motivation behind the restriction $x^{'} M_{lm}^{- 1} x \leq r_{X}$ in Lemma 17 is to create complementary constraints on X and β. For example, if M_lm is diagonal, then the smallest elements of M_lm correspond to directions along which β is least restricted by β′M_lmβ ≤ r_lm (Eq 7.5), as well as the directions along which x is most restricted by $x^{'} M_{lm}^{- 1} x \leq r_{X}$ (Lemma 17).

7.4. Regression Models in a Reproducing Kernel Hilbert Space (RKHS)

We now expand our scope of model classes by considering regression models in a reproducing kernel Hilbert space (RKHS), which allow for nonlinear and nonadditive features of the covariates. We show that, as in Section 7.3, minimizing a linear combination of ê_orig(f) and ê_switch(f) across models f in this class can be expressed as a QP1QC, which allows us to implement the binary search procedure of Sections 6.1 & 6.2.

First we introduce notation required to describe regression in a RKHS. Let D be a (R×p) matrix representing a pre-specified dictionary of R reference points, such that each row of D is contained in $X = R^{p}$ . Let k be a pre-specified positive definite kernel function, and let μ be a prespecified estimate of $E Y$ . Let K_D be the R × R matrix with K_D[i,j] = k(D_[i,·], D_[j,·]). We consider prediction models of the following form, where the distance to each reference point is used as a regression feature:

F_{D, r_{k}} = {f_{α} : f_{α} (x) = μ + \sum_{i = 1}^{R} k (x, D_{[i, \cdot]}) α_{[i]}, {‖ f_{α} ‖}_{k} \leq r_{k}, α ϵ R^{R}} .

(7.6)

Above, the norm ‖f_α‖_k is defined as

{‖ f_{α} ‖}_{k} : = \sum_{i = 1}^{R} \sum_{j = 1}^{R} α_{[i]} α_{[j]} k (D_{[i, \cdot]}, D_{[j, \cdot]}) = α^{'} K_{D α} .

(7.7)

In the next two sections, we show that bounds on empirical MCR can again be tractably computed for this class, and that the loss for models in this class can be feasibly upper bounded.

7.4.1. Calculating MCR

Again, calculating bounds on $\hat{M R}$ from Lemmas 9 & 13 requires us to be able to minimize arbitrary linear combinations of ê_orig(f_α) and ê_switch(f_α).

Given a size-n sample of test observations Z = [ y X ], let K_orig be the n × R matrix with elements K_orig[i,j] = k (X_[i,·], D_[j,·]). Let Z_switch = [ y_switch X_switch ] be the (n(n − 1)) × (1 + p) matrix with rows that contain the set {(y_[i], X_1[j,·], X_2[i,·]) : i, j ∈ {1, … , n} and i ≠ j}. Finally, let K_switch be the n(n − 1) × R matrix with K_{switch_[i,j]} = k (X_switch[i,·], D_[j,·]).

For any two constants $ξ_{orig}, ξ_{switch} \in R$ , we can show that minimizing the linear combination $ξ_{orig} {\hat{e}}_{orig} (f_{α}) + ξ_{switch} {\hat{e}}_{switch} (f_{α})$ over $F_{D, r_{k}}$ is equivalent to the minimization problem

minimize \frac{ξ_{orig}}{n} {‖ y - μ - K_{orig} α ‖}_{2}^{2} + \frac{ξ_{switch}}{n (n - 1)} {‖ y_{switch} - μ - K_{switch} α ‖}_{2}^{2}

(7.8)

subject to α^{'} K_{D} α < r_{k} .

(7.9)

Like Problem 7.5, Problem 7.8-7.9 is a QP1QC. To show Eqs 7.8–7.9, we first write ê_orig(f_α) as

{\hat{e}}_{orig} (f_{α}) = \frac{1}{n} {\sum_{i = 1}^{n} {y_{[i]} - f_{α} (X_{[i, \cdot]})}}^{2} = \frac{1}{n} {\sum_{i = 1}^{n} {y_{[i]} - μ - \sum_{j = 1}^{R} k (X_{[i, \cdot]}, D_{[j, \cdot]}) α_{[j]}}}^{2} = \frac{1}{n} {\sum_{i = 1}^{n} {y_{[i]} - μ - {K^{'}}_{orig [i, \cdot]} α}}^{2}

(7.10)

= \frac{1}{n} {‖ y - μ - K_{orig} α ‖}_{2}^{2} .

(7.11)

Following similar steps, we can obtain

{\hat{e}}_{switch} (f_{α}) = \frac{1}{n (n - 1)} {‖ y_{switch} - μ - K_{switch} α ‖}_{2}^{2} .

Thus, for any two constants $ξ_{orig}, ξ_{switch} \in R$ , we can see that ξ_origê_orig(f_α)+ξ_switchê_switch(f_α) is quadratic in α. This means that we can tractably compute bounds on empirical MCR for this class as well.

7.4.2. Upper Bounding the Loss

Using similar steps as in Section 7.3.2, the following lemma gives sufficient conditions to determine B_ind for the case of regression in a RKHS.

Lemma 18 (Loss upper bound for regression in a RKHS) Assume that Y is bounded within a known range, and there exists a known constant r_D such that $υ (x)^{'} K_{D}^{- 1} υ (x) \leq r D$ for all $x \in (X_{1} \times X_{2})$ , where $υ : R^{p} \to R^{R})$ is the function satisfying υ(x)_[i] = k(x, D_[i,·]). Under these conditions, Assumption 1 holds for the model class $F_{D, r_{k}}$ , the squared error loss function, and the constant

B_{i n d} = \max [{\min_{y \in Y} (y) - (μ + \sqrt{r_{D} r_{k}})}^{2}, {\max_{y \in Y} (y) + (μ + \sqrt{r_{D} r_{k}})}^{2}] .

Thus, for regression models in a RKHS, we can satisfy Assumption 1 for all models in the class.

8. Connections Between MR and Causality

Our MR approach can be fundamentally described as studying how a model’s behavior changes under an intervention on the underlying data. We aim to study the causal effect of this intervention on the model’s performance. This goal mirror’s the conventional causal inference goal of studying how an intervention on variables will change outcomes generated by a process in nature.

This section explores this connection to causal inference further. Section 8.1 shows that when the prediction model in question is the conditional expectation function from nature itself, MR reduces to commonly studied quantities in the causal literature. Section 8.2 proposes an alternative to MR that focuses on interventions, or data perturbations, that are likely to occur in the underlying data generating process.

8.1. Model Reliance and Causal Effects

In this section, we show a connection between population-level model reliance and the conditional average causal effect. For consistency with the causal inference literature, we temporarily rename the random variables (Y, X₁, X₂) as (Y, T, C), with realizations (y, t, c). Here, T := X₁ represents a binary treatment indicator, C := X₂ represents a set of baseline covariates (“C” is for “covariates”), and Y represents an outcome of interest. Under this notation, $e_{orig} (f)$ represents the expected loss of a prediction function f, and $e_{switch} (f)$ denotes the expected loss in a pair of observations in which the treatment has been switched. Let $f_{0} (t, c) : = E (Y | C = c, T = t)$ be the (unknown) conditional expectation function for Y, where we place no restrictions on the functional form of f₀.

Let Y₁ and Y₀ be potential outcomes under treatment and control respectively, such that $Y = Y_{0} (1 - T) + Y_{1} T$ . The treatment effect for an individual is defined as Y₁ − Y₀, and the average treatment effect is defined as $E (Y_{1} - Y_{0})$ . Let $CATE (c) : = E (Y_{1} - Y_{0} | C = c)$ be the (unknown) conditional average treatment effect of T for all patients with C = c. Causal inference methods typically assume $(Y_{1} - Y_{0}) ⊥ T | C$ (conditional ignorability), and $0 < P (T = 1 | C = c) < 1$ for all values of c (positivity), in order for f₀ and CATE to be well defined and identifiable.

The next proposition quantifies the relation between the conditional average treatment effect function (CATE) and the model reliance of f₀ on X₁.

Proposition 19 (Causal interpretations of MR) For any prediction model f, let e_orig(f) and e_switch(f) be defined based on the squared error loss $L (f, (y, t, c)) : = {(y - f (t, c))}^{2}$ .

If $(Y_{1} - Y_{0}) ⊥ T | C$ (conditional ignorability) and $0 < P (T = 1 | C = c) < 1$ for all values of c (positivity), then MR(f₀) is equal to

1 + \frac{V a r (T)}{E_{T, C} V a r (Y | T, C)} \sum_{t \in {0, 1}} {E {(Y_{1} - Y_{0} | T = t)}^{2} + V a r (C A T E (C) | T = t)},

(8.1)

where Var(T) is the marginal variance of the treatment assignment.

We see above that model reliance decomposes into several terms that are each individually important in causal inference: the treatment prevalence (via Var(T)); the variability in Y that is not explained by C or T; the magnitude of the average treatment effect, conditional on T; and the variance of the conditional average treatment effect across subgroups. For example, if all patients are treated, then scrambling the treatment in a random pair of observations has no effect on the loss. In this case we see that Var(T) = 0 and MR(f₀) = 1, indicating no reliance. When Var(T) > 0, a higher average treatment effect magnitude $(E {(Y_{1} - Y_{0} | T = t)}^{2})$ corresponds to f₀ requiring T more heavily to predict Y, all else equal. Similarly, if there is a high degree of treatment effect heterogeneity across subgroups (that is, when Var(CATE(C)|T = t) is large), the model f₀ will again use T more heavily when predicting Y. For example, a treatment may be important for predicting Y even if the average treatment effect is zero, so long as the treatment helps some subgroups more than others.

8.2. Conditional Importance: Adjusting for Dependence Between X₁ and X₂

One common scenario where multiple models achieve low loss is when the sets of predictors X₁ and X₂ are highly correlated, or contain redundant information. Models may predict well either through reliance on X₁, or through reliance on X₂, and so MCR will correctly identify a wide range of potential reliances on X₁. However, we may specifically be interested how much models rely on the information in X₁ that cannot alternatively be gleaned from X₂.

For example, age and accumulated wealth may be correlated, and both may be predictive of future promotion. We may wish to know the how much a model for predicting promotion relies on information that is uniquely available from wealth measurements.

To formalize this notion, we define an alternative to e_switch where noise is added to X₁ in a way that accounts for the dependence between X₁ and X₂. Given a fixed prediction model f, we ask: how well would the model f perform if the values of X₁ were scrambled across observations with the same value for X₂. Specifically, let $Z^{(a)} = (Y^{(a)}, X_{1}^{(a)}, X_{2}^{(a)})$ and $Z^{(b)} = (Y^{(b)}, X_{1}^{(b)}, X_{2}^{(b)})$ denote a pair of independent random vectors following the same distribution as $Z = (Y, X_{1}, X_{2})$ , as in Section 3, and let

e_{cond} (f) : = E_{X_{2}} E_{(Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)})} [L {f, (Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)})} | X_{2}^{(a)} = X_{2}^{(b)} = X_{2}] .

(8.2)

In words, e_cond(f) is the expected loss of a given model f across pairs of observations (Z^(a), Z^(b)) in which the values of $X_{1}^{(a)}$ and $X_{1}^{(b)}$ have been switched, given that these pairs match on X₂. This quantity can also be interpreted as the expected loss of f if noise were added to X₁ in such a way that X₁ was no longer informative of Y, given X₂, but that the joint distribution of the covariates (X₁, X₂) was maintained.

We then define conditional model reliance, or “core” model reliance (CMR) for a fixed function f as

C M R (f) = \frac{e_{cond} (f)}{e_{orig} (f)} .

That is, CMR is the factor by which the model’s performance degrades when the information unique to X₁ is removed. If X₁ ⊥ X₂, then X₁ contains no redundant information, and CMR and MR are equivalent. Otherwise, all else equal, CMR will decrease as X₂ becomes more predictive of X₁. Analogous to MCR, we define conditional MCR (CMCR) in the same way as in Eq 2.2, but with MR replaced with CMR. In comparison with MCR, CMCR will generally result in a range that is closer to 1 (null reliance).

An advantage of CMR is that it restricts the “noise-corrupted” inputs to be within the domain $X$ , rather than the expanded domain $X_{1} \times X_{2}$ considered by MR. This means that CMR will not be influenced by impossible combinations of x₁ and x₂, while MR may be influenced by them. Hooker (2007) discuss a similar issue, arguing that evaluations of a prediction model’s behavior in different circumstances should be weighted by, for example, how likely those circumstances are to occur.

A challenge facing the CMR approach is that matched pairs such as those in Eq 8.2 may occur rarely, making it difficult to estimate CMR nonparametrically. We explore this estimation issue next.

8.2.1. Estimation of CMR by Weighting, Matching, or Imputation

If the covariate space is discrete and low dimensional, nonparametric methods based on weighting or matching can be effective means of estimating CMR. Specifically, we can weight each pair of sample points i, j according to how likely the covariate combination $(X_{1 [i, \cdot]}, X_{2 [j, \cdot]})$ is to occur, as in

{\hat{e}}_{weight} (f) : = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} \sum_{j \neq i} ω (X_{1 [i, \cdot]}, X_{2 [j, \cdot]}) \times L {f, (Y_{[j]}, X_{1 [i, \cdot]}, X_{2 [j, \cdot]})},

where $w (x_{1}, x_{2}) : = \frac{P (X_{1} = x_{1} | X_{2} = x_{2})}{P (X_{1} = x_{1})}$ is an importance weight (see also Hooker, 2007). Here, pairs of observations corresponding to unlikely or impossible combinations of covariates are down-weighted or discarded, respectively. If the probabilities $P (X_{1} = x_{1} | X_{2} = x_{2})$ and $P (X_{1} = x_{1})$ are known, then ${\hat{e}}_{w e i g h t} (f)$ is unbiased for e_cond(f) (see Appendix A.7).

Alternatively, if X₂ is discrete and low dimensional, we can restrict estimates of e_cond(f) to only consider pairs of sample observations in which X₂ is constant, or “matched,” as in

{\hat{e}}_{match} (f) : = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} \sum_{j \neq i} \frac{1 (X_{2 [j, \cdot]} = X_{2 [i, \cdot]})}{P (X_{2} = X_{2 [i, \cdot]})} \times L {f, (Y_{[j]}, X_{1 [i, \cdot]}, X_{2 [j, \cdot]})} .

(8.3)

This approach allows estimation of CMR without knowledge of the conditional distribution $P (X_{1} = x_{1} | X_{2} = x_{2})$ . If the inverse probability weight $P {(X_{2} = X_{2 [i, \cdot]})}^{- 1}$ is known, then ê_match(f) is unbiased for e_cond(f) (see Appendix A.7). The weight $P {(X_{2} = X_{2 [i, \cdot]})}^{- 1}$ accounts for the fact that, for any given value x₂, the proportion of observations of X₂ taking the value x₂ will generally not be the same as the proportion of matched pairs $(X_{2}^{(a)}, X_{2}^{(b)})$ taking value the x₂, and so simply summing over all matched pairs would lead to bias. In practice, the proportion $P (X_{2} = X_{2 [i, \cdot]})$ can be approximated as $\frac{1}{n - 1} \sum_{j^{'} \neq i} 1 (X_{2 [i, \cdot]} = X_{2 [j^{'}, \cdot]})$ , with minor adjustments to Eq 8.3 to avoid dividing by zero. The resulting estimate is analogous to exact matching procedures commonly used in causal inference, which are known to work best when the covariates are discrete and low dimensional, in order for exact matches to be common (Stuart, 2010).

However, when the covariate space is continuous or high dimensional, we typically cannot estimate CMR nonparametrically. For such cases, we propose to estimate CMR under an assumption of homogeneous residuals. Specifically, we define μ₁ to be the conditional expectation function $μ_{1} (x_{2}) = E (X_{1} | X_{2} = x_{2})$ , and assume that the random residual X − μ₁(X₂) is independent of X₂. Under this assumption, it can be shown that

e_{cond} (f) = E L [f, (Y^{(b)}, {X_{1}^{(a)} - μ_{1} (X_{2}^{(a)})} + μ_{1} (X_{2}^{(b)}, X_{2}^{(b)}))] .

That is, e_cond(f) is equal to the expected loss of f across random pairs of observations (Z^(a), Z^(b)) in which the value of the residual terms (in curly braces) have been switched. Because of the independence assumption, no matching or weighting is required. If μ₁ is known, then we can again produce an unbiased estimate using the U-statistic

{\hat{e}}_{impute} (f) : = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} \sum_{j \neq i} L [f, (y_{[j]}, {X_{1 [i, \cdot]} - μ_{1} (X_{2 [i, \cdot]})} + μ_{1} (X_{2 [j, \cdot]}, X_{2 [j, \cdot]})] .

This estimator aggregates over all pairs in our sample, switching the values of the residual terms (in curly braces) within each pair. In practice, when μ₁ is not known, an estimate of μ₁ can be achieved via regression or related machine learning techniques, and plugged in to the above equation. In this way, the assumption that $X - μ_{1} (X_{2}) ⊥ X_{2}$ allows us to estimate CMR without explicitly modeling the joint distribution of X₁ and X₂.

In the existing literature, Strobl et al. (2008) introduce a similar procedure for estimating conditional variable importance. However, a formal comparison to Strobl et al. is complicated by the fact that the authors do not define a specific estimand, and that their approach is limited to tree-based regression models. Other existing approaches conditional importance approaches include methods for redefining X₁ and X₂ to induce approximate independence, before computing an importance measure analogous to MR. This can be done by reducing the total number of covariates used, and hence reducing how well any one variable can be predicted by the others (as in Gregorutti et al., 2017). Alternatively, variables in X₂ that are predictive of X₁ can be regrouped directly into X₁ (as in Toloşi and Lengauer, 2011; see also the discussion from Kirk, Lewin and Stumpf, in Meinshausen and Bühlmann 2010).

In summary, CMR allows us to see how much a model relies on the information uniquely available in X₁. While CMR is more difficult to estimate than MR, several tractable approaches exist when X₂ is discrete, or when a homogenous residual assumption can be applied. One may also consider extending CMR by conditioning only on a subset of X₂. For example, we may consider conditioning only on elements of X₂ that are believed to causally effect X₁, by changing the outer expectation in Eq 8.2. For simplicity, we focus on the base case of estimating MR in this paper. Similar results could potentially be carried over for CMR as well.

9. Simulations

In this section, we first present a toy example to illustrate the concepts of MR, MCR, and AR. We then present a Monte Carlo simulation studying the effectiveness of bootstrap CIs for MCR.

9.1. Illustrative Toy Example with Simulated Data

To illustrate the concepts of MR, MCR, and AR (see Section 3.2), we consider a toy example where $X = (X_{1}, X_{2}) \in R^{2}$ , and Y ∈ {−1, 1} is a binary group label. Our primary goal in this section is to build intuition for the differences between these three importance measures, and so we demonstrate them here only in a single sample. We focus on the empirical versions of our importance metrics $(\hat{M R}, {\hat{M C R}}_{-} and {\hat{M C R}}_{+})$ , and compare them against AR, which is typically interpreted as an in-sample measure (Breiman, 2001), or as an intermediate step to estimate an alternate importance measure in terms of variable rankings (Gevrey et al., 2003; Olden et al., 2004).

We simulate $X | Y = - 1$ from an independent, bivariate normal distribution with means $E (X_{1} | Y = - 1) = E (X_{2} | Y = - 1) = 0$ and variances $V a r (X_{1} | Y = - 1) = V a r (X_{2} | Y = - 1) = \frac{1}{9}$ . We simulate X|Y = 1 by drawing from the same bivariate normal distribution, and then adding the value of a random vector $(C_{1}, C_{2}) : = (\cos (U), \sin (U))$ , where U is a random variable uniformly distributed on the interval $[- π, π]$ . Thus, (C₁, C₂) is uniformly distributed across the unit circle.

Given a prediction model $f : X \to R$ , we use the sign of f(X₁, X₂) as our prediction of Y. For our loss function, we use the hinge loss $L (f, (y, x_{1}, x_{2})) = {(1 - y f (x_{1}, x_{2}))}_{+}$ , where ${(a)}_{+} = a$ if a ≥ 0 and (a)₊ = 0 otherwise. The hinge loss function is commonly used as a convex approximation to the zero-one loss $L (f, (y, x_{1}, x_{2})) = 1 [y \neq sign {f (x_{1}, x_{2})}]$ .

We simulate two samples of size 300 from the data generating process described above, one to be used for training, and one to be used for testing. Then, for the class of models used to predict Y, we consider the set of degree-3 polynomial classifiers

F_{d 3} = {f_{θ} : f_{θ} (x_{1}, x_{2}) = θ_{[1]} + θ_{[2]} x_{1} + θ_{[3]} x_{2} + θ_{[4]} x_{1}^{2} + θ_{[5]} x_{2}^{2} + θ_{[6]} x_{1} x_{2} + θ_{[7]} x_{1}^{3} + θ_{[8]} x_{2}^{3} + θ_{[9]} x_{1}^{2} x_{2} + θ_{[10]} x_{1} x_{2}^{2}; {‖ θ_{[- 1]} ‖}_{2}^{2} \leq r_{d 3}},

where θ_[−1] denotes all elements of θ except θ_[1], and where we set r_d3 to the value that minimizes the 10-fold cross-validated loss in the training data. Let $A_{d 3}$ be the algorithm that minimizes the hinge loss over the (convex) feasible region ${f_{θ} : {‖ θ_{[- 1]} ‖}_{2}^{2} \leq r_{d 3}}$ . We apply $A_{d 3}$ to the training data to determine a reference model f_ref. Also using the training data, we set ϵ equal to 0.10 multiplied by the cross-validated loss of $A_{d 3}$ , such that $R (ϵ, f_{ref}, F_{d 3})$ contains all models in $F_{d 3}$ that exceed the loss of f_ref by no more than approximately 10% (see Eq 4.1). We then calculate empirical AR, MR, and MCR using the test observations.

We begin by considering the AR of $A_{d 3}$ on X₁. Calculating AR requires us to fit two separate models, first using all of the variables to fit a model on the training data, and then again using only X₂. In this case, the first model is equivalent to f_ref. We denote the second model as ${\hat{f}}_{2}$ . To compute AR, we evaluate f_ref and ${\hat{f}}_{2}$ in the test observations. We illustrate this AR computation in Figure 6–A, marking the classification boundaries for f_ref and ${\hat{f}}_{2}$ by the black dotted line and the blue dashed lines respectively, and marking the test observations by labelled points (“x” for Y = 1, and “o” for Y = −1). Comparing the loss associated with these two models gives one form of AR–an estimate of the necessity of X₁ for the algorithm $A_{d 3}$ . Alternatively, to estimate the sufficiency of X₁, we can compare the reference model f_ref against the model resulting from retraining algorithm $A_{d 3}$ only using X₁. We refer to this third model as ${\hat{f}}_{1}$ , and mark its classification boundary by the solid blue lines in Figure 6–A.

Each of the classifiers in Figure 6–A can also be evaluated for its reliance on X₁, as shown in Figure 6–C. Here, we use ê_divide in our calculation of $\hat{M R}$ (see Eq 3.5). Unsurprisingly, the classifier fit without using X₁ (blue dashed line) has a model reliance of $\hat{M R} ({\hat{f}}_{2}) = 1$ . The reference model f_ref (dotted black line) has a model reliance of $\hat{M R} (f_{r e f}) = 3.47$ . Each MR value has an interpretation contained to a single model. That is, $\hat{M R}$ compares a single model’s behavior under different data distributions, rather than the AR approach of comparing different models’ behavior on marginal distributions from a single joint distribution.

We illustrate MCR in Figure 6–B. In contrast to AR, MCR is only ever a function of well-performing prediction models. Here, we consider the empirical ϵ-Rashomon set $\hat{R} (ϵ, f_{r e f}, F_{d 3})$ , the subset of models in $F_{d 3}$ with test loss no more than ϵ above that of f_ref. We show the classification boundary associated with 15 well-performing models contained in $\hat{R} (ϵ, f_{r e f}, F_{d 3})$ by the gray solid lines. We also show two of the models in $\hat{R} (ϵ, f_{r e f}, F_{d 3})$ that approximately maximize and minimize empirical reliance on X₁ among models in $\hat{R} (ϵ, f_{r e f}, F_{d 3})$ . We denote these models as ${\hat{f}}_{+, ϵ}$ and ${\hat{f}}_{-, ϵ}$ , and mark them by the solid green and dashed green lines respectively. For every model shown in Figure 6–B, we also mark its model reliance in Figure 6–C. We can then see from Figure 6–C that $\hat{M R}$ for each model in $\hat{R} (ϵ, f_{r e f}, F_{d 3})$ is contained between $\hat{M R} ({\hat{f}}_{-, ϵ})$ and $\hat{M R} ({\hat{f}}_{+, ϵ})$ , up to a small approximation error.

In summary, unlike AR, MCR is only a function of models that fit the data well.

9.2. Simulations of Bootstrap Confidence Intervals

In this section we study the performance of MCR under model class misspecification. Our goal will be to estimate how much the conditional expectation function $f_{0} (x) = E (Y | X = x)$ relies on subsets of covariates. Given a reference model f_ref and model class $F$ , our ability to describe MR(f₀) will hinge on two conditions:

Condition 20 (Nearly correct model class) The class $F$ contains a well-performing model $\tilde{f} \in R (ϵ, f_{r e f}, F)$ satisfying $M R (\tilde{f}) = M R (f_{0})$ (see Eq 4.1).

Condition 21 (Bootstrap coverage) Bootstrap CIs for empirical MCR give appropriate coverage of population-level MCR.

Condition 20 ensures that the interval [MCR₋(ϵ),MCR₊(ϵ)] contains MR(f₀), and Condition 21 ensures that this interval can be estimated in finite samples. Condition 20 can also be interpreted as saying that the model reliance value of MR(f_c) is “well supported” by the class $F$ , even if $F$ does not contain f₀. Our primary goal is to assess whether CIs derived from MCR can give appropriate coverage of MR(f₀), which depends on both conditions. As a secondary goal, we also would like to be able to assess Conditions 20 & 21 individually.

Verifying the above conditions requires that we are able to calculate population-level MCR. To this end, we draw samples with replacement from a finite population of 20,000 observations, in which MCR can also be calculated directly. To derive a CI based on MCR, we divide each simulated sample $Z_{s}$ into a training subset and analysis subset. We use the training subset to fit a reference model f_ref,s, which is required for our definition of population-level MCR. We calculate a bootstrap CI by drawing 500 bootstrap samples from the analysis subset, and computing ${\hat{M C R}}_{-} (ϵ)$ and ${\hat{M C R}}_{+} (ϵ)$ in each bootstrap sample by optimizing over $\tilde{R} (ϵ, f_{r e f, s}, F)$ . We then take the 2.5% percentile of ${\hat{M C R}}_{-} (ϵ)$ values across bootstrap samples, and the 97.5% percentile of ${\hat{M C R}}_{+} (ϵ)$ values across bootstrap samples, as the lower and upper endpoints of our CI, respectively. We repeat this procedure for both X₁ and X₂.

We generate data according to a model with increasing amounts of nonlinearity. For $γ \in {0, 0.1, 0.2, 0.3, 0.4, 0.5}$ , we simulate continuous outcomes as $Y = f_{0} (X) + E$ , where f₀ is the function $f_{0} (x) = \sum_{j = 1}^{p} j x_{[j]} - γ x_{[j]}^{2}$ the covariate dimension p is equal to 2, with X₁ and X₂ defined as the first and second elements of X; the covariates X are drawn from a multivariate normal distribution with $E (X_{1}) = E (X_{2}) = 0$ , $Var (X_{1}) = Var (X_{2}) = 0$ , and $Cov (X_{1}, X_{2}) = 1 / 4$ ; and E is a normally distributed noise variable with mean zero and variance equal to $σ_{E}^{2} : = Var (f_{0} (X))$ . We consider sample sizes of n = 400 and 800, of which n_tr = 200 or 300 observations are assigned to the training subset respectively.

To implement our approach, we use the model class $F_{lm} = {f_{β} : f_{β} (x) = β_{[1]} + \sum_{j = 1}^{2} x_{[j]} β_{[j + 1]}, β \in R^{3}}$ . We set the performance threshold ϵ equal to $0.1 \times σ_{E}^{2}$ . We refer to this MCR implementation with $F_{lm}$ as “MCR-Linear.”

As a comparator method, we consider a simpler bootstrap approach, which we refer to as “Standard-Linear.” Here, we take 500 bootstrap samples from the simulated data $Z_{s}$ . In each bootstrap sample, indexed by b, we set aside n_tr training points to train a model $f_{b} \in F_{lm}$ , and calculate $\hat{M R} (f_{b})$ from the remaining data points. We then create a 95% bootstrap percentile CI for MR(f₀) by taking the 2.5% and 97.5% percentiles of $\hat{M R} (f_{b})$ across b = 1, … , 500.

9.2.1. Results

Overall, we find that MCR provides more robust and conservative intervals for the reliance of f₀ on X₁ and X₂, relative to standard bootstrap approaches. We also find that higher sample size generally exacerbates coverage errors due to misspecification, as methods become more certain of biased results.

MCR-Linear gave proper coverage for up to moderate levels of misspecification (γ = 0.3), where Standard-Linear began to break down (Figure 7). For larger levels of misspecification (γ ≥ 0.4), both MCR-Linear and Standard-Linear failed to give appropriate coverage.

The increased robustness of MCR comes at the cost of wider CIs. Intervals for MCR-Linear were typically larger than intervals for Standard-Linear by a factor of approximately 2-4. This is partly due to the fact that CIs for MCR are meant to cover the range of values [MCR₋(ϵ), MCR₊(ϵ)] (defined using f_ref,s), rather than to cover a single point.

When investigating Conditions 20 & 21 individually, we find that the coverage errors for MCR-Linear were largely attributable to violations of Condition 20. Condition 21 appears to hold conservatively for all scenarios studied–within each scenario, at least 95.9% of bootstrap CIs contained population-level MCR.

These simulation results highlight an aspect of MCR that is both a strength and a weakness: MCR is generic. MCR does not assume a particular means by which misspecification may occur, and is less powerful than sensitivity analyses which make that assumption correctly. Nonetheless, MCR still appears to add robustness. For sufficiently strong signals, an informative interval may still be returned. In our applied data analysis, below, we see that this is indeed the case.

10. Data Analysis: Reliance of Criminal Recidivism Prediction Models on Race and Sex

Evidence suggests that bias exists among judges and prosecutors in the criminal justice system (Spohn, 2000; Blair et al., 2004; Paternoster and Brame, 2008). In an aim to counter this bias, machine learning models trained to predict recidivism are increasingly being used to inform judges’ decisions on pretrial release, sentencing, and parole (Monahan and Skeem, 2016; Picard-Fritsche et al., 2017). Ideally, prediction models can avoid human bias and provide judges with empirically tested tools. But prediction models can also mirror the biases of the society that generates their training data, and perpetuate the same bias at scale. In the case of recidivism, if arrest rates across demographic groups are not representative of underlying crime rate (Beckett et al., 2006; Ramchand et al., 2006; U.S. Department of Justice - Civil Rights Devision, 2016), then bias can be created in both (1) the outcome variable, future crime, which is measured imperfectly via arrests or convictions, and (2) the covariates, which include the number of prior convictions on a defendant’s record (Corbett-Davies et al., 2016; Lum and Isaac, 2016). Further, when a prediction model’s behavior and mechanisms are an opaque black box, the model can evade scrutiny, and fail to offer recourse or explanations to individuals rated as “high risk.”

We focus here on the issue of transparency, which takes an important role in the recent debate about the proprietary recidivism prediction tool COMPAS (Larson et al., 2016; Corbett-Davies et al., 2016). While COMPAS is known to not rely explicitly on race, there is concern that it may rely implicitly on race via proxies-variables statistically dependent with race (see further discussion in Section 11).

Our goal is to identify bounds for how much COMPAS relies on different covariate subsets, either implicitly or explicitly, under certain assumptions (defined below). We analyze a public data set of defendants from Broward County, Florida, in which COMPAS scores have been recorded (Larson et al., 2016). Within this data set, we only included defendants measured as African-American or Caucasian (3,373 in total) due to sparseness in the remaining categories. The outcome of interest (Y) is the COMPAS violent recidivism score. Of the available covariates, we consider three variables which we refer to as “admissible”: an individual’s age, their number of priors, and an indicator of whether the current charge is a felony. We also consider two variables which we refer to as “inadmissible”: an individual’s race and sex. Our labels of “admissible” and “inadmissible” are not intended to be legally precise-indeed, the boundary between these types of labels is not always clear (see Section 10.2). We compute empirical MCR and AR for each variable group, as well as bootstrap CIs for MCR (see Section 9.2).

To compute empirical MCR and AR, we consider a flexible class of linear models in a RKHS to predict the COMPAS score (described in more detail below). Given this class, the MCR range (See Eq 2.2) captures the highest and lowest degree to which any model in the class may rely on each covariate subset. We assume that our class contains at least one model that relies on “inadmissible variables” to the same extent that COMPAS relies either on “inadmissible variables” or on proxies that are unmeasured in our sample (analogous to Condition 20). We make the same assumption for “admissible variables.” These assumptions can be interpreted as saying that the reliance values of COMPAS are relatively “well supported” by our chosen model class, and allows us to identify bounds on the MR values for COMPAS. We also consider the more conventional, but less robust approach of AR (Section 3.2), that is, how much would the accuracy suffer for a model-fitting algorithm trained on COMPAS score if a variable subset was removed?

These computations require that we predefine our loss function, model class, and performance threshold. We define MR, MCR, and AR in terms of the squared error loss $L (f, (y, x_{1}, x_{2})) = {y - f (x_{1}, x_{2})}^{2}$ . We define our model class $F_{D, r_{k}}$ in the form of Eq 7.6, where we determine D, μ, k, and r_k based on a subset $S$ of 500 training observations. We set D equal to the matrix of covariates from $S$ ; we set μ equal to the mean of Y in $S$ ; we set k equal to the radial basis function $k_{σ s} (x, \tilde{x}) = \exp (- \frac{{‖ x - \tilde{x} ‖}^{2}}{2 σ_{s}})$ , where we choose σ_s to minimize the cross-validated loss of a Nadaraya-Watson kernel regression (Hastie et al., 2009) fit to $S$ ; and we select the parameters r_k by cross-validation on $S$ . We set ϵ equal to 0.1 times the cross-validated loss on $S$ . Also using $S$ , we train a reference model $f_{ref} \in F_{D, r_{k}}$ Using the held-out 2,873 observations, we then estimate MR(f_ref) and MCR for $F_{D, r_{k}}$ . To calculate AR, we train models from $F_{D, r_{k}}$ using $S$ , and evaluate their performance in the held-out observations.

10.1. Results

Our results imply that race and sex play somewhere between a null role and a modest role in determining COMPAS score, but that they are less important than “admissible” factors (Figure 8). As a benchmark for comparison, the empirical MR of f_ref is equal to 1.09 for “inadmissible variables,” and 2.78 for “admissible variables.” The AR is equal to 0.94 and 1.87 for “inadmissible” and “admissible” variables respectively, roughly in agreement with MR. The MCR range for “inadmissible variables” is equal to [1.00, 1.56], indicating that for any model in $F_{D, r_{k}}$ with empirical loss no more than ϵ above that of f_ref, the model’s loss can increase by no more than 56% if race and sex are permuted. Such a statement cannot be made solely based on AR or MR methods, as these methods do not upper bound the reliance values of well-performing models. The bootstrap 95% CI for MCR on “inadmissible variables” is [1.00, 1.73]. Thus, under our assumptions, if COMPAS relied on sex, race, or their unmeasured proxies by a factor greater than 1.73, then intervals as low as what we observe would occur with probability < 0.05.

For “admissible variables” the MCR range is equal to [1.77,3.61], with a 95% bootstrap CI of [1.62, 3.96]. Under our assumptions, this implies if COMPAS relied on age, number of priors, felony indication, or their unmeasured proxies by a factor lower than 1.77, then intervals as high as what we observe would occur with probability < 0.05. This result is consistent with Rudin et al. (2019), who find age to be highly predictive of COMPAS score.

It is worth noting that the upper limit of 3.61 maximizes empirical MR on “admissible variables” not only among well-performing models, but globally across all models in the class (see Figure 8, and Eq 6.5). In other words, it is not possible to find models in $F_{D, r_{k}}$ that perform arbitrarily poorly on perturbed data, but still perform well on unperturbed data, and so the ratio of ê_switch(f) to ê_orig(f) has a finite upper bound. Because the regularization constraints of $F_{D, r_{k}}$ preclude MR values higher than 3.61, the MR of COMPAS on “admissible variables” may be underestimated by empirical MCR. Note also that both MCR intervals are left-truncated at 1, as it is often sufficiently precise to conclude that there exists a well-performing model with no reliance on the variables of interest (that is, MR equal to 1; see Appendix A.2).

10.2. Discussion & Limitations

Asking whether a proprietary model relies on sex and race, after adjusting for other covariates, is related to the fairness metric known as conditional statistical parity (CSP). A decision rule satisfies CSP if its decisions are independent of a sensitive variable, conditional on a set of “legitimate” covariates C (Corbett-Davies et al., 2017; see also Kamiran et al., 2013). Roughly speaking, CSP reflects the idea that groups of people with similar covariates C are treated similarly (Dwork et al., 2012), regardless of the sensitive variable (for example, race or sex). However, the criteria becomes superficial if too many variables are included in C, and care should be taken to avoid including proxies for the sensitive variables. Several other fairness metrics have also been proposed, which often form competing objectives (Kleinberg et al., 2017; Chouldechova, 2017; Nabi and Shpitser, 2018; Corbett-Davies et al., 2017). Here, if COMPAS was not influenced by race, sex, or variables related to race or sex (conditional on a set of “legitimate” variables), it would satisfy CSP.

Unfortunately, it is often difficult to distinguish between “legitimate” (or “admissible”) variables and “illegitimate” variables. Some variables function both as part of a reasonable predictor for risk, and, separately, as a proxy for race. Because of disproportional arrest rates, particularly for misdemeanors and drug-related offenses (U.S. Department of Justice - Civil Rights Devision, 2016; Lum and Isaac, 2016), prior misdemeanor convictions may act as such a proxy (Corbett-Davies et al., 2016; Lum and Isaac, 2016).

Proxy variables for race (defined as being statistically dependent with race) that are unmeasured in our sample are also not the only reason that race could be predictive of COMPAS score. Other inputs to the COMPAS algorithm might be associated with race only conditionally on variables we categorize as “admissible.” However, our result from Section 10.1 that race has limited predictive utility for COMPAS score suggests that such conditional relationships are also limited.

11. Conclusion

In this article, we propose MCR as the upper and lower limit on how important a set of variables can be to any well-performing model in a class. In this way, MCR provides a more comprehensive and robust measure of importance than traditional importance measures for a single model. We derive bounds on MCR, which motivate our choice of point estimates. We also derive connections between permutation importance, U-statistics, conditional variable importance, and conditional causal effects. We apply MCR in a data set of criminal recidivism, in order to help inform the characteristics of the proprietary model COMPAS.

Several exciting areas remain open for future research. One research direction closely related to our current work is the development of exact or approximate MCR computation procedures for other model classes and loss functions. We have shown that, for model classes where minimizing the empirical loss is a convex optimization problem, MCR can be conservatively computed via a series of convex optimization problems. Further, we have shown that computing ${\hat{M C R}}_{-}$ is often no more challenging that minimizing the empirical loss over a reweighted sample. General computation procedures for MCR are still an open research area.

Another direction is to consider MCR for variable selection. If MCR₊ is small for a variable, then no well-performing predictive model can heavily depend on that variable, indicating that it can be eliminated.

Our theoretical analysis of Rashomon sets depends on $F$ and f_ref being prespecified. Above, we have actualized this by splitting our sample into subsets of size n₁ and n₂, using the first subset to determine $F$ and f_ref, and conditioning on $F$ and f_ref when estimating MCR in the second subset. As a result, the boundedness constants in our assumptions (B_ind, B_ref, B_switch, and b_orig) depend on $F$ , and hence on n₁. However, because our results are non-asymptotic, we have not explored how Rashomon sets behave when n₁ and n₂ grow at different rates. An exciting future extension of this work is to study sequences of triples ${ϵ_{n_{1}}, f_{{ref, n}_{1}}, F_{n_{1}}}$ that change as n₁ increases, and the corresponding Rashomon sets $R (ϵ_{n_{1}}, f_{{ref, n}_{1}}, F_{n_{1}})$ , as this may more thoroughly capture how model classes are determined by analysts.

While we develop Rashomon sets with the goal of studying MR, Rashomon sets can also be useful for finite sample inferences about a wide variety of other attributes of best-in-class models (for example, Section 5). Characterizations of a Rashomon set itself may also be of interest. For example, in ongoing work, we are studying the size of a Rashomon set, and its connection to generalization of models and model classes (Semenova and Rudin, 2019). We are additionally developing methods for visualizing Rashomon sets (Dong and Rudin, 2019).

Acknowledgments

Support for this work was provided by the National Institutes of Health (grants P01CA134294, R01GM111339, R01ES024332, R35CA197449, R01ES026217, P50MD010428, DP2MD012722, R01MD012769, & R01ES028033), by the Environmental Protection Agency (grants 83615601 & 83587201-0), and by the Health Effects Institute (grant 4953-RFA14-3/16-4).

Appendix A. Miscellaneous Supplemental Sections

All labels for items in the following appendices begin with a letter (for example, Section A.2), while references to items in the main text contain only numbers (for example, Proposition 19).

A.1. Code

R code for our example in Section 9.1 and analysis in Section 10 is available at https://github.com/aaronjfisher/mcr-supplement.

A.2. Model Reliance Less than 1

While it is counterintuitive, it is possible for the expected loss of a prediction model to decrease when the information in X₁ is removed. Roughly speaking, a “pathological” model f_silly may use the information in X₁ to “intentionally” misclassify Y, such that e_switch(f_silly) < e_orig(f_silly) and MR(f_silly) < 1. The model f_silly may even be included in a population ϵ-Rashomon set (see Section 4) if it is still possible to predict Y sufficiently well from the information in X₂.

However, in these cases there will often exist another model that outperforms f_silly, and that has MR equal to 1 (i.e., no reliance on X₁). To see this, consider the case where $F = {f_{θ} : θ \in R^{d}}$ is indexed by a parameter θ. Let θ_silly and θ^⋆ be parameter values such that f_{θ_silly} is equivalent to f_silly, and f_θ^⋆ is the best-in-class model. If f_θ^⋆ satisfies MR(f_θ^⋆) > 1 and if the model reliance function MR is continuous in θ, then there exists a parameter value θ₁ between θ_silly and θ^⋆ such that MR(f_θ₁) = 1. Further, if the loss function L is convex in θ, then e_orig(f_θ^⋆) ≤ e_orig(f_silly), and any population ϵ-Rashomon set containing f_silly will also contain f_θ₁.

A.3. Relating ${\hat{e}}_{switch} (f)$ to All Possible Permutations of the Sample

Following the notation in Section 3, let [π₁,…, π_n!} be a set of n-length vectors, each containing a different permutation of the set {1,…,n}. We show in this section that ${\hat{e}}_{switch} (f)$ is equal to the product of

\sum_{l = 1}^{n!} \sum_{i = 1}^{n} L {f, (y_{[i]}, X_{1 [π_{l [i]}, \cdot]}, X_{2 [i, \cdot]})} 1 (π_{l [i]} \neq i),

(A.1)

and a proportionality constant that is only a function of n.

First, consider the sum

\sum_{l = 1}^{n!} \sum_{i = 1}^{n} L {f, (y_{[i]}, X_{1 [π_{l [i]}, \cdot]}, X_{2 [i, \cdot]})},

(A.2)

which omits the indicator function found in Eq A.1.

The summation in Eq A.2 contains n(n!) terms, each of which is a two-way combination of the form L{f, (y_[i], X_1[j,·], X_2[i,·])} for i,j ∈ {1, … n}. There are only n² unique combinations of this form, and each must occur in at least (n − 1)! of the n(n!) terms in Eq A.2. To see this, consider selecting two integer values $\tilde{i}, \tilde{j} \in {1, \dots n}$ , and enumerating all occurrences of the term $L {f, (y_{[\tilde{i}]}, X_{1 [\tilde{j}, \cdot]}, X_{2 [\tilde{i}, \cdot]})}$ within the sum in Eq A.2. Of the permutation vectors [π₁,…, π_n!}, we know that (n − 1)! of them place $\tilde{i}$ in the ${\tilde{j}}^{th}$ position, i.e., that satisfy $π_{l [\tilde{i}]} = \tilde{j}$ . For each such permutation π_l, the inner summation in Eq A.2 over all possible values of i must include the term $L {f, (y_{[\tilde{i}]}, X_{1 [π_{l [\tilde{i}]}, \cdot]}, X_{2 [\tilde{i}, \cdot]})} = L {f, (y_{[\tilde{i}]}, X_{1 [\tilde{j}, \cdot]}, X_{2 [\tilde{i}, \cdot]})}$ . Thus, Eq A.2 contains at least (n − 1)! occurrences of the term $L {f, (y_{[\tilde{i}]}, X_{1 [\tilde{j}, \cdot]}, X_{2 [\tilde{i}, \cdot]})}$ .

So far, we have shown that each unique combination occurs at least (n − 1)! times, but it also follows that each unique combination must occur precisely (n − 1)! times. This is because each of the n² unique combinations must occur at least (n − 1)! times, which accounts for n²((n − 1)!) = n(n!) terms in total. As noted above, Eq has A.2 has only n(n)! terms, so there can be no additional terms. We can then simplify Eq A.2 as

\sum_{l = 1}^{n!} \sum_{i = 1}^{n} L {f, (y_{[i]}, X_{1 [π_{l [i]}, \cdot]}, X_{2 [i, \cdot]})} = (n - 1)! \sum_{i = 1}^{n} \sum_{j = 1}^{n} L {f, (y_{[i]}, X_{1 [j, \cdot]}, X_{2 [i, \cdot]})} .

By the same logic, we can simplify Eq A.1 as

\begin{array}{l} \sum_{l = 1}^{n!} \sum_{i = 1}^{n} L {f, (y_{[i]}, X_{1 [π_{l [i]}, \cdot]}, X_{2 [i, \cdot]})} 1 (π_{l [i]} \neq i) \\ = (n - 1)! {\sum_{i = 1}^{n} \sum_{j = 1}^{n} L {f, (y_{[i]}, X_{1 [j, \cdot]}, X_{2 [i, \cdot]})} 1 (j \neq i)} \\ = (n - 1)! \sum_{i = 1}^{n} \sum_{j \neq 1}^{} L {f, (y_{[i]}, X_{1 [j, \cdot]}, X_{2 [i, \cdot]})}, \end{array}

and Line A.3 is proportional to ê_switch(f) up to a function of n.

A.4. Bound for MR of the Best-in-class Prediction Model

Although describing individual models is not the primary focus of this work, a corollary of Theorem 4 is that we can create a probabilistic bound for the reliance of the (unknown) best-in-class model f^⋆ on X₁.

Corollary 22 (Bound on Best-in-class MR) Let $f^{⋆} \in {arg min}_{f \in F} e_{orig} (f)$ be a prediction model that attains the lowest possible expected loss, and let f_+,ϵ and f_−,ϵ be defined as in Theorem 4· If f_+,ϵ and f_−,ϵ satisfy Assumptions 1, 2 and 3, then

P (M R (f^{⋆}) \in [{\hat{MCR}}_{-} (ϵ_{b e s t}) - Q_{b e s t}, {\hat{MCR}}_{+} (ϵ_{b e s t}) + Q_{b e s t}]) \geq 1 - δ,

where $ϵ_{b e s t} : = 2 B_{r e f} \sqrt{\frac{\log (6 δ^{- 1})}{2 n}},$ and $Q_{b e s t} : = \frac{B_{s w i t c h}}{b_{o r i g}} - \frac{B_{s w i t c h} - B_{i n d} \sqrt{\frac{\log (12 δ^{- 1})}{n}}}{b_{o r i g} + B_{i n d} \sqrt{\frac{\log (12 δ^{- 1})}{2 n}}} .$

The above result does not require that f^⋆ be unique. If several models achieve the minimum possible expected loss, the above boundaries apply simultaneously for each of them. In the special case when the true conditional expectation function $E (Y | X_{1}, X_{2})$ is equal to f^⋆, then we have a boundary for the reliance of the function $E (Y | X_{1}, X_{2})$ on X₁. This reliance bound can also be translated into a causal statement using Proposition 19.

A.5. Ratios versus Differences in MR Definition

We choose our ratio-based definition of model reliance, $M R (f) = \frac{e_{switch} (f)}{e_{orig} (f)}$ , so that the measure can be comparable across problems, regardless of the scale of Y. However, several existing works define VI measures in terms of differences (Strobl et al., 2008; Datta et al., 2016; Gregorutti et al., 2017), analogous to

M R_{difference} (f) : = e_{switch} (f) - e_{orig} (f) .

(A.4)

While this difference measure is less readily interpretable, it has several computational advantages. The mean, variance, and asymptotic distribution of the estimator ${\hat{M R}}_{difference} (f) : = {\hat{e}}_{switch} (f) - {\hat{e}}_{orig} (f)$ can be easily determined using results for U-statistics, without the use of the delta method (Dorfman, 1938; Lehmann and Casella, 2006; see also Ver Hoef, 2012). Estimates in the form of ${\hat{M R}}_{difference} (f)$ will also be more stable when $\min_{f \in F} e_{orig} (f)$ is small, relative to estimates for the ratio-based definition of MR. To improve interpretability, we may also normalize MR_difference(f) by dividing by the variance of Y, which can be easily estimated without the use of models, as in Williamson et al. (2017).

Under the difference-based definition for MR (Eq A.4), the results from Theorem 4, Theorem 6, and Corollary 22 will still hold under the following modified definitions of $Q_{out}$ , $Q_{in}$ , and $Q_{best}$ :

Q_{out, difference} : = (1 + \frac{1}{\sqrt{2}}) B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}},

Q_{in, difference} : = B_{ind} {\sqrt{\frac{\log (8 δ^{- 1} N (F, r \sqrt{2}))}{n}} + \sqrt{\frac{\log (8 δ^{- 1} N (F, r))}{2 n}}} + 2 r (\sqrt{2} + 1), and

Q_{best, difference} : = (1 + \frac{1}{\sqrt{2}}) B_{ind} \sqrt{\frac{\log (12 δ^{- 1})}{n}} .

Respectively replacing $Q_{out}$ , $Q_{in}$ , $Q_{best}$ , MR, and $\hat{M R}$ with $Q_{out,difference}$ , $Q_{in,difference}$ , $Q_{best,difference}$ , MR_difference and ${\hat{M R}}_{difference}$ entails only minor changes to the corresponding proofs (see Appendices B.3, B.5, and B.4). The results will also hold without Assumption 3, as is suggested by the fact that b_orig and B_switch do not appear in $Q_{out,difference}$ , $Q_{in,difference}$ , or $Q_{best,difference}$ .

We also prove an analogous version of Theorem 5, on uniform bounds for ${\hat{M R}}_{difference}$ , in Appendix B.5.1.

A.6. Rashomon Sets and Profile Likelihood Intervals

We note in Section 5.1 that, under certain conditions, the CIs returned from Proposition 7 take the same form as profile likelihood CIs (Coker et al., 2018). For completeness, we briefly review this connection. We assume here that models $f_{θ} \in F$ are indexed by a finite dimensional parameter vector θ ∈ Θ, where θ = (γ, ψ) contains a 1-dimensional parameter of interest $γ \in R^{1}$ , and a nuisance parameter ψ ∈ Ψ. We further assume and that e_orig(f_θ) is minimized by a unique parameter value θ^⋆ = (γ^⋆, ψ^⋆) ∈ Θ, and that our goal is to learn about γ^⋆.

If $s_{θ} : = \int_{Z} \exp {- L (f_{θ}, z)} d z$ is finite for all θ ∈ Θ, we can convert L into the likelihood function $L : (Z \times Θ) \to R^{1}$ satisfying $L (z; θ) = \exp {- L (f_{θ}, z)} / s_{θ}$ . As an abbreviation, let $L (Z; θ)$ denote $\prod_{i = 1}^{n} L (Z_{[i, \cdot]}; θ)$ . Additionally, let $\hat{θ} : = {arg min}_{θ \in Θ} {\hat{e}}_{orig} (f_{θ})$ be the empirical loss minimizer, and hence the maximum likelihood estimator of θ^⋆. If $L$ is indeed the correct likelihood function, then θ^⋆ = (γ^⋆, ψ^⋆) corresponds to the true parameter vector. Further, if ϕ(f_θ) = ϕ(f_{(γ, ψ)}) = γ returns the parameter element of interest (γ), then the (1 − δ)-level profile likelihood interval for ϕ(f_θ^⋆) = γ^⋆ is

PLI (δ) : = {γ : \log L (Z; \hat{θ}) - \log L (Z; {\hat{θ}}_{γ}) \leq \frac{X_{1, 1 - δ}}{2}, where {\hat{θ}}_{γ} = \underset{{θ \in Θ : ϕ (f_{θ}) = γ}}{arg max} L (Z; θ)} = {γ : \exists {\hat{θ}}_{γ} satisfying ϕ (f_{{\hat{θ}}_{γ}}) = γ and \log L (Z; \hat{θ}) - \log L (Z; {\hat{θ}}_{γ}) \leq \frac{X_{1, 1 - δ}}{2}} = {γ : \exists {\hat{θ}}_{γ} satisfying ϕ (f_{{\hat{θ}}_{γ}}) = γ and {\hat{e}}_{orig} (f_{{\hat{θ}}_{γ}}) \leq {\hat{e}}_{orig} (f_{\hat{θ}}) + \frac{X_{1, 1 - δ}}{2 n}} = {γ : \exists f_{{\hat{θ}}_{γ}} satisfying ϕ (f_{{\hat{θ}}_{γ}}) = γ and f_{{\hat{θ}}_{γ}} \in \hat{R} (\frac{X_{1, 1 - δ}}{2 n}, f_{\hat{θ}}, F)}

(A.5)

where $X_{1, 1 - δ}$ is the 1 − δ percentile of a chi-square distribution with 1 degree of freedom. If PLI(α) is indeed a contiguous interval, then maximizing and minimizing ϕ(f_θ) across models f_θ in the empirical Rashomon set in Eq A.5 yields the same interval.

A.7. Unbiased Estimates of CMR

We claim in Section 8.2 that both

{\hat{e}}_{match} (f) = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} \sum_{j \neq i} \frac{1 (X_{2 [j, \cdot]} = X_{2 [i, \cdot]})}{P (X_{2} = X_{2 [i, \cdot]})} \times L {f, (y_{[j]}, X_{1 [i, \cdot]}, X_{2 [j, \cdot]})} .

and

{\hat{e}}_{weight} (f) = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} \sum_{j \neq i} \frac{P (X_{1} = X_{1 [i, \cdot]} | X_{2} = X_{2 [j, \cdot]})}{P (X_{1} = X_{1 [i, \cdot]})} \times L {f, (y_{[j]}, X_{1 [i, \cdot]}, X_{2 [j, \cdot]})},

are unbiased for

e_{cond} (f) = E_{X_{2}} E [L {f, (Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)})} | X_{2}^{(a)} = X_{2}^{(b)}, X_{2}] .

To show that ê_match(f) is unbiased, we first note that each summation term in ê_match(f) has the same expectation. Following the notation in Section 3, let $Z^{(a)} = (Y^{(a)}, X_{1}^{(a)}, X_{2}^{(a)})$ and $Z^{(b)} = (Y^{(b)}, X_{1}^{(b)}, X_{2}^{(b)})$ be independent random variables following the same distribution as Z = (Y, X₁, X₂). The expectation of ê_match(f) is

\begin{array}{l} E {\hat{e}}_{match} (f) = E [\frac{1 (X_{2}^{(a)} = X_{2}^{(b)})}{p_{x_{2}} (X_{2}^{(a)})} \times L {f, (Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)})}] \\ = E_{X_{2}^{(a)}} E [\frac{1 (X_{2}^{(a)} = X_{2}^{(b)})}{p_{x_{2}} (X_{2}^{(a)})} \times L {f, (Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)})} | X_{2}^{(a)}] \\ = E_{X_{2}^{(a)}} {p_{x_{2}} (X_{2}^{(a)}) E [\frac{1}{p_{x_{2}} (X_{2}^{(a)})} \times L {f, (Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)})} | X_{2}^{(a)} = X_{2}^{(b)}, X_{2}^{(a)}] + 0} \\ = E_{X_{2}^{(a)}} E [L {f, (Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)})} | X_{2}^{(a)} = X_{2}^{(b)}, X_{2}^{(a)}] \\ = e_{cond} (f) . \end{array}

To show that ${\hat{e}}_{weight} (f)$ is unbiased, we similarly note that each summation term in ${\hat{e}}_{weight} (f)$ has the same expectation. Without loss of generality, we show the result for discrete variables (Y, X₁, X₂). Let $Y_{x_{2}}$ be the domain of Y conditional on the event that X₂ = x₂. The expectation of ${\hat{e}}_{weight} (f)$ is

\begin{array}{l} E {\hat{e}}_{weight} (f) = \sum_{x_{2}^{(b)} \in X_{2}} \sum_{y^{(b)} \in Y_{x_{2}^{(b)}}} \sum_{x_{1}^{(a)} \in X_{1}} [L {f, (y^{(b)}, x_{1}^{(a)}, x_{2}^{(b)})} {\frac{P (X_{1} = x_{1}^{(a)} | X_{2} = x_{2}^{(b)})}{P (X_{1} = x_{1}^{(a)})}} \\ \times P (X_{1} = x_{1}^{(a)}) P (Y = y^{(b)}, X_{2} = x_{2}^{(b)})] \\ = \sum_{x_{2}^{(b)} \in X_{2}} P (X_{2} = x_{2}^{(b)}) \sum_{y^{(b)} \in Y_{x_{2}^{(b)}}} \sum_{x_{1}^{(a)} \in X_{1}} [L {f, (y^{(b)}, x_{1}^{(a)}, x_{2}^{(b)})} \\ \times P (X_{1} = x_{1}^{(a)} | X_{2} = x_{2}^{(b)}) P (Y = y^{(b)} | X_{2} = x_{2}^{(b)})] \\ = E_{X_{2}^{(b)}} E [\int L {f, (Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)})} | X_{2}^{(a)} = X_{2}^{(b)}, X_{2}^{(b)})] \\ = e_{cond} (f) . \end{array}

Appendix B. Proofs for Statistical Results

We present proofs for our statistical results in this section, and conclude by presenting proofs for our computational results in Appendix C.

B.1. Lemma Relating Empirical and Population Rashomon Sets

Throughout the remaining proofs, it will be useful to express the definition of population ϵ-Rashomon sets in terms of the expectation of a single loss function, rather than a comparison of two loss functions. To do this, we simply introduce the “standardized” loss function $\tilde{L}$ , defined as

\tilde{L} (f, z) : = L (f, z) - L (f_{r e f}, z) .

(B.1)

Above, recall from Section 2 that L(f, z) denotes L(f, (y, x₁, x₂)) for z = (y, x₁, x₂). Because we assume f_ref is prespecified and fixed, we omit notation for f_ref in the definition of $\tilde{L}$ . We can now write

\begin{array}{l} R (ϵ) = {f_{ref}} ⋃ {f \in F : E L (f, Z) \leq E L (f_{ref}, Z) + ϵ} \\ = {f_{ref}} ⋃ {f \in F : E \tilde{L} (f, Z) \leq ϵ}, \end{array}

and, similarly,

\hat{R} (ϵ) = {f_{ref}} ⋃ {f \in F : \hat{E} \tilde{L} (f, Z) \leq ϵ} .

With this definition, the following lemma allows us to limit the probability that a given model $f_{1} \in \hat{R} (ϵ)$ is excluded from an empirical Rashomon set.

Lemma 23 For $ϵ \in R$ and δ ∈ (0, 1), let $ϵ_{1}^{'} : = ϵ + 2 B_{r e f} \sqrt{\frac{\log (δ^{- 1})}{2 n}}$ , and let $f_{1} \in R (ϵ)$ denote a specific, possibly unknown prediction model. If f₁ satisfies Assumption 2, then

P {f_{1} \in \hat{R} (ϵ_{1}^{'})} \geq 1 - δ .

Proof If f_ref and f₁ are the same function, then the result holds trivially. Otherwise, the proof follows from Hoeffding’s inequality (Theorem 2 of Hoeffding, 1963). First, note that if f₁ satisfies Assumption 2, then $\tilde{L} (f_{1})$ is bounded within an interval of length 2B_ref. Applying this in line B.3, below, we see that

\begin{matrix} P {f_{1} \notin \hat{R} (ϵ_{1}^{'})} = P [\hat{E} \tilde{L} (f_{1}, Z) > ϵ_{1}^{'}] & from f_{1} \notin {f_{ref}} \\ = P [\hat{E} \tilde{L} (f_{1}, Z) - ϵ > 2 B_{r e f} \sqrt{\frac{\log (δ^{- 1})}{2 n}}] & from definition of ϵ_{1}^{'} \end{matrix}

(B.2)

\begin{array}{l} \leq P [\hat{E} \tilde{L} (f_{1}, Z) - E \tilde{L} (f_{1}, Z) > 2 B_{r e f} \sqrt{\frac{\log (δ^{- 1})}{2 n}}] & from E \tilde{L} (f_{1}, Z) \leq ϵ \\ \leq \exp {- \frac{2 n}{{(2 B_{r e f})}^{2}} {[2 B_{r e f} \sqrt{\frac{\log (δ^{- 1})}{2 n}}]}^{2}} & from Hoeffding’s inequality \end{array}

(B.3)

= δ .

(B.4)

For the inequality used in Line B.3, see Theorem 2 of Hoeffding, 1963. ■

B.2. Lemma to Transform Between Bounds

The following lemma will help us translate from bounds for variables to bounds for differences and ratios of those variables. We will apply this lemma to transform from bounds on empirical losses to bounds on empirical model reliance, defined either in terms of a ratio or in terms of a difference.

Lemma 24 Let X, Z, μ_X ,μ_Z, k_X, $k_{Z} \in R$ be constants satisfying |Z − μ_Z| ≤ k_Z and |X − μ_X| ≤ k_X, then

| (Z - X) - (μ_{Z} - μ_{X}) | \leq q_{d i f f e r e n c e} (k_{Z}, k_{X}),

(B.5)

where q_difference is the function

q_{d i f f e r e n c e} (k_{Z}, k_{X}) : = k_{Z} + k_{X} .

(B.6)

Further, if there exists constants b_orig and B_switch such that 0 < b_orig ≤ X, μ_X and Z, μ_Z ≥ B_switch < ∞, then

| \frac{Z}{X} - \frac{μ_{Z}}{μ_{X}} | \leq q_{r a t i o} (k_{Z}, k_{X}),

(B.7)

where q_ratio is the function

q_{r a t i o} (k_{Z}, k_{X}) : = \frac{B_{s w i t c h}}{b_{o r i g}} - \frac{B_{s w i t c h} - k_{Z}}{b_{o r i g} + k_{X}} .

(B.8)

Proof Showing Eq B.5,

\begin{array}{l} | (Z - X) - (μ_{Z} - μ_{X}) | \leq | Z - μ_{Z} | + | μ_{X} - X | \\ \leq k_{Z} + k_{X} . \end{array}

Showing Eq B.7, let A_z = max(Z, μ_z), a_X = min(X, μ_X), d_Z = |Z − μ_Z|, and d_X = |X − μ_X|. This implies that max(X, μ_X) = a_X + d_X and min(Z, μ_Z) = A_Z − d_Z. Thus, $\frac{Z}{X}$ and $\frac{μ Z}{μ X}$ are both bounded within the interval

[\frac{\min (Z, μ_{Z})}{\max (X, μ_{X})}, \frac{\max (Z, μ_{Z})}{\min (X, μ_{X})}] = [\frac{A_{Z} - d_{Z}}{a_{X} + d_{X}}, \frac{A_{Z}}{a_{X}}],

which implies

| \frac{Z}{X} - \frac{μ_{Z}}{μ_{X}} | \leq \frac{A_{Z}}{a_{X}} - \frac{A_{Z} - d_{Z}}{a_{X} + d_{X}} .

(B.9)

Taking partial derivatives of the right-hand side, we get

\frac{\partial}{\partial a_{X}} (\frac{A_{Z}}{a_{X}} - \frac{A_{Z} - d_{Z}}{a_{X} + d_{X}}) = \frac{- A_{Z}}{a_{X}^{2}} + \frac{A_{Z} - d_{Z}}{{(a_{X} + d_{X})}^{2}} \leq 0,

\frac{\partial}{\partial A_{Z}} (\frac{A_{Z}}{a_{X}} - \frac{A_{Z} - d_{Z}}{a_{X} + d_{X}}) = \frac{1}{a_{X}} - \frac{1}{a_{X} + d_{X}} \geq 0,

\frac{\partial}{\partial d_{X}} (\frac{A_{Z}}{a_{X}} - \frac{A_{Z} - d_{Z}}{a_{X} + d_{X}}) = \frac{A_{Z} - d_{Z}}{{(a_{X} + d_{X})}^{2}} > 0,

and \frac{\partial}{\partial d_{Z}} (\frac{A_{Z}}{a_{X}} - \frac{A_{Z} - d_{Z}}{a_{X} + d_{X}}) = \frac{1}{a_{X} + d_{X}} > 0.

So the right-hand side of B.9 is maximized when d_Z, d_X, and A_Z are maximized, and when a_X is minimized. Thus, in the case where |Z − μ_Z| ≤ k_Z; |X − μ_X| ≤ k_X; 0 < b_orig ≤ X, μ_X ; and Z, μ_Z ≤ B_switch < ∞, we have

\begin{array}{l} | \frac{Z}{X} - \frac{μ Z}{μ_{X}} | \leq \frac{A_{Z}}{a_{X}} - \frac{A_{Z} - d_{Z}}{a_{X} + d_{X}} \\ \leq \frac{B_{switch}}{b_{orig}} - \frac{B_{switch} - k_{Z}}{b_{orig} + k_{X}} . \end{array}

B.3. Proof of Theorem 4

Proof We proceed in 4 steps.

B.3.1. Step 1: Show that $P [\hat{M R} (f_{+, ϵ}) \leq {\hat{M C R}}_{+} (ϵ_{OUT})] \geq 1 - \frac{δ}{3} .$

Consider the event that

\hat{M R} (f_{+, ϵ}) \leq {\hat{M C R}}_{+} (ϵ_{out}) .

(B.10)

Eq B.10 will always hold if $f_{+, ϵ} \in \hat{R} (ϵ_{out})$ , since ${\hat{M C R}}_{+} (ϵ_{out})$ upper bounds the empirical model reliance for models in $\hat{R} (ϵ_{out})$ by definition. Applying the above reasoning in Line B.11, below, we get

P [\hat{M R} (f_{+, ϵ}) > {\hat{M C R}}_{+} (ϵ_{out})] \leq P [f_{+, ϵ} \notin \hat{R} (ϵ_{out})]

(B.11)

\leq \frac{δ}{3} from ϵ_{out} definition and Lemma 23.

(B.12)

B.3.2. Step 2: Conditional on $\hat{M R} (f_{+, ϵ}) \leq {\hat{M C R}}_{+} (ϵ_{out})$ , Upper Bound MR(f_+,ϵ) by ${\hat{M C R}}_{+} (ϵ_{out})$ Added to an Error Term.

When Eq B.10 holds we have,

\begin{array}{l} \hat{M R} (f_{+, ϵ}) \leq {\hat{M C R}}_{+} (ϵ_{out}) \\ \hat{M R} (f_{+, ϵ}) \leq {\hat{M C R}}_{+} (ϵ_{out}) + {M R (f_{+, ϵ}) - M R (f_{+, ϵ})} \\ M R (f_{+, ϵ}) \leq {\hat{M C R}}_{+} (ϵ_{out}) + [M R (f_{+, ϵ}) - \hat{M R} (f_{+, ϵ})] . \end{array}

(B.13)

B.3.3. Step 3: Probabilistically Bound the Error Term from Step 2.

Next we show that the bracketed term in Line B.13 is less than or equal to $Q_{out}$ with high probability. For $k \in R$ , let q_diference and q_ratio be defined as in Eqs B.6 and B.8. Let $q : R \to R$ be the function such that $q (k) = q_{ratio} (k, \frac{k}{\sqrt{2}})$ . Then

\begin{array}{l} Q_{out} = \frac{B_{switch}}{b_{orig}} - \frac{B_{switch} - B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}}}{b_{orig} + B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{2 n}}} \\ = q_{ratio} (B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}}, B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{2 n}}) \\ = q (B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}}) . \end{array}

Applying this relation below, we have

\begin{array}{l} P [M R (f_{+, ϵ}) - \hat{M R} (f_{+, ϵ}) > Q_{out}] \\ \leq P [| M R (f_{+, ϵ}) - \hat{M R} (f_{+, ϵ}) | > q (B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}})] \\ \leq P [{| {\hat{e}}_{orig} (f_{+, ϵ}) - e_{orig} (f_{+, ϵ}) | > B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{2 n}}} \\ ⋃ {| {\hat{e}}_{s w i t c h} (f_{+, ϵ}) - e_{s w i t c h} (f_{+, ϵ}) | > B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}}}] & from Lemma 24 \\ \leq P [| {\hat{e}}_{orig} (f_{+, ϵ}) - e_{orig} (f_{+, ϵ}) | > B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{2 n}}] \\ + P [| {\hat{e}}_{s w i t c h} (f_{+, ϵ}) - e_{s w i t c h} (f_{+, ϵ}) | > B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{2 n}}] & from the Union bound \end{array}

(B.14)

\begin{array}{l} \leq 2 \exp {- \frac{2 n}{{(B_{ind} - 0)}^{2}} {[B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{2 n}}]}^{2}} \\ + 2 \exp {- \frac{n}{{(B_{ind} - 0)}^{2}} {[B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}}]}^{2}} & from Hoeffding's bound for U-statistics \end{array}

(B.15)

= \frac{2 δ}{6} + \frac{2 δ}{6} = \frac{2 δ}{3} .

(B.16)

In Line B.15, above, recall that ${\hat{e}}_{orig} (f_{+, ϵ})$ and ${\hat{e}}_{switch} (f_{+, ϵ})$ are both U-statistics. Note that $E [{\hat{e}}_{switch} (f_{+, ϵ})] = e_{switch} (f_{+, ϵ})$ because ${\hat{e}}_{switch} (f_{+, ϵ})$ is an average of terms, and each term has expectation equal to e_switch(f_+,ϵ). For the same reason, $E [{\hat{e}}_{orig} (f_{+, ϵ})] = e_{orig} (f_{+, ϵ})$ . This allows us to apply Eq 5.7 of Hoeffding, 1963 (see also Eq 1 on page 201 of Serfling, 1980, in Theorem A) to obtain Line B.15.

Alternatively, if we instead define model reliance as MR_difference(f) = e_switch(f) − e_orig(f) (see Appendix A.5), define empirical model reliance as ${\hat{M R}}_{difference} (f) : = {\hat{e}}_{switch} (f) - {\hat{e}}_{orig} (f)$ , and define

Q_{out, difference} : = (1 + \frac{1}{\sqrt{2}}) B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}} = q_{difference} (B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}}, B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{2 n}}),

then the same proof holds without Assumption 3 if we replace MR, $\hat{M R}, Q_{out}$ , respectively with MR_difference, ${\hat{M R}}_{difference}, Q_{out,difference}$ , and redefine $q : R \to R$ as the function $q (k) = q_{difference} (k, \frac{k}{\sqrt{2}})$ .

Eqs B.14–B.16 also hold if we replace ê_switch throughout with ê_divide, including in Assumption 3, since the same bound can be used for both ê_switch and ê_divide (Eq 5.7 of Hoeffding, 1963; see also Theorem A on page 201 of Serfling, 1980).

B.3.4. Step 4: Combine Results to Show Eq 4.2

Finally, we connect the above results to show Eq 4.2. We know from Eq B.12 that Eq B.10 holds with high probability. Eq B.10 implies Eq B.13, which bounds MCR₊(ϵ) = MR(f_+,ϵ) up to a bracketed residual term. We also know from Eq B.16 that, with high probability, the residual term in Eq B.13 is less than $Q_{o u t} = q (B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}})$ . Putting this together We can show Eq 4.2:

\begin{array}{l} P (M C R_{+} (ϵ) > {\hat{M C R}}_{+} (ϵ_{out}) + Q_{out}) \\ = P (M R (f_{+, ϵ}) > {\hat{M C R}}_{+} (ϵ_{out}) + Q_{out}) \\ \leq P [(\hat{M R} (f_{+, ϵ}) > {\hat{M C R}}_{+} (ϵ_{out})) ⋃ (M R (f_{+, ϵ}) - \hat{M R} (f_{+, ϵ}) > Q_{out})] & from Step 2 \\ \leq P [\hat{M R} (f_{+, ϵ}) > {\hat{M C R}}_{+} (ϵ_{out})] + P [M R (f_{+, ϵ}) - \hat{M R} (f_{+, ϵ}) > Q_{out}] \\ \leq \frac{δ}{3} + \frac{2 δ}{3} = δ . & from Steps 1 & 3 \end{array}

(B.17)

This completes the proof for Eq 4.2. For Eq 4.3 we can use the same approach, shown below for completeness. Analogous to Eq B.12, we have

P [\hat{M R} (f_{-, ϵ}) < {\hat{M C R}}_{-} (ϵ_{out})] \leq \frac{δ}{3} .

Analogous to Eq B.13, when $\hat{M R} (f_{-, ϵ}) \geq \hat{M R} (f_{-, ϵ_{out}})$ we have

\begin{array}{l} \hat{M R} (f_{-, ϵ}) \geq {\hat{M C R}}_{-} (ϵ_{out}) \\ \hat{M R} (f_{-, ϵ}) \geq {\hat{M C R}}_{-} (ϵ_{out}) + {M R (f_{-, ϵ}) - M R (f_{-, ϵ})} \\ \hat{M R} (f_{-, ϵ}) \geq {\hat{M C R}}_{-} (ϵ_{out}) - [\hat{M R} (f_{-, ϵ}) - M R (f_{-, ϵ})] . \end{array}

Analogous to Eq B.16, we have

P [\hat{M R} (f_{-, ϵ}) - M R (f_{-, ϵ}) > q (B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}})] \leq \frac{2 δ}{3} .

(B.18)

Finally, analogous to Eq B.17, we have

\begin{array}{l} P (M C R_{-} (ϵ) < {\hat{M C R}}_{-} (ϵ_{out}) - Q_{out}) \\ \leq P [(\hat{M R} (f_{-, ϵ}) < {\hat{M C R}}_{-} (ϵ_{out})) ⋃ (\hat{M R} (f_{-, ϵ}) - M R (f_{-, ϵ}) > q (B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}}))] \end{array}

(B.19)

\begin{array}{l} \leq P [\hat{M R} (f_{-, ϵ}) < {\hat{M C R}}_{-} (ϵ_{out})] + P [\hat{M R} (f_{-, ϵ}) - M R (f_{-, ϵ}) > q (B_{ind} \sqrt{\frac{\log (6 δ^{- 1})}{n}})] \\ \leq \frac{δ}{3} + \frac{2 δ}{3} = δ . \end{array}

Again, the same proof holds without Assumption 3 if we replace MR, $\hat{M R}, Q_{out}$ respectively with MR_difference, ${\hat{M R}}_{difference}, Q_{out,difference}$ , and redefine q as the function satisfying $q (k) = q_{difference} (k, \frac{k}{\sqrt{2}})$ in Eqs B.18 & B.19. ■

B.4. Proof of Corollary 22

Proof By definition, MR(f_{−,ϵ_best}) ≤ MR(f^⋆) ≤ MR(f_{+,ϵ_best}). Applying this relation in Line B.20, below, we see

\begin{array}{l} P (M R (f^{⋆}) \in [{\hat{M C R}}_{-} (ϵ_{best}) - Q_{best}, {\hat{M C R}}_{+} (ϵ_{best}) + Q_{best}]) \\ = 1 - P {M R (f^{⋆}) < {\hat{M C R}}_{-} (ϵ_{best}) - Q_{best} ⋃ M R (f^{⋆}) > {\hat{M C R}}_{+} (ϵ_{best}) + Q_{best}} \\ \geq 1 - P {M R (f^{⋆}) < {\hat{M C R}}_{-} (ϵ_{best}) - Q_{best}} - P {M R (f^{⋆}) > {\hat{M C R}}_{+} (ϵ_{best}) + Q_{best}} \\ \geq 1 - P {M R (f_{-, ϵ}) < {\hat{M C R}}_{-} (ϵ_{best}) - Q_{best}} - P {M R (f_{+, ϵ}) > {\hat{M C R}}_{+} (ϵ_{best}) + Q_{best}} \end{array}

(B.20)

\begin{array}{l} \geq 1 - \frac{δ}{2} - \frac{δ}{2} & from Theorem 4. \end{array}

(B.21)

To apply Theorem 4 in Line B.21, above, we note that $Q_{best}$ and ϵ_best are equivalent to the definitions of $Q_{out}$ and ϵ_out in Theorem 4, but with δ replaced by $\frac{δ}{2}$ .

Alternatively, if we define model reliance as MR_difference(f) = e_switch(f) − e_orig(f) (see Appendix A.5), and define empirical model reliance as ${\hat{M R}}_{difference} (f) = {\hat{e}}_{switch} (f) - {\hat{e}}_{orig} (f)$ , then let

Q_{best, difference} : = (1 + \frac{1}{\sqrt{2}}) B_{ind} \sqrt{\frac{\log (12 δ^{- 1})}{n}} .

The term $Q_{best,difference}$ is equivalent to $Q_{out,difference}$ but with δ replaced with $\frac{δ}{2}$ . Under this difference-based definition of model reliance, Theorem 4 holds without Assumption 3 if we replace $Q_{out}$ with $Q_{out,difference}$ (see Section B.3), and so we can apply this altered version of Theorem 4 in Line B.21. Thus, Theorem 22 also holds without Assumption 3 if we replace MR, $\hat{M R}$ , and $Q_{best}$ respectively with MR_difference, ${\hat{M R}}_{difference}$ , and $Q_{best,difference}$ . ■

B.5. Proof of Theorems 5 & 6

We begin by proving Theorem 5, along with related results. We then apply these results to show Theorem 6.

B.5.1. Proof of Theorem 5, and Other Limits on Estimation Error, Based on Covering Number

The following theorem uses the covering number based on r-margin-expectation-covers to jointly bound empirical losses for any function $f \in F$ . Theorem 5 in the main text follows directly from Eq B.25, below.

Theorem 25 If Assumptions 1, 2 and 3 hold for all $f \in F$ , then for any r > 0

P_{D} [\sup_{f \in F} | {\hat{e}}_{orig} (f) - e_{orig} (f) | > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{2 n}} + 2 r] \leq δ,

(B.22)

P_{D} [\sup_{f \in F} | \hat{E} \tilde{L} (f, Z) - E \tilde{L} (f, Z) | > 2 B_{r e f} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{2 n}} + 2 r] \leq δ,

(B.23)

P_{D} [\sup_{f \in F} | {\hat{e}}_{switch} (f) - e_{switch} (f) | > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{n}} + 2 r] \leq δ,

(B.24)

P [\sup_{f \in F} | \frac{{\hat{e}}_{orig} (f)}{{\hat{e}}_{switch} (f)} - \frac{e_{orig} (f)}{e_{switch} (f)} | > Q_{4}] \leq δ,

(B.25)

P_{D} [\sup_{f \in F} | {{\hat{e}}_{switch} (f) - {\hat{e}}_{orig} (f)} - {e_{switch} (f) - e_{orig} (f)} | > Q_{4, difference}] \leq δ,

(B.26)

where

Q_{4} : = q_{ratio} (B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r \sqrt{2}))}{n}} + 2 r \sqrt{2}, B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} + 2 r),

(B.27)

Q_{4, difference} : = q_{difference} (B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r \sqrt{2}))}{n}} + 2 r \sqrt{2}, B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} + 2 r),

(B.28)

and q_ratio and q_difference are defined as in Lemma 24. For Eq B.26, the result is unaffected if we remove Assumption 3.

B.5.2. Proof of Eq B.22

Proof Let $G_{r}$ be a r-margin-expectation-cover for $F$ of size $N (F, r)$ . Let D_p denote the population distribution, let D_s be the sample distribution, and let D^⋆ be the uniform mixture of D_p and D_s, i.e., for any $z \in Z$ ,

P_{D^{⋆}} (Z \leq z) = \frac{1}{2} P_{D_{p}} (Z \leq z) + \frac{1}{2} P_{D_{s}} (Z \leq z) .

(B.29)

Unless otherwise stated, we take expectations and probabilities with respect to D_p. Since $G_{r}$ is a r-margin-expectation-cover, we know that for any $f \in F$ we can find a function $g \in G_{r}$ such that $E_{D^{⋆}} | L (g, Z) - L (f, Z) | = E_{D^{⋆}} | \tilde{L} (g, Z) - \tilde{L} (f, Z) | \leq r,$ and

\begin{array}{l} | \hat{E} L (f, Z) - E L (f, Z) | = | \hat{E} L (f, Z) - E L (f, Z) + {\hat{E} L (g, Z) - \hat{E} L (g, Z)} + {E L (g, Z) - E L (g, Z)} | \end{array}

(B.30)

\begin{array}{l} \leq | \hat{E} L (g, Z) - E L (g, Z) | + | E L (g, Z) - E L (f, Z) | + | \hat{E} L (f, Z) - \hat{E} L (g, Z) | \\ \leq | \hat{E} L (g, Z) - E L (g, Z) | + E_{D_{p}} | L (g, Z) - L (f, Z) | + E_{D_{s}} | L (f, Z) - L (g, Z) | \\ = | \hat{E} L (g, Z) - E L (g, Z) | + 2 E_{D^{⋆}} | L (g, Z) - L (f, Z) | \\ \leq | \hat{E} L (g, Z) - E L (g, Z) | + 2 r . \end{array}

Applying the above relation in Line B.31 below, we have

\begin{array}{l} P (\sup_{f \in F} | \hat{E} L (f, Z) - E L (f, Z) | > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{2 n}} + 2 r) \\ = P (\exists f \in F : | \hat{E} L (f, Z) - E L (f, Z) | > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{2 n}} + 2 r) \\ \leq P (\exists g ϵ G_{r} : | \hat{E} L (g, Z) - E L (g, Z) | + 2 r > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{2 n}} + 2 r) \end{array}

(B.31)

\begin{array}{l} = P (⋃_{g ϵ G_{r}} | \hat{E} L (g, Z) - E L (g, Z) | > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{2 n}}) \\ \leq \sum_{g ϵ G_{r}} P (| \hat{E} L (g, Z) - E L (g, Z) | > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{2 n}}) & from the Union bound \\ \leq N (F, r) 2 \exp [- \frac{2 n}{{(B_{ind})}^{2}} {B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{2 n}}}^{2}] & from Hoeffding’s inequality \end{array}

(B.32)

\begin{array}{l} = δ . \end{array}

(B.33)

To apply Hoeffding’s inequality (Theorem 2 of Hoeffding, 1963) in Line B.32, above, we use the fact that L(g, Z) is bounded within an interval of length B_ind. ■

B.5.3. Proof of Eq B.23

Proof The proof for Eq B.23 is nearly identical to the proof for Eq B.22. Simply replacing L and B_ind respectively with $\tilde{L}$ and (2B_ref) in Eqs B.30–B.33 yields a valid proof for Eq B.23. ■

B.5.4. Proof of Eq B.24

Proof Let F_D denote the cumulative distribution function for a distribution D. Let ${\tilde{D}}_{s}$ be the distribution such that

F_{{\tilde{D}}_{p}} (Y = y, X_{1} = x_{1}, X_{2} = x_{2}) = F_{D_{p}} (Y = y, X_{2} = x_{2}) F_{D_{p}} (X_{1} = x_{1}) .

Let ${\tilde{D}}_{p}$ be the distribution satisfying

P_{{\tilde{D}}_{s}} (Y = y, X_{1} = x_{1}, X_{2} = x_{2}) = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} \sum_{j \neq 1} 1 (y_{[j]} = y, X_{1 [i, \cdot]} = x_{1}, X_{2 [j, \cdot]} = x_{2}) .

Let ${\tilde{D}}^{⋆}$ be the uniform mixture of ${\tilde{D}}_{p}$ and ${\tilde{D}}_{s}$ , as in Eq B.29. Replacing e_orig, ê_orig, D_p, D_s, and ${\tilde{D}}^{⋆}$ respectively with e_switch, ê_switch, ${\tilde{D}}_{p}$ , ${\tilde{D}}_{s}$ , and ${\tilde{D}}^{⋆}$ , we can follow the same steps as in the proof for Eq B.22. For any $f \in F$ , we know that there exists a function $g \in G_{r}$ satisfying $E_{{\tilde{D}}^{⋆}} | L (g, Z) - L (f, Z) | \leq r,$ which implies

\begin{array}{l} | {\hat{e}}_{switch} (f) - e_{switch} (f) | = | {\hat{e}}_{switch} (f) - e_{switch} (f) + {{\hat{e}}_{switch} (g) - {\hat{e}}_{switch} (g)} + {e_{switch} (g) - e_{switch} (g)} | \\ \leq | {\hat{e}}_{switch} (g) - e_{switch} (g) | + E_{{\tilde{D}}_{p}} | L (g, Z) - L (f, Z) | + E_{{\tilde{D}}_{s}} | L (f, Z) - L (g, Z) | \\ = | {\hat{e}}_{switch} (g) - e_{switch} (g) | + 2 E_{{\tilde{D}}^{⋆}} | L (g, Z) - L (f, Z) | \\ \leq | {\hat{e}}_{switch} (g) - e_{switch} (g) | + 2 r . \end{array}

As a result,

\begin{array}{l} P (\sup_{f \in F} | {\hat{e}}_{switch} (f) - e_{switch} (f) | > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{n}} + 2 r) \\ \leq P (\exists g \in G_{r} : | {\hat{e}}_{switch} (g) - e_{switch} (g) | + 2 r > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{n}} + 2 r) \\ \leq \sum_{g \in G_{r}} P (| {\hat{e}}_{switch} (g) - e_{switch} (g) | > B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{n}}) \\ \leq N (F, r) 2 \exp [- \frac{n}{{(B_{ind} - 0)}^{2}} {B_{ind} \sqrt{\frac{\log (2 δ^{- 1} N (F, r))}{n}}}^{2}] \\ = δ . \end{array}

(B.34)

In Line B.34, above, we apply Eq 5.7 of Hoeffding, 1963 (see also Eq 1 on page 201 of Serfling, 1980, in Theorem A), in the same way as in Eq B.15. ■

B.5.5. Proof for Eq B.25

Proof We apply Lemma 24 and Eq B.27 in Line B.36, below, to obtain

\begin{array}{l} P [\sup_{f \in F} | \frac{{\hat{e}}_{orig} (f)}{{\hat{e}}_{switch} (f)} - \frac{e_{orig} (f)}{e_{switch} (f)} | > Q_{4}] \end{array}

(B.35)

\begin{array}{l} = P (\exists f \in F : | \frac{{\hat{e}}_{orig} (f)}{{\hat{e}}_{switch} (f)} - \frac{e_{orig} (f)}{e_{switch} (f)} | > Q_{4}) \end{array}

\begin{array}{l} \leq P ({\exists f \in F : | {\hat{e}}_{orig} (f) - e_{orig} (f) | > B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} + 2 r} \end{array}

(B.36)

\begin{array}{l} ⋃ {\exists f \in F : | {\hat{e}}_{switch} (f) - e_{switch} (f) | > B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r \sqrt{2}))}{n}} + 2 r \sqrt{2}}) \end{array}

\begin{array}{l} = P (\sup_{f \in F} | {\hat{e}}_{orig} (f) - e_{orig} (f) | > B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} + 2 r) \\ + P (\sup_{f \in F} | {\hat{e}}_{switch} (f) - e_{switch} (f) | > B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r \sqrt{2}))}{n}} + 2 r \sqrt{2}) \end{array}

\begin{array}{l} \leq \frac{δ}{2} + \frac{δ}{2} . & from Eqs B.22 and B.24 \end{array}

(B.37)

■

B.5.6. Proof for Eq B.26

Proof Finally, to show B.26, we apply the same steps as in Eqs B.35 through B.37. We apply Eq B.28 & Lemma 24 to obtain

\begin{array}{l} P [\sup_{f \in F} | {{\hat{e}}_{switch} (f) - {\hat{e}}_{orig} (f)} - {e_{switch} (f) - e_{orig} (f)} | > Q_{4, difference}] \\ \leq P ({\exists f \in F : | {\hat{e}}_{orig} (f) - e_{orig} (f) | > B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} + 2 r} \\ ⋃ {\exists f \in F : | {\hat{e}}_{switch} (f) - e_{switch} (f) | > B_{ind} \sqrt{\frac{\log (4 δ^{- 1} N (F, r \sqrt{2}))}{2 n}} + 2 r \sqrt{2}}) \\ \leq \frac{δ}{2} + \frac{δ}{2} . \end{array}

■

B.5.7. Implementing Theorem 25 to Show Theorem 6

Proof Consider the event that

\exists {\hat{f}}_{+, ϵ_{i n}} \in \underset{f \in \hat{R} (ϵ_{i n})}{arg max} \hat{M R} (f) such that M C R_{+} (ϵ) < M R ({\hat{f}}_{+, ϵ_{i n}}) .

(B.38)

A brief outline of our proof for Eq 4.6 is as follows. We expect Eq B.38 to be unlikely due to the fact that ϵ_in < ϵ. If Eq B.38 does not hold, then the only way that ${MCR}_{+} (ϵ) < {\hat{M C R}}_{+} (ϵ_{in}) - Q_{in}$ holds is if there exists ${\hat{f}}_{+, ϵ i n} \in {arg max}_{f \in \hat{R} (ϵ_{i n})} \hat{M R} (f)$ which has an empirical MR that differs from its population-level MR by at least $Q_{in}$ .

To show that Eq B.38 is unlikely, we apply Theorem 25:

\begin{array}{l} P (\exists {\hat{f}}_{+, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg max} \hat{M R} (f) : M C R_{+} (ϵ) < M R ({\hat{f}}_{+, ϵ_{in}})) \\ \leq P (\exists f \in \hat{R} (ϵ_{in}) : M C R_{+} (ϵ) < M R (f)) \\ = P (\exists f \in \hat{R} (ϵ_{in}) \ f_{ref} : M C R_{+} (ϵ) < M R (f)) & by M C R_{+} (ϵ) \geq M R (f_{ref}) \\ \leq P (\exists f \in \hat{R} (ϵ_{in}) \ f_{ref} : E \tilde{L} (f, Z) > ϵ) & by M C R_{+} (ϵ) Def \\ = P (\exists f \in F, E \tilde{L} (f, Z) > ϵ : \hat{E} \tilde{L} (f, Z) \leq ϵ_{in}) & by \hat{R} (ϵ) Def \\ = P (\exists f \in F, E \tilde{L} (f, Z) > ϵ : \\ \hat{E} \tilde{L} (f, Z) - ϵ \leq - 2 B_{ref} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} - 2 r) & by ϵ_{in} Def \end{array}

(B.39)

\begin{array}{l} \leq P (\exists f \in F, E \tilde{L} (f, Z) > ϵ : \\ \hat{E} \tilde{L} (f, Z) - E \tilde{L} (f, Z) \leq - 2 B_{ref} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} - 2 r) \\ \leq P (\sup_{f \in F} | \hat{E} \tilde{L} (f, Z) - E \tilde{L} (f, Z) | \geq 2 B_{ref} \sqrt{\frac{\log (4 δ^{- 1} N (F, r))}{2 n}} + 2 r) \\ = \frac{δ}{2} & by Thm 25. \end{array}

(B.40)

If Eq B.38 does not hold, we have

\begin{array}{l} M C R_{+} (ϵ) \geq M R ({\hat{f}}_{+, ϵ_{in}}) & for all {\hat{f}}_{+, ϵ_{in}} \in \underset{f ϵ \hat{R} (ϵ_{in})}{arg max} \hat{M R} (f) \\ = \hat{M R} ({\hat{f}}_{+, ϵ_{in}}) - {\hat{M R} ({\hat{f}}_{+, ϵ_{in}}) - M R ({\hat{f}}_{+, ϵ_{in}})} & for all {\hat{f}}_{+, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg max} \hat{M R} (f) \\ = {\hat{M C R}}_{+} (ϵ_{in}) - {\hat{M R} ({\hat{f}}_{+, ϵ_{in}}) - M R ({\hat{f}}_{+, ϵ_{in}})} & for all {\hat{f}}_{+, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg max} \hat{M R} (f) \\ \geq {\hat{M C R}}_{+} (ϵ_{in}) - \sup_{f \in F} | \hat{M R} (f) - M R (f) | . \end{array}

(B.41)

Let q_ratio and q_difference be defined as in Lemma 24. Then

\begin{array}{l} Q_{in} = \frac{B_{switch}}{b_{orig}} - \frac{B_{switch} - {B_{ind} \sqrt{\frac{\log (8 δ^{- 1} N (F, r \sqrt{2}))}{n}} + 2 r \sqrt{2}}}{b_{orig} + {B_{ind} \sqrt{\frac{\log (8 δ^{- 1} N (F, r))}{2 n}} + 2 r}} \\ = q_{ratio} (B_{ind} \sqrt{\frac{\log (8 δ^{- 1} N (F, r \sqrt{2}))}{n}} + 2 r \sqrt{2}, B_{ind} \sqrt{\frac{\log (8 δ^{- 1} N (F, r))}{2 n}} + 2 r) \end{array}

(B.42)

Theorem 25 implies that the sup term in Eq B.41 is less than $Q_{in}$ with probability at least $1 - \frac{δ}{2}$ . Now, examining the left-hand side of Eq 4.6, we see

\begin{array}{l} P (M C R_{+} (ϵ) < {\hat{M C R}}_{+} (ϵ_{in}) - Q_{in}) \\ \leq P [{\exists {\hat{f}}_{+, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg max} \hat{M R} (f) : M C R_{+} (ϵ) < M R ({\hat{f}}_{+, ϵ_{in}})} \\ ⋃ {\sup_{f \in F} | \hat{M R} (f) - M R (f) | > Q_{in}}] & from Eq B.41 \\ \leq P [\exists {\hat{f}}_{+, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg max} \hat{M R} (f) : M C R_{+} (ϵ) < M R ({\hat{f}}_{+, ϵ_{in}})] \\ + P [\sup_{f \in F} | \hat{M R} (f) - M R (f) | > Q_{in}] & from the Union bound \\ = \frac{δ}{2} + \frac{δ}{2} & from Eq B.40, Eq B.42, & Theorem 25. \end{array}

(B.43)

This completes the proof for Eq 4.6.

Alternatively, if we have defined model reliance as MR(f) = e_switch(f) − e_orig(f) (see Appendix A.5), with $\hat{M R} (f) = {\hat{e}}_{switch} (f) - {\hat{e}}_{orig} (f)$ , and

\begin{array}{l} Q_{in, difference} = B_{ind} {\sqrt{\frac{\log (8 δ^{- 1} N (F, r \sqrt{2}))}{n}} + \sqrt{\frac{\log (8 δ^{- 1} N (F, r))}{2 n}}} + 2 r (\sqrt{2} + 1) \\ = q_{difference} (B_{ind} \sqrt{\frac{\log (8 δ^{- 1} N (F, r \sqrt{2}))}{n}} + 2 r \sqrt{2}, B_{ind} \sqrt{\frac{\log (8 δ^{- 1} N (F, r))}{2 n}} + 2 r), \end{array}

then same proof of Eq 4.6 holds without Assumption 3 if we replace $Q_{in}$ with $Q_{in, difference}$ , and apply Eq B.26 in Eq B.43.

For Eq 4.7 we can use the same approach. Consider the event that

\exists {\hat{f}}_{-, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg min} \hat{M R} (f) : M C R_{-} (ϵ) > M R ({\hat{f}}_{-, ϵ_{in}}) .

(B.44)

Applying steps analogous to those used to derive Eq B.40, we have

\begin{array}{l} P (\exists {\hat{f}}_{-, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg min} \hat{M R} (f) : M C R_{-} (ϵ) > M R ({\hat{f}}_{-, ϵ_{in}})) \\ \leq P (\exists f \in F, E \tilde{L} (f, Z) > ϵ : \hat{E} \tilde{L} (f, Z) \leq ϵ_{in}) \leq \frac{δ}{2} . \end{array}

Analogous to B.41, when Eq B.44 does not hold, we have have

\begin{array}{l} M C R_{-} (ϵ) \leq M R ({\hat{f}}_{-, ϵ_{in}}) & for all {\hat{f}}_{-, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg min} \hat{M R} (f) \\ = \hat{M R} ({\hat{f}}_{-, ϵ_{in}}) + {M R ({\hat{f}}_{-, ϵ_{in}}) - \hat{M R} ({\hat{f}}_{-, ϵ_{in}})} & for all {\hat{f}}_{-, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg min} \hat{M R} (f) \\ = {\hat{M C R}}_{-} (ϵ_{in}) + {M R ({\hat{f}}_{-, ϵ_{in}}) - \hat{M R} ({\hat{f}}_{-, ϵ_{in}})} & for all {\hat{f}}_{-, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg min} \hat{M R} (f) \\ \leq {\hat{M C R}}_{-} (ϵ_{in}) + \sup_{f \in F} | M R (f) - \hat{M R} (f) | \end{array}

Finally, analogous to Eq B.43,

\begin{array}{l} P (M C R_{-} (ϵ) > \hat{M C R} (ϵ_{in}) + Q_{in}) \\ \leq P [{\exists {\hat{f}}_{-, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg min} \hat{M R} (f) : M C R_{-} (ϵ) > M R ({\hat{f}}_{-, ϵ_{in}})} \\ ⋃ {\sup_{f \in F} | \hat{M R} (f) - M R (f) | > Q_{in}}] \\ \leq P [\exists {\hat{f}}_{-, ϵ_{in}} \in \underset{f \in \hat{R} (ϵ_{in})}{arg min} \hat{M R} (f) : M C R_{-} (ϵ) > M R ({\hat{f}}_{-, ϵ_{in}})] \\ + P [\sup_{f \in F} | \hat{M R} (f) - M R (f) | > Q_{in}] \\ = \frac{δ}{2} + \frac{δ}{2} . \end{array}

(B.45)

Under the difference-based definition of model reliance (see Appendix A.5), the same proof for Eq 4.7 holds without Assumption 3 if we replace MR, $\hat{M R}$ , & $Q_{in}$ respectively with MR_difference, ${\hat{M R}}_{difference}$ , & $Q_{in,difference}$ , and apply Eq B.26 in Eq B.45. ■

B.6. Proof of Proposition 7, and Corollary for a Unique Best-in-class Model.

We first introduce a lemma to describe the performance of any individual model in the population ϵ-Rashomon set.

Lemma 26 Let $ϵ_{1}^{'} : = 2 B_{r e f} \sqrt{\frac{\log (δ^{- 1})}{2 n}}$ , and let the functions ${\hat{ϕ}}_{-}$ and ${\hat{ϕ}}_{+}$ be defined as in Proposition 7. Given a function $f_{1} \in R (ϵ)$ , if Assumption 2 holds for f₁, then

P {ϕ (f_{1}) \in [\hat{ϕ} - (ϵ_{1}^{'}), \hat{ϕ} + (ϵ_{1}^{'})]} \geq 1 - δ .

Proof Consider the event that

ϕ (f_{1}) \in [\hat{ϕ} - (ϵ_{1}^{'}), \hat{ϕ} + (ϵ_{1}^{'})] .

(B.46)

Eq B.46 will always hold if $f_{1} \in \hat{R} (ϵ_{1}^{'})$ , since the interval $[\hat{ϕ} - (ϵ_{1}^{'}), \hat{ϕ} + (ϵ_{1}^{'})]$ contains ϕ(f) for any $f \in \hat{R} (ϵ_{1}^{'})$ by definition. Thus,

\begin{array}{l} P {ϕ (f_{1}) \notin [[\hat{ϕ} - (ϵ_{1}^{'}), {\hat{ϕ}}_{+} (ϵ_{1}^{'})]} \leq P {f_{1} \notin \hat{R} (ϵ_{1}^{'})} \\ \leq δ & from Lemma 23. \end{array}

■

B.6.1. Proof of Proposition 7

Proof Let $f_{-, ϵ, ϕ} \in {arg min}_{f \in R (ϵ)} ϕ (f)$ and $f_{+, ϵ, ϕ} \in {arg min}_{f \in R (ϵ)} ϕ (f)$ respectively denote functions that attain the lowest and highest values of ϕ(f) among models $f \in R (ϵ)$ . Applying the definitions of f_−,ϵ,ϕ and f_+,ϵ,ϕ in Line B.47, below, we have

\begin{array}{l} P ({ϕ (f) : f \in R (ϵ)} ⊄ [{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'})]) \\ = P ([ϕ (f_{-, ϵ, ϕ}), ϕ (f_{+, ϵ, ϕ})] ⊄ [{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'})]) \\ = P (ϕ (f_{-, ϵ, ϕ}) \notin [{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'})]) ⋃ ϕ (f_{+, ϵ, ϕ}) \notin [{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'})]) \\ \leq P (ϕ (f_{-, ϵ, ϕ}) \notin [{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'})]) + P (ϕ (f_{-, ϵ, ϕ}) \notin [{\hat{ϕ}}_{-} (ϵ^{'}), {\hat{ϕ}}_{+} (ϵ^{'})]) \\ \leq \frac{δ}{2} + \frac{δ}{2} = δ from Lemma 26, and the definition of ϵ^{'} = ϵ + 2 B_{ref} \sqrt{\frac{\log (2 δ^{- 1})}{2 n}} . \end{array}

(B.47)

■

B.6.2. Corollary for a Unique Best-in-Class Model

When the best-in-class model is unique, it can be described by the corollary below.

Corollary 27 Let ${\hat{ϕ}}_{-} (ϵ_{0}^{'}) : = \min_{f \in \hat{R} (ϵ_{1}^{'})} ϕ (f)$ and ${\hat{ϕ}}_{+} (ϵ_{1}^{'}) : = \max_{f \in \hat{R} (ϵ_{1}^{'})} ϕ (f)$ , where $ϵ_{0}^{'} : = 2 B_{r e f} \sqrt{\frac{\log (δ^{- 1})}{2 n}}$ . Let $f^{⋆} \in {arg min}_{f \in F} e_{orig} (f)$ be the prediction model that uniquely attains the lowest possible expected loss. If f^⋆ satisfies Assumption 2, then

P {ϕ (f^{⋆}) \in [{\hat{ϕ}}_{-} (ϵ_{1}^{'}), {\hat{ϕ}}_{+} (ϵ_{1}^{'})]} \geq 1 - δ .

Proof Since $f^{⋆} \in R (0)$ , Corollary 27 follows immediately from Lemma 26.

Notice that by assuming f^⋆ is unique, we can use the threshold $ϵ_{0}^{'} : = 2 B_{ref} \sqrt{\frac{\log (δ^{- 1})}{2 n}}$ , which is lower than the threshold of $ϵ^{'} = ϵ + 2 B_{ref} \sqrt{\frac{\log (2 δ^{- 1})}{2 n}}$ with ϵ = 0, as in Proposition 7. In this way, assuming uniqueness allows a stronger statement than the one in Proposition 7. ■

B.7. Absolute Losses versus Relative Losses in the Definition of the Rashomon Set

In this paper we primarily define Rashomon sets as the models that perform well relative to a reference model f_ref. We can also study an alternate formulation of Rashomon sets by replacing the relative loss $\tilde{L}$ with the non-standardized loss L throughout. This results in a new interpretation of the Rashomon set $R (ϵ_{abs}, f_{ref}, F) = {f_{ref}} \cup {f \in F : E L (f, Z) \leq ϵ_{abs}}$ as the union of f_ref and the subset of models with absolute loss L no higher than ϵ_abs, for ϵ_abs > 0. The process of computing empirical MCR is largely unaffected by whether L or $\tilde{L}$ is used, as it is simple to transform from one optimization problem to the other.

We still require the explicit inclusion of f_ref in empirical and population Rashomon sets to ensure that they are nonempty. However, in many cases, this inclusion becomes redundant when interpreting a Rashomon set (e.g., when ϵ ≥ 0, and $E L (f_{ref}, Z) \leq ϵ_{abs}}$ .

Under the replacement of $\tilde{L}$ with L, we also replace Assumption 2 with Assumption 1 (whenever this is not redundant), and replace 2B_ref with B_ind in the definitions of ϵ_out, ϵ_best, ϵ_in, ϵ′ and $ϵ_{1}^{'}$ in Theorem 4, Corollary 22, Theorem 6, Proposition 7, and Corollary 27. This is because the motivation for the 2B_ref term is that $\tilde{L} (f_{1})$ is bounded within an interval of length 2B_ref when f₁ satisfies Assumption 2. However, under Assumption 1, L(f₁) is bounded within an interval of length B_ind.

B.8. Proof of Proposition 15

Proof To show Eq 7.1 we start with e_orig(f_β),

\begin{array}{l} e_{orig} (f_{β}) = E [{Y - X_{1}^{'} β_{1} - X_{2}^{'} β_{2}}^{2}] \\ = E [{(Y - X_{2}^{'} β_{2}) - X_{1}^{'} β_{1}}^{2}] \\ = E [{(Y - X_{2}^{'} β_{2})}^{2}] - 2 E [(Y - X_{2}^{'} β_{2}) X_{1}^{'}] β_{1} + {β^{'}}_{1} E [X_{1} X_{1}^{'}] β_{1} . \end{array}

For e_switch(f_β), we can follow the same steps as above:

\begin{array}{l} e_{switch} (f_{β}) = E_{Y^{(b)}, X_{1}^{(a)}, X_{2}^{(b)}} [{Y^{(b)} - X_{1}^{{(a)}^{'}} β_{1} - X_{2}^{{(b)}^{'}} β_{2}}^{2}] \\ = E [{(Y^{(b)} - X_{2}^{{(b)}^{'}} β_{2})}^{2}] - 2 E [{(Y^{(b)} - X_{2}^{{(b)}^{'}} β_{2})}^{2}] E [X_{1}^{{(a)}^{'}}] β_{1} + {β^{'}}_{1} E [X_{1}^{(a)} X_{1}^{{(a)}^{'}}] β_{1} . \end{array}

Since $(Y^{(b)}, X_{1}^{(b)}, X_{2}^{(b)})$ and $(Y^{(a)}, X_{1}^{(a)}, X_{2}^{(a)})$ each have the same distribution as (Y, X₁, X₂), we can omit the superscript notation to show Eq 7.1:

\begin{array}{l} e_{switch} (f_{β}) = E [{(Y - X_{2}^{'} β_{2})}^{2}] - 2 E [Y - X_{2}^{'} β_{2}] E [X_{1}^{'}] β_{1} + β_{1}^{'} E [X_{1} X_{1}^{'}] β_{1} \\ e_{switch} (f_{β}) = e_{orig} (f_{β}) - 2 E [Y - X_{2}^{'} β_{2}] E [X_{1}^{'}] β_{1} + 2 E [(Y - X_{2}^{'} β_{2}) X_{1}^{'}] β_{1} \\ e_{switch} (f_{β}) = e_{orig} (f_{β}) + 2 Cov (Y - X_{2}^{'} β_{2}, X_{1}) β_{1} \\ e_{switch} (f_{β}) = e_{orig} (f_{β}) + 2 Cov (Y, X_{1}) β_{1} - 2 β_{2} Cov (X_{2}, X_{1}) β_{1} . \end{array}

Dividing both sides by e_orig(f_β) gives the desired result.

Next, we can use a similar approach to show Eq 7.2:

\begin{array}{l} {\hat{e}}_{switch} (f_{β}) = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} \sum_{j \neq 1} {(y_{[j]} - X_{2 [j, \cdot]} β_{2} - X_{1 [i, \cdot]} β_{1})}^{2} \\ n (n - 1) {\hat{e}}_{switch} (f_{β}) = \sum_{i = 1}^{n} \sum_{j \neq i} {{(y_{[j]} - X_{2 [j, \cdot]} β_{2})}^{2} - 2 (y_{[j]} - X_{2 [j, \cdot]} β_{2}) (X_{1 [i, \cdot]} β_{1}) + {(X_{1 [i, \cdot]} β_{1})}^{2}} \\ = (n - 1) {\sum_{i = 1}^{n} (y_{[i]} - X_{2 [i, \cdot]} β_{2})}^{2} \\ - 2 {\sum_{i = 1}^{n} \sum_{j \neq i} (X_{1 [i, \cdot]} β_{1}) (y_{[j]} - X_{2 [j, \cdot]} β_{2})} + (n - 1) {\sum_{i = 1}^{n} (X_{1 [i, \cdot]} β_{1})}^{2} . \end{array}

(B.48)

Focusing on the term in braces,

\begin{array}{l} \sum_{i = 1}^{n} \sum_{j \neq i} (X_{1 [i, \cdot]} β_{1}) (y_{[j]} - X_{2 [j, \cdot]} β_{2}) \\ = \sum_{i = 1}^{n} \sum_{j = 1}^{n} (X_{1 [i, \cdot]} β_{1}) (y_{[j]} - X_{2 [j, \cdot]} β_{2}) - \sum_{i = 1}^{n} (X_{1 [i, \cdot]} β_{1}) (y_{[j]} - X_{2 [j, \cdot]} β_{2}) \\ = \sum_{i = 1}^{n} (X_{1 [i, \cdot]} β_{1}) \sum_{j = 1}^{n} (y_{[j]} - X_{2 [j, \cdot]} β_{2}) - \sum_{i = 1}^{n} (X_{1 [i, \cdot]} β_{1}) (y_{[j]} - X_{2 [j, \cdot]} β_{2}) \\ = {{(X_{1} β_{1})}^{'} 1_{n}} {1_{n}^{'} (y - X_{2} β_{2})} - {(X_{1} β_{1})}^{'} (y - X_{2} β_{2}) \\ = {(X_{1} β_{1})}^{'} (1_{n} 1_{n}^{'} - I_{n}) (y - X_{2} β_{2}) . \end{array}

(B.49)

Plugging this into Eq B.48, and applying the sample linear algebra representations as in Eq B.49, we get

\begin{array}{l} n (n - 1) {\hat{e}}_{switch} (f_{β}) = (n - 1) {‖ y - X_{2} β_{2} ‖}_{2}^{2} \\ - 2 {(X_{1} β_{1})}^{'} (1_{n} 1_{n}^{'} - I_{n}) (y - X_{2} β_{2}) \\ + (n - 1) {‖ X_{1} β_{1} ‖}_{2}^{2} \\ n {\hat{e}}_{switch} (f_{β}) = {‖ y - X_{2} β_{2} ‖}_{2}^{2} \\ - 2 {(X_{1} β_{1})}^{'} W (y - X_{2} β_{2}) \\ + {‖ X_{1} β_{1} ‖}_{2}^{2} \\ = y^{'} y - 2 y^{'} X_{2} β_{2} + β_{2}^{'} X_{2}^{'} X_{2} β_{2} \\ - 2 β_{1}^{'} X_{1}^{'} W_{y} + 2 β_{1}^{'} X_{1}^{'} W X_{2} β_{2} \\ + β_{1}^{'} X_{1}^{'} X_{1} β_{1} \\ = y^{'} y - 2 {[\begin{matrix} X_{1}^{'} W_{y} \\ X_{2}^{'} y \end{matrix}]}^{'} β + β^{'} [\begin{matrix} X_{1}^{'} X_{1} & X_{1}^{'} W X_{2} \\ X_{2}^{'} W X_{1} & X_{2}^{'} X_{2} \end{matrix}] β . \end{array}

■

B.9. Proof of Proposition 19

Proof First we consider e_orig(f₀). We briefly recall that the notation f₀(t, c) refers to the true conditional expectation function for both potential outcomes Y₁, Y₀, rather than the expectation for Y₀ alone.

Under the assumption that (Y₁, Y₀) ⊥ T|C, we have $f_{0} (t, c) = E (Y | C = c, T = t) = E (Y_{t} | C = c)$ . Applying this, we see that

\begin{array}{l} e_{orig} (f_{0}) = E L (f_{0}, (Y, T, C)) \\ = E L (f_{0}, (Y_{T}, T, C)) \\ = E_{T} E_{C | T} E_{Y_{T} | C} [{Y_{T} - E (Y_{T} | C)}^{2}] \\ = E_{T} E_{C | T} Var (Y_{T} | C) \\ = q E_{C | T = 0} Var (Y_{0} | C) + p E_{C | T = 1} Var (Y_{1} | C), \end{array}

(B.50)

where $p : = P (T = 1)$ and $q : = P (T = 0)$ .

Now we consider e_switch(f₀). Let $(Y_{0}^{(a)}, Y_{1}^{(a)}, T^{(a)}, C^{(a)})$ and $(Y_{0}^{(b)}, Y_{1}^{(b)}, T^{(b)}, C^{(b)})$ be a pair of independent random variable vectors, each with the same distribution as (Y₀, Y₁, T, C). Then

\begin{array}{l} e_{switch} (f_{0}) = E_{T^{(b)}, T^{(a)}, C^{(b)}, Y_{T^{(b)}}^{(b)}} [{Y_{T^{(a)}}^{(b)} - f_{0} (T^{(a)}, C^{(b)})}^{2}] \\ = E_{T^{(b)}, T^{(a)}, C^{(b)}, Y_{T^{(b)}}^{(b)}} [{Y_{T^{(b)}}^{(b)} - E (Y_{T^{(a)}} | C = C^{(b)})}^{2}] \\ = E_{T^{(b)}, T^{(a)}} E_{C^{(b)} | T^{(b)}} E_{Y_{T^{(b)}}^{(b)} | C^{(b)}} [{Y_{T^{(b)}}^{(b)} - E (Y_{T^{(a)}} | C = C^{(b)})}^{2}] . \end{array}

First we expand the outermost expectation, over T^(b), T^(a):

\begin{array}{l} e_{switch} (f_{0}) \\ = \sum_{i, i ϵ {0, 1}} P (T^{(b)} = i, T^{(a)} = j) E_{C^{(b)} | T^{(b)} = i} E_{Y_{i}^{(b)} | C^{(b)}} [{Y_{i}^{(b)} - E (Y_{j} | C = C^{(b)})}^{2}] . \end{array}

(B.51)

Since T^(b) ⊥ T^(a), we can write

\begin{array}{l} P (T^{(b)} = i, T^{(a)} = j) = P (T^{(b)} = i) P (T^{(a)} = j) \\ = p^{i + j} q^{2 - i - j} . \end{array}

Plugging this into Eq B.51 we get

e_{switch} (f_{0}) = \sum_{i, j ϵ {0, 1}} p^{i + j} q^{2 - i - j} E_{C^{(b)} | T^{(b)} = i} E_{Y_{i}^{(b)} | C^{(b)}} [{Y_{i}^{(b)} - E (Y_{j} | C = C^{(b)})}^{2}] .

Since $(Y_{0}^{(b)}, Y_{1}^{(b)}, C^{(b)}, T^{(b)})$ are the only random variables remaining, we can omit the superscript notation (e.g., replace C^(b) with C) to get

\begin{array}{l} e_{switch} (f_{0}) = \sum_{i, j ϵ {0, 1}} p^{i + j} q^{2 - i - j} E_{C | T = i} E_{Y_{i} | C} [{Y_{i} - E (Y_{i} | C)}^{2}] \\ = : \sum_{i, j ϵ {0, 1}} A_{i j}, \end{array}

where $A_{i j} = p^{i + j} q^{2 - i - j} E_{C | T = i} E_{Y_{i} | C} [{Y_{i} - E (Y_{j} | C)}^{2}]$ . First, we consider A₀₀ and A₁₁:

\begin{array}{l} A_{00} = q^{2} E_{C | T = 0} E_{Y_{0} | C} [{Y_{0} - E (Y_{0} | C)}^{2}] \\ = q^{2} E_{C | T = 0} Var (Y_{0} | C), \end{array}

and, similarly, $A_{11} = p^{2} E_{C | T = 1} Var (Y_{1} | C)$ .

Next we consider A₀₁ and A₁₀:

\begin{array}{l} A_{01} : = p q E_{C | T = 0} E_{Y_{0} | C} [{Y_{0} - E (Y_{1} | C)}^{2}] \\ = p q E_{C | T = 0} (E (Y_{0}^{2} | C) - 2 E (Y_{0} | C) E (Y_{1} | C) + E {(Y_{1} | C)}^{2}) \\ = p q E_{C | T = 0} (Var (Y_{0} | C) + E {(Y_{0} | C)}^{2} - 2 E (Y_{0} | C) E (Y_{1} | C) + E {(Y_{1} | C)}^{2}) \\ = p q E_{C | T = 0} (Var (Y_{0} | C) + {[E (Y_{1} | C) - E (Y_{0} | C)]}^{2}) \\ = p q E_{C | T = 0} (Var (Y_{0} | C) + CATE {(C)}^{2}), \end{array}

and, following the same steps,

A_{10} = p q E_{C | T = 1} (Var (Y_{1} | C) + CATE {(C)}^{2}) .

Plugging in A₀₀, A₀₁, A₁₀, and A₁₁ we get

\begin{array}{l} e_{switch} (f_{0}) = {A_{00} + A_{11}} \\ + [A_{01} + A_{10}] \\ = {q^{2} E_{C | T = 0} Var (Y_{0} | C) + p^{2} E_{C | T = 1} Var (Y_{1} | C)} \\ + [p q E_{C | T = 0} (Var (Y_{0} | C) + CATE {(C)}^{2}) + p q E_{C | T = 1} (Var (Y_{0} | C) + CATE {(C)}^{2})] \end{array}

\begin{array}{l} = {q (q + p) E_{C | T = 0} Var (Y_{0} | C) + p (p + q) E_{C | T = 1} Var (Y_{1} | C)} \end{array}

(B.52)

\begin{array}{l} + p q [E_{C | T = 0} (CATE {(C)}^{2}) + E_{C | T = 1} (CATE {(C)}^{2})] \end{array}

(B.53)

\begin{array}{l} = {e_{orig} (f_{0})} \end{array}

(B.54)

\begin{array}{l} + Var (T) [E_{C | T = 0} (CATE {(C)}^{2}) + E_{C | T = 1} (CATE {(C)}^{2})] . \end{array}

(B.55)

In Lines B.52 and B.53, we consolidate terms involving $E_{C | T = 0} Var (Y_{0} | C)$ and $E_{C | T = 1} Var (Y_{1} | C)$ . In Line B.54, we use p+q = 1 to reduce Line B.52 to the right-hand side of Eq B.50. Finally, in Line B.55, we use qp = Var(T). Dividing both sides by $e_{orig} (f_{0}) = E_{T, C} Var (Y | T, C)$ gives the desired result. ■

Appendix C. Proofs for Computational Results

Almost all of the proofs in this section are unchanged if we replace ê_switch (f) with ê_divide(f) in our definitions of ĥ_−,γ, ĥ_+,γ, ĝ_−,γ, ĝ_+,γ, and $\hat{M R}$ . The only exception is in Appendix C.3.

Throughout the following proofs, we will make use of the fact that, for constants $a, b, c, d \in R$ satisfying a ≥ c, the relation a + b ≤ c + d implies

\begin{array}{l} a + b \leq c + d \\ a - c \leq d - b \\ 0 \leq d - b since 0 \leq a - c \\ b \leq d . \end{array}

(C.1)

We also make use of the fact that for any $γ_{1}, γ_{2} \in R$ , the definitions of ĝ_+,γ₁ and ĝ_−,γ₁ imply

{\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{1}}) \leq {\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{2}}), and {\hat{h}}_{-, γ_{1}} (g_{-, γ_{1}}) \leq {\hat{h}}_{-, γ_{1}} (g_{-, γ_{2}}) .

(C.2)

Finally, for any two values $γ_{1}, γ_{2} \in R$ , we make use of the fact that

\begin{array}{l} {\hat{h}}_{+, γ_{1}} (f) = {\hat{e}}_{orig} (f) + γ_{1} {\hat{e}}_{switch} (f) \\ = {\hat{e}}_{orig} (f) + γ_{2} {\hat{e}}_{switch} (f) + {γ_{1} {\hat{e}}_{switch} (f) - γ_{2} {\hat{e}}_{switch} (f)} \\ = {\hat{h}}_{+, γ_{2}} (f) + (γ_{1} - γ_{2}) {\hat{e}}_{switch} (f), \end{array}

(C.3)

and, by the same steps,

{\hat{h}}_{-, γ_{1}} (f) = {\hat{h}}_{-, γ_{2}} (f) + (γ_{1} - γ_{2}) {\hat{e}}_{orig} (f) .

(C.4)

C.1. Proof of Lemma 9 (Lower Bound for MR)

Proof We prove Lemma 9 in 2 parts.

C.1.1. Part 1: Showing Eq 6.1 Holds for All $f \in F$ Satisfying ê_orig(f) ≤ ϵ_abs.

If ĥ_−,γ (ĝ_−,γ) ≥ 0, then for any function $f \in F$ satisfying ê_orig(f) ≤ ϵ_abs we know that

\begin{matrix} \frac{1}{ϵ_{abs}} \leq \frac{1}{{\hat{e}}_{orig} (f)} \\ \frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{ϵ_{abs}} \leq \frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{{\hat{e}}_{orig} (f)} . \end{matrix}

(C.5)

Now, for any $f \in F$ satisfying ê_orig(f) ≤ ϵ_abs, the definition of ĝ_−,γ implies that

\begin{array}{l} {\hat{h}}_{-, γ} (f) \geq {\hat{h}}_{-, γ} ({\hat{g}}_{-, γ}) \\ γ {\hat{e}}_{orig} (f) + {\hat{e}}_{switch} (f) \geq {\hat{h}}_{-, γ} ({\hat{g}}_{-, γ}) \\ γ + \frac{{\hat{e}}_{switch} (f)}{{\hat{e}}_{orig} (f)} \geq \frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{{\hat{e}}_{orig} (f)} \\ γ + \frac{{\hat{e}}_{switch} (f)}{{\hat{e}}_{orig} (f)} \geq \frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{ϵ_{abs}} & from Eq C.5 \\ \hat{M R} (f) \geq \frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{ϵ_{abs}} - γ . \end{array}

C.1.2. Part 2: Showing that, if $f = {\hat{g}}_{-, γ}$ , and at Least One of the Inequalities in Condition 8 Holds with Equality, then Eq 6.1 Holds with Equality.

We consider each of the two inequalities in Condition 8 separately. If ${\hat{h}}_{-, γ} ({\hat{g}}_{-, γ}) = 0$ , then

\begin{array}{l} 0 = γ {\hat{e}}_{orig} ({\hat{g}}_{-, γ}) + {\hat{e}}_{switch} ({\hat{g}}_{-, γ}) \\ \frac{- {\hat{e}}_{switch} ({\hat{g}}_{-, γ})}{{\hat{e}}_{orig} ({\hat{g}}_{-, γ})} = γ . \end{array}

As a result

\frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{ϵ_{abs}} - γ = \frac{0}{ϵ_{abs}} - {\frac{- {\hat{e}}_{switch} ({\hat{g}}_{-, γ})}{{\hat{e}}_{orig} ({\hat{g}}_{-, γ})}} = \hat{MR} ({\hat{g}}_{-, γ}) .

Alternatively, if ${\hat{e}}_{orig} ({\hat{g}}_{-, γ}) = ϵ_{abs}$ , then

\frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{ϵ_{abs}} - γ = \frac{γ {\hat{e}}_{orig} ({\hat{g}}_{-, γ}) + {\hat{e}}_{switch} ({\hat{g}}_{-, γ})}{{\hat{e}}_{orig} ({\hat{g}}_{-, γ})} - γ = γ + \frac{{\hat{e}}_{switch} ({\hat{g}}_{-, γ})}{{\hat{e}}_{orig} ({\hat{g}}_{-, γ})} - γ = \hat{MR} ({\hat{g}}_{-, γ}) .

■

C.2. Proof of Lemma 10 (Monotonicity for MR Lower Bound Binary Search)

Proof We prove Lemma 10 in 3 parts.

C.2.1. Part 1: ${\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})$ is Monotonically Increasing in γ.

Let $γ_{1}, γ_{2} \in R$ satisfy γ₁ < γ₂. We have assumed that $0 < {\hat{e}}_{orig} (f)$ for any $f \in F$ . Thus, for any $f \in F$ we have

\begin{matrix} γ_{1} {\hat{e}}_{orig} (f) + {\hat{e}}_{switch} (f) < γ_{2} {\hat{e}}_{orig} (f) + {\hat{e}}_{switch} (f) \\ \hat{h} -, γ_{1} (f) < \hat{h} -, γ_{2} (f) . \end{matrix}

(C.6)

Applying this, we have

\begin{array}{l} {\hat{h}}_{-}, γ_{1} ({\hat{g}}_{-}, γ_{1}) \leq {\hat{h}}_{-}, γ_{1} ({\hat{g}}_{-}, γ_{2}) & from Eq C.2 \\ \leq {\hat{h}}_{-}, γ_{2} ({\hat{g}}_{-}, γ_{2}) & from Eq C.6. \end{array}

This result is analogous to Lemma 3 from Dinkelbach (1967).

C.2.2. Part 2: ${\hat{e}}_{ORIG} ({\hat{g}}_{-, γ})$ is Monotonically Decreasing in γ.

Let $γ_{1}, γ_{2} \in R$ satisfy γ₁ < γ₂. Then

\begin{matrix} {\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{1}}) \leq {\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{2}}) & from Eq C.2 \\ {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{1}}) + (γ_{1} - γ_{2}) {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) \leq {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}}) + (γ_{1} - γ_{2}) {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}}) & from Eq C.4 \\ (γ_{1} - γ_{2}) {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) \leq (γ_{1} - γ_{2}) {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}}) & from Eqs C.1 & C.2 \\ {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) \geq {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}}) . \end{matrix}

C.2.3. Part 3: ${\frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{ϵ_{ABS}} - γ}$ is Monotonically Decreasing in γ in the Range Where ${\hat{e}}_{ORIG} ({\hat{g}}_{-, γ}) \leq ϵ_{ABS}$ , and Increasing Otherwise.

Suppose $γ_{1} < γ_{2}$ and ${\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}})$ , ${\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}}) \leq ϵ_{abs}$ . Then, from Eq C.2,

\begin{array}{l} {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}}) \leq {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{1}}) \\ {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}}) \leq {\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{1}}) + (γ_{2} - γ_{1}) {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) & from Eq C.4 \\ {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}}) \leq {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{1}}) + (γ_{2} - γ_{1}) ϵ_{abs} & from {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) \leq ϵ_{abs} \\ \frac{{\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}})}{ϵ_{abs}} - γ_{2} \leq \frac{{\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{1}})}{ϵ_{abs}} - γ_{1} . \end{array}

Similarly, if $γ_{1} < γ_{2}$ and ${\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}})$ , ${\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}}) \leq ϵ_{abs}$ . Then, from Eq C.2

\begin{array}{l} {\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{1}}) \leq {\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{2}}) \\ {\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{1}}) \leq {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}}) + (γ_{1} - γ_{2}) {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}}) & from Eq C.4 \\ {\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{1}}) \leq {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}}) + (γ_{1} - γ_{2}) ϵ_{abs} & from {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) \geq ϵ_{abs} \\ \frac{{\hat{h}}_{-, γ_{1}} ({\hat{g}}_{-, γ_{1}})}{ϵ_{abs}} - γ_{2} \leq \frac{{\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}})}{ϵ_{abs}} - γ_{2} . \end{array}

■

C.3. Proof of Proposition 11 (Nonnegative Weights for MR Lower Bound Binary Search)

Proof Let $γ_{1} : = \frac{1}{n - 1}$ . First we show that there exists a function ${\hat{g}}_{-, γ_{1}}$ minimizing ${\hat{h}}_{-, γ_{1}}$ such that $\hat{M R} ({\hat{g}}_{-, γ_{1}}) = 1$ . Let D_s denote the sample distribution of the data, and let D_m be the distribution satisfying

\begin{array}{l} P_{D_{m}} {(Y, X_{1}, X_{2}) = (Y, X_{1}, X_{2})} = P_{D_{s}} {(Y, X_{2}) = (y, x_{2})} \times P_{D_{s}} (X_{1} = x_{1}) \\ = \frac{1}{n^{2}} \sum_{i = 1}^{n} 1 (y_{[i]} = y and X_{2 [i]} = x_{2}) \sum_{j = 1}^{n} 1 (X_{1 [j]} = x_{1}) . \end{array}

From $γ_{1} = \frac{1}{n - 1}$ and Eq 6.2, we see that

\begin{array}{l} {\hat{h}}_{-, γ_{1}} (f) = \sum_{i = 1}^{n} \sum_{j = 1}^{n} L {f, (y_{[i]}, X_{1 [j]}, X_{2 [i]})} \times {\frac{γ_{1} 1 (i = j)}{n} + \frac{1 (i \neq j)}{n (n - 1)}} \\ = \sum_{i = 1}^{n} \sum_{j = 1}^{n} L {f, (y_{[i]}, X_{1 [j]}, X_{2 [i]})} \times {\frac{1}{n (n - 1)}} . \\ \propto E_{D_{m}} L {f, (Y, X_{1}, X_{2})} . \end{array}

Thus, from Condition 2 of Proposition 11, we know there exists a function ${\hat{g}}_{-, γ_{1}}$ that minimizes ${\hat{h}}_{-, γ_{1}}$ with ${\hat{g}}_{-, γ_{1}} (x_{1}^{(a)}, x_{2}) = {\hat{g}}_{-, γ_{1}} (x_{1}^{(b)}, x_{2})$ for any $x_{1}^{(a)}, x_{1}^{(b)} \in X_{1}$ and $x_{2} \in X_{2}$ . Condition 1 of Proposition 11 then implies that $L {{\hat{g}}_{-, γ_{1}}, (y, x_{1}^{(a)}, x_{2})} = L {{\hat{g}}_{-, γ_{1}}, (y, x_{1}^{(b)}, x_{2})}$ for any $x_{1}^{(a)}, x_{1}^{(b)} \in X_{1}$ , $x_{2} \in X_{2}$ , and $y \in Y$ . We apply this result in Line C.7, below, to show that loss of model ${\hat{g}}_{-, γ_{1}}$ is unaffected by permuting X₁ within our sample:

\begin{array}{l} {\hat{e}}_{switch} ({\hat{g}}_{-, γ_{1}}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{n - 1} \sum_{j \neq i} L {{\hat{g}}_{-, γ_{1}}, (y_{[i]}, X_{1 [j]}, X_{2 [j]})} \\ = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{n - 1} \sum_{j \neq i} L {{\hat{g}}_{-, γ_{1}}, (y_{[i]}, X_{1 [j]}, X_{2 [j]})} \\ = \frac{1}{n} \sum_{i = 1}^{n} L {{\hat{g}}_{-, γ_{1}}, (y_{[i]}, X_{1 [j]}, X_{2 [j]})} \\ = {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) . \end{array}

(C.7)

It follows that $\hat{M R} ({\hat{g}}_{-, γ_{1}}) = 1$ . To show the result of Proposition 11, let γ₂ = 0. For any function ${\hat{g}}_{-, γ_{2}}$ minimizing ${\hat{h}}_{-, γ_{2}}$ , we know that

\begin{matrix} {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}}) \leq {\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{1}}) & from the definition of {\hat{g}}_{-, γ_{2}} \\ 0 + {\hat{e}}_{switch} ({\hat{g}}_{-, γ_{2}}) \leq 0 + {\hat{e}}_{switch} ({\hat{g}}_{-, γ_{1}}) & from γ_{2} = 0 and the definition of {\hat{h}}_{-, γ_{2}} . \end{matrix}

(C.8)

From γ₂ ≤ γ₁, and Part 2 of Lemma 10, we know that

{\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}}) \geq {\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}}) .

(C.9)

Combining Eqs C.8 and C.9, we have

\hat{M R} ({\hat{g}}_{-, γ_{2}}) = \frac{{\hat{e}}_{switch} ({\hat{g}}_{-, γ_{2}})}{{\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}})} \leq \frac{{\hat{e}}_{switch} ({\hat{g}}_{-, γ_{1}})}{{\hat{e}}_{orig} ({\hat{g}}_{-, γ_{1}})} = \hat{M R} ({\hat{g}}_{-, γ_{1}}) = 1.

(C.10)

Since ${\hat{h}}_{-, γ_{2}} ({\hat{g}}_{-, γ_{2}}) = {\hat{e}}_{switch} ({\hat{g}}_{-, γ_{2}}) \geq 0$ by definition, Condition 8 holds for γ₂, ϵ_abs and ${\hat{g}}_{-, γ_{2}}$ if and only if ${\hat{e}}_{orig} ({\hat{g}}_{-, γ_{2}}) \leq ϵ_{abs}$ . This, combined with Eq C.10, completes the proof.

The same result does not necessarily hold if we replace ê_switch with ê_divide in our definitions of ${\hat{h}}_{-, γ}$ , $\hat{M R}$ , and ${\hat{M C R}}_{-}$ . This is because ê_divide does not correspond to the expectation over a distribution in which X₁ is independent of X₂ and Y, due to the fixed pairing structure used in ê_divide. Thus, Condition 2 of Proposition 11 will not apply. ■

C.4. Proof of Lemma 13 (Upper Bound for MR)

Proof We prove Lemma 13 in 2 parts.

C.4.1. Part 1: Showing Eq 6.4 Holds for All $f \in F$ Satisfying ${\hat{e}}_{ORIG} (f) \leq ϵ_{ABS}$

If ${\hat{h}}_{+, γ} ({\hat{g}}_{+, γ}) \geq 0$ , then for any function $f \in F$ satisfying ${\hat{e}}_{orig} (f) \leq ϵ_{abs}$ we know that

\begin{matrix} \frac{1}{ϵ_{abs}} \leq \frac{1}{{\hat{e}}_{orig} (f)} \\ \frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{abs}} \leq \frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{{\hat{e}}_{orig} (f)} . \end{matrix}

(C.11)

Now, if γ ≤ 0, then for any $f \in F$ satisfying ${\hat{e}}_{orig} (f) \leq ϵ_{abs}$ , the definition of ĝ_+,γ implies

\begin{array}{l} {\hat{h}}_{+, γ} (f) \geq {\hat{h}}_{+, γ} ({\hat{g}}_{+, γ}) \\ {\hat{e}}_{orig} (f) + γ {\hat{e}}_{switch} (f) \geq {\hat{h}}_{+, γ} ({\hat{g}}_{+, γ}) \\ 1 + γ \frac{{\hat{e}}_{switch} (f)}{{\hat{e}}_{orig} (f)} \geq \frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{{\hat{e}}_{orig} (f)} \\ 1 + γ \frac{{\hat{e}}_{switch} (f)}{{\hat{e}}_{orig} (f)} \geq \frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{abs}} & from Eq C.11 \\ 1 + γ \hat{M R} (f) \geq \frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{abs}} \\ \hat{M R} (f) \leq {\frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{abs}} - 1} γ^{- 1} . \end{array}

C.4.2. Part 2: Showing that if f = ĝ_+,γ, and at Least One of the Enequalities in Condition 12 Holds with Equality, then Eq 6.4 Holds with Equality.

We consider each of the two inequalities in Condition 12 separately. If ${\hat{h}}_{+, γ} ({\hat{g}}_{+, γ}) = 0$ , then

\begin{array}{l} 0 = {\hat{e}}_{orig} ({\hat{g}}_{+, γ}) + γ {\hat{e}}_{switch} ({\hat{g}}_{+, γ}) \\ - γ {\hat{e}}_{switch} ({\hat{g}}_{+, γ}) = {\hat{e}}_{orig} ({\hat{g}}_{+, γ}) \\ - {\frac{{\hat{e}}_{switch} ({\hat{g}}_{+, γ})}{{\hat{e}}_{orig} ({\hat{g}}_{+, γ})}} = \frac{1}{γ} . \end{array}

As a result,

{\frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{abs}} - 1} γ^{- 1} = {\frac{0}{ϵ_{abs}} - 1} {\frac{{\hat{e}}_{switch} ({\hat{g}}_{+, γ})}{{\hat{e}}_{orig} ({\hat{g}}_{+, γ})}} = \hat{MR} ({\hat{g}}_{+, γ}) .

Alternatively, if ${\hat{e}}_{orig} ({\hat{g}}_{+, γ}) \leq ϵ_{abs}$ , then

\begin{array}{l} {\frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{abs}} - 1} γ^{- 1} = {\frac{{\hat{e}}_{orig} ({\hat{g}}_{+, γ}) + γ {\hat{e}}_{switch} ({\hat{g}}_{+, γ})}{{\hat{e}}_{orig} ({\hat{g}}_{+, γ})} - 1} γ^{- 1} = {1 + γ \frac{{\hat{e}}_{switch} ({\hat{g}}_{+, γ})}{{\hat{e}}_{orig} ({\hat{g}}_{+, γ})} - 1} γ^{- 1} \\ = \hat{MR} ({\hat{g}}_{+, γ}) . \end{array}

■

C.5. Proof of Lemma 14 (Monotonicity for MR Upper Bound Binary Search)

Proof We prove Lemma 14 in 3 parts.

C.5.1. Part 1: ${\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})$ is Monotonically Increasing in γ.

Let $γ_{1} < γ_{2} \in R$ satisfy $γ_{1} \leq γ_{2}$ . We have assumed that $0 \leq {\hat{e}}_{orig} (f)$ for any $f \in F$ . Thus, for any $f \in F$ we have

\begin{matrix} {\hat{e}}_{orig} (f) + γ_{1} {\hat{e}}_{switch} (f) < {\hat{e}}_{orig} (f) + γ_{2} {\hat{e}}_{switch} (f) \\ {\hat{h}}_{+, γ_{1}} (f) < {\hat{h}}_{+, γ_{2}} (f) . \end{matrix}

(C.12)

Applying this, we have

\begin{array}{l} {\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{1}}) \leq {\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{2}}) & from Eq C.2 \\ < {\hat{h}}_{+, γ_{2}} ({\hat{g}}_{+, γ_{2}}) & from Eq C.12. \end{array}

C.5.2. Part 2: ${\hat{e}}_{ORIG} ({\hat{g}}_{+, γ_{2}})$ is Monotonically Decreasing in γ for γ ≤ 0, and Condition 12 Holds for γ = 0 and $ϵ_{ABS} \geq \min_{f \in F} {\hat{e}}_{ORIG} (f)$ .

Let $γ_{1} < γ_{2} \in R$ satisfy γ₁ < γ₂ ≤ 0. Then

\begin{matrix} {\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{1}}) \leq {\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{2}}) & from Eq C.2 \\ {\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{1}}) + (γ_{1} - γ_{2}) {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{1}}) \leq {\hat{h}}_{+, γ_{2}} ({\hat{g}}_{+, γ_{2}}) + (γ_{1} - γ_{2}) {\hat{e}}_{switch} (g_{γ 2}) & from Eq C.3 \\ (γ_{1} - γ_{2}) {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{1}}) \leq (γ_{1} - γ_{2}) {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{2}}) & from Eqs C.1 & C.2 \\ {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{1}}) \geq {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{2}}) \\ γ_{2} {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{1}}) \leq γ_{2} {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{2}}) & from γ_{2} \leq 0 . \end{matrix}

(C.13)

Now we are equipped to show the result that ${\hat{e}}_{orig} ({\hat{g}}_{+, γ_{2}})$ is monotonically decreasing in γ for γ ≤ 0:

\begin{matrix} {\hat{h}}_{+, γ 2} ({\hat{g}}_{+, γ 2}) \leq {\hat{h}}_{+, γ_{2}} ({\hat{g}}_{+, γ_{1}}) & from Eq C.2 \\ {\hat{e}}_{orig} ({\hat{g}}_{+, γ_{2}}) + γ_{2} {\hat{e}}_{switch} ({\hat{g}}_{+, γ 2}) + {\hat{e}}_{orig} ({\hat{g}}_{+, γ_{1}}) + γ_{2} {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{1}}) \\ {\hat{e}}_{orig} ({\hat{g}}_{+, γ_{2}}) + γ_{2} {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{1}}) & from Eqs C.1 & C.13. \end{matrix}

(C.14)

To show that Condition 12 holds for γ = 0 and $\min_{f \in F} {\hat{e}}_{orig} (f) \leq ϵ {}_{abs}$ , we first note that $h_{0, +} (g_{0, +}) = {\hat{e}}_{orig} (g_{0, +})$ , which is positive by assumption. Second, we note that

{\hat{e}}_{orig} (g_{0}_{, +}) = h_{0, +} (g_{0}_{, +}) = \min_{f \in F} h_{0, +} (f) = \min_{f \in F} {\hat{e}}_{orig} (f) \leq ϵ_{abs} .

C.5.3. Part 3: ${\frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{ABS}} - 1} γ^{- 1}$ is Monotonically Increasing in γ in the Range Where ${\hat{e}}_{ORIG} ({\hat{g}}_{+, γ}) \leq ϵ_{ABS}$ and γ < 0, and Decreasing in the Range Where ${\hat{e}}_{ORIG} ({\hat{g}}_{+, γ}) > ϵ_{ABS}$ and γ < 0.

To prove the first result, suppose that γ₁ < γ₂ < 0 and ${\hat{e}}_{orig} ({\hat{g}}_{+, γ_{1}})$ , ${\hat{e}}_{orig} ({\hat{g}}_{+, γ_{2}}) \leq ϵ_{abs}$ . This implies

\begin{matrix} \frac{1}{γ_{2}} < \frac{1}{γ_{1}} \\ \frac{{\hat{e}}_{orig} ({\hat{g}}_{+, γ_{1}}) - ϵ_{abs}}{γ_{2}} < \frac{{\hat{e}}_{orig} ({\hat{g}}_{+, γ_{1}}) - ϵ_{abs}}{γ_{1}} . \end{matrix}

(C.15)

Then, starting with Eq C.2,

\begin{array}{l} {\hat{h}}_{+, γ_{2}} ({\hat{g}}_{+, γ_{2}}) \leq {\hat{h}}_{+, γ_{2}} ({\hat{g}}_{+, γ_{1}}) \\ {\hat{h}}_{+, γ_{2}} ({\hat{g}}_{+, γ_{2}}) \leq γ_{2} {\hat{e}}_{switch} ({\hat{g}}_{+, γ 1}) + {\hat{e}}_{orig} ({\hat{g}}_{+, γ_{1}}) \\ \frac{{\hat{h}}_{+, γ_{2}} ({\hat{g}}_{+, γ_{2}}) - ϵ_{abs}}{γ_{2}} \geq {\hat{e}}_{switch} ({\hat{g}}_{+, γ 1}) + \frac{{\hat{e}}_{orig} ({\hat{g}}_{+, γ_{1}}) - ϵ_{abs}}{γ_{2}} \\ \geq {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{1}}) + \frac{{\hat{e}}_{orig} ({\hat{g}}_{+, γ_{1}}) - ϵ_{abs}}{γ_{1}} & from Eq C.15 \\ = \frac{({\hat{h}}_{+, γ_{1}}) ({\hat{g}}_{+, γ 2}) - ϵ_{abs}}{γ_{1}} . \end{array}

Diving both sides of the above equation by ϵ_abs proves that ${\frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{abs}} - 1} γ^{- 1}$ is monotonically decreasing in γ in the range where ${\hat{e}}_{orig} ({\hat{g}}_{+, γ}) \leq ϵ_{abs}$ and γ < 0.

To prove the second result we proceed in the same way. Suppose that γ₁ < γ₂ < 0 and ${\hat{e}}_{orig} ({\hat{g}}_{+, γ_{1}})$ ; ${\hat{e}}_{orig} ({\hat{g}}_{+, γ}) \geq ϵ_{abs}$ . This implies

\begin{matrix} \frac{1}{γ_{2}} < \frac{1}{γ_{1}} \\ \frac{{\hat{e}}_{orig} ({\hat{g}}_{+, γ_{2}}) - ϵ_{abs}}{γ_{2}} < \frac{{\hat{e}}_{orig} ({\hat{g}}_{+, γ_{2}}) - ϵ_{abs}}{γ_{1}} . \end{matrix}

(C.16)

Then, starting with Eq C.2,

\begin{array}{l} {\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{1}}) \leq {\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{2}}) \\ {\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{1}}) \leq γ_{1} {\hat{e}}_{switch} ({\hat{g}}_{+, γ 2}) + {\hat{e}}_{orig} ({\hat{g}}_{+, γ_{2}}) \\ \frac{{\hat{h}}_{+, γ_{1}} ({\hat{g}}_{+, γ_{1}}) - ϵ_{abs}}{γ_{1}} \geq {\hat{e}}_{switch} ({\hat{g}}_{+, γ 2}) + \frac{{\hat{e}}_{orig} ({\hat{g}}_{+, γ_{2}}) - ϵ_{abs}}{γ_{2}} \\ \geq {\hat{e}}_{switch} ({\hat{g}}_{+, γ_{2}}) + \frac{{\hat{e}}_{orig} ({\hat{g}}_{+, γ_{2}}) - ϵ_{abs}}{γ_{2}} & from Eq C.16 \\ = \frac{({\hat{h}}_{+, γ_{1}}) ({\hat{g}}_{+, γ 2}) - ϵ_{abs}}{γ_{2}} . \end{array}

Diving both sides of the above equation by ϵ_abs proves that $[{\frac{{\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})}{ϵ_{abs}} - 1} γ^{- 1}]$ is monotonically decreasing in γ in the range where ${\hat{e}}_{orig} ({\hat{g}}_{+, γ}) > ϵ_{abs}$ and γ < 0. ■

C.6. Proof of Remark 16 (Tractability of Empirical MCR for Linear Model Classes)

Proof To show Remark 16, we apply Proposition 15 to see that

\begin{array}{l} ξ_{orig} {\hat{e}}_{orig} (f_{β}) + ξ_{switch} {\hat{e}}_{switch} (f_{β}) \\ = \frac{ξ_{orig}}{n} {‖ y - X β ‖}_{2}^{2} + ξ_{switch} {\hat{e}}_{switch} (f_{β}) \\ = \frac{ξ_{orig}}{n} (y^{'} y - 2 y^{'} X β + β^{'} X^{'} X β) \\ + \frac{ξ_{switch}}{n} {y^{'} y - 2 [\begin{matrix} X_{1}^{'} W y \\ X_{2}^{'} y \end{matrix}] β + β^{'} [\begin{matrix} X_{1}^{'} X_{1} & X_{2}^{'} W X_{2} \\ X_{2}^{'} W X_{1} & X_{2}^{'} X_{2} \end{matrix}] β} \\ \propto_{β} - 2 q^{'} β + β^{'} Q β . \end{array}

■

C.7. Proof of Lemma 17 (Loss Upper Bound for Linear Models)

Proof Under the conditions in Lemma 17 and Eq 7.5, we can construct an upper bound on $L (f_{β,} (y, x)) = {(y - x^{'} β)}^{2}$ by either maximizing or minimizing x′β. First, we consider the maximization problem

\max_{β, x \in R^{p}} x^{'} β subjeted to x^{'} M_{lm}^{- 1} x \leq r_{X} and β^{'} M_{lm} β \leq r_{lm} .

(C.17)

We can see that both constraints hold with equality at the solution to this problem. Next, we apply the change of variables $\tilde{x} = \frac{1}{\sqrt{{}^{r}X}} D^{\frac{- 1}{2}} U^{'} x$ and $\tilde{β} = \frac{1}{\sqrt{r_{lm}}} D^{\frac{1}{2}} U^{'} β$ , where $U D U^{'} = M_{lm}$ is the eigendecomposition of M_lm. We obtain

\max_{\tilde{β}, \tilde{x} \in R^{p}} {\tilde{x}}^{'} \tilde{β} \sqrt{r_{X} r_{lm}} subject to {\tilde{x}}^{'} \tilde{x} = 1 and {\tilde{β}}^{'} \tilde{β} = 1,

which has an optimal objective value equal to $\sqrt{{}^{r}X^{r} lm}$ . By negating the objective in Eq C.17, we see that the minimum possible value of x′β, subject to the constraints in Eq 7.5 and Lemma 17, is found at $- \sqrt{{}^{r}X^{r} lm}$ . Thus, we know that

L (f, (y, x_{1}, x_{2})) \leq \max [{(\min_{y \in Y} y) - \sqrt{r_{X} r_{lm}}}^{2}, {(\max_{y \in Y} y) + \sqrt{r_{X} r_{lm}}}^{2}],

for any $(y, x_{1}, x_{2}) \in (Y, X_{1}, X_{2})$ . ■

C.8. Proof of Lemma 18 (Loss Upper Bound for Regression in a RKHS)

This proofs follows a similar structure as the proof in Section C.7. From the assumptions of Lemma 18, we know from Eq 7.7 that the largest possible output from a model $f_{α} \in F_{D, r_{k}}$ is

\begin{array}{l} μ + \max_{x \in R^{R}, α \in R^{R}} \sum_{i = 1}^{R} k (x, D_{[i, 1]}) α_{[i]} & subjected to υ {(x)}^{'} K_{D}^{- 1} υ (x) \leq r_{D} and α^{'} K_{D} α \leq r_{k} \\ = μ + \max_{x \in R^{R}, α \in R^{R}} υ (x)^{'} α & subjected to υ {(x)}^{'} K_{D}^{- 1} υ (x) \leq r_{D} and α^{'} K_{D} α \leq r_{k} \\ \leq μ + \max_{z, α \in R^{R}} z^{'} α & subjected to z^{'} K_{D}^{- 1} z \leq r_{D} and α^{'} K_{D} α \leq r_{k} . \end{array}

The above problem can be solved in the same way as Eq C.17, and has a solution at $(μ + \sqrt{r_{D} r_{k}})$ . The smallest possible model output will similarly be lower bounded by − $(μ + \sqrt{r_{D} r_{k}})$ . Thus, B_ind is less than or equal to

\max [{\min_{y \in Y} (y) - (μ + \sqrt{r_{D} r_{k}})}^{2}, {\max_{y \in Y} (y) + (μ + \sqrt{r_{D} r_{k}})}^{2}] .

Contributor Information

Aaron Fisher, Takeda Pharmaceuticals, Cambridge, MA 02139, USA.

Cynthia Rudin, Departments of Computer Science and Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA.

Francesca Dominici, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.

References

Altmann André, Toloşi Laura, Sander Oliver, and Lengauer Thomas. Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10):1340–1347, 2010. [DOI] [PubMed] [Google Scholar]
Archer Kellie J and Kimes Ryan V. Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4):2249–2260, 2008. [Google Scholar]
Azen Razia, Budescu David V, and Reiser Benjamin. Criticality of predictors in multiple regression. British Journal of Mathematical and Statistical Psychology, 54(2):201–225, 2001. [DOI] [PubMed] [Google Scholar]
Beckett Katherine, Nyrop Kris, and Pfingst Lori. Race, drugs, and policing: understanding disparities in drug delivery arrests. Criminology, 44(1):105–137, 2006. [Google Scholar]
Blair Irene V, Judd Charles M, and Chapleau Kristine M. The influence of afrocentric facial features in criminal sentencing. Psychological science, 15(10):674–679, 2004. [DOI] [PubMed] [Google Scholar]
Boyd Stephen and Vandenberghe Lieven. Convex Optimization. Cambridge university press, 2004. [Google Scholar]
Breiman Leo. Random forests. Machine learning, 45(1):5–32, 2001. [Google Scholar]
Breiman Leo et al. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical science, 16(3):199–231, 2001. [Google Scholar]
Calle M Luz and Urrea Víctor. Letter to the editor: stability of random forest importance measures. Briefings in bioinformatics, 12(1):86–89, 2010. [DOI] [PubMed] [Google Scholar]
Chipman Hugh A, George Edward I, McCulloch Robert E, et al. Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010. [Google Scholar]
Chouldechova Alexandra. Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017. [DOI] [PubMed] [Google Scholar]
Coker Beau, Rudin Cynthia, and King Gary. A theory of statistical inference for ensuring the robustness of scientific results. arXiv preprint arXiv:1804.08646, 2018. [Google Scholar]
Corbett-Davies Sam, Pierson Emma, Feller Avi, and Goel Sharad. A computer program used for bail and sentencing decisions was labeled biased against blacks. it’s actually not that clear. The Washington Post, October 2016. URL https://www.washingtonpost.com/news/monkey-cage/wp/2016/10/17/can-an-algorithm-be-racist-our-analysis-is-more-cautious-than-propublicas/?utm_term=.e896ff1e4107. [Google Scholar]
Corbett-Davies Sam, Pierson Emma, Feller Avi, Goel Sharad, and Huq Aziz. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017. [Google Scholar]
Datta Anupam, Sen Shayak, and Zick Yair. Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016. [Google Scholar]
DeLong Elizabeth R, DeLong David M, and Clarke-Pearson Daniel L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3):837–845, 1988. [PubMed] [Google Scholar]
Demler Olga V, Pencina Michael J, and D’Agostino Ralph B Sr. Misuse of delong test to compare aucs for nested models. Statistics in medicine, 31(23):2577–2587, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
Díaz Iván, Hubbard Alan, Decker Anna, and Cohen Mitchell. Variable importance and prediction methods for longitudinal problems with missing variables. PloS one, 10(3): e0120031, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dinkelbach Werner. On nonlinear fractional programming. Management science, 13(7): 492–498, 1967. [Google Scholar]
Dong Jiayun and Rudin Cynthia. Variable importance clouds: A way to explore variable importance for the set of good models. arXiv preprint arXiv:1901.03209, 2019. [Google Scholar]
Dorfman R. A note on the delta-method for finding variance formulae. The Biometric Bulletin, 1(129–137):92, 1938. [Google Scholar]
Dwork Cynthia, Hardt Moritz, Pitassi Toniann, Reingold Omer, and Zemel Richard. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012. [Google Scholar]
Gevrey Muriel, Dimopoulos Ioannis, and Lek Sovan. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological modelling, 160(3):249–264, 2003. [Google Scholar]
Gregorutti Baptiste, Michel Bertrand, and Saint-Pierre Philippe. Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics & Data Analysis, 90:15–35, 2015. [Google Scholar]
Gregorutti Baptiste, Michel Bertrand, and Saint-Pierre Philippe. Correlation and variable importance in random forests. Statistics and Computing, 27(3):659–678, 2017. [Google Scholar]
Hapfelmeier Alexander, Hothorn Torsten, Ulm Kurt, and Strobl Carolin. A new variable importance measure for random forests with missing data. Statistics and Computing, 24 (1):21–34, 2014. [Google Scholar]
Hastie T, Tibshirani R, and Friedman J. The elements of statistical learning 2nd edition. New York: Springer, 2009. [Google Scholar]
Heider Karl G. The Rashomon effect: when ethnographers disagree. American Anthropologist, 90(1):73–81, 1988. [Google Scholar]
Hoeffding Wassily. A class of statistics with asymptotically normal distribution. The annals of mathematical statistics, pages 293–325, 1948. [Google Scholar]
Hoeffding Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. doi: 10.1080/01621459.1963.10500830. URL https://amstat.tandfonline.com/doi/abs/10.1080/01621459.1963.10500830. [DOI] [Google Scholar]
Hooker Giles. Generalized functional anova diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics, 16(3):709–732, 2007. [Google Scholar]
Horst Reiner and Thoai Nguyen V. Dc programming: overview. Journal of Optimization Theory and Applications, 103(1):1–43, 1999. [Google Scholar]
Kamiran Faisal, Žliobaitė Indrė, and Calders Toon. Quantifying explainable discrimination and removing illegal discrimination in automated decision making. Knowledge and information systems, 35(3):613–644, 2013. [Google Scholar]
Kazemitabar Jalil, Amini Arash, Bloniarz Adam, and Talwalkar Ameet S. Variable importance using decision trees. In Advances in Neural Information Processing Systems, pages 425–434, 2017. [Google Scholar]
Kleinberg Jon, Mullainathan Sendhil, and Raghavan Manish. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017. [Google Scholar]
Larson Jeff, Mattu Surya, Kirchner Lauren, and Angwin Julia. How we analyzed the compas recidivism algorithm. ProPublica, May 2016. URL https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.
Lecué Guillaume. Interplay between concentration, complexity and geometry in learning theory with applications to high dimensional data analysis. PhD thesis, Université Paris-Est, 2011. [Google Scholar]
Lehmann Erich L and Casella George. Theory of point estimation. Springer Science & Business Media, 2006. [Google Scholar]
Letham Benjamin, Letham Portia A, Rudin Cynthia, and Browne Edward P. Prediction uncertainty and optimal experimental design for learning dynamical systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, 26(6):063110, 2016. [DOI] [PubMed] [Google Scholar]
Louppe Gilles, Wehenkel Louis, Sutera Antonio, and Geurts Pierre. Understanding variable importances in forests of randomized trees. In Advances in neural information processing systems, pages 431–439, 2013. [Google Scholar]
Lum Kristian and Isaac William. To predict and serve? Significance, 13(5):14–19, 2016. [Google Scholar]
Meinshausen Nicolai and Bühlmann Peter. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010. [Google Scholar]
Mentch Lucas and Hooker Giles. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881, 2016. [Google Scholar]
Monahan John and Skeem Jennifer L. Risk assessment in criminal sentencing. Annual review of clinical psychology, 12:489–513, 2016. [DOI] [PubMed] [Google Scholar]
Nabi Razieh and Shpitser Ilya. Fair inference on outcomes. In Proceedings of the… AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, volume 2018, page 1931. NIH Public Access, 2018. [PMC free article] [PubMed] [Google Scholar]
Nevo Daniel and Ritov Ya’acov. Identifying a minimal class of models for high-dimensional data. The Journal of Machine Learning Research, 18(1):797–825, 2017. [Google Scholar]
Olden Julian D, Joy Michael K, and Death Russell G. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling, 178(3):389–397, 2004. [Google Scholar]
Park Jaehyun and Boyd Stephen. General heuristics for nonconvex quadratically constrained quadratic programming. arXiv preprint arXiv:1703.07870, 2017. [Google Scholar]
Paternoster Raymond and Brame Robert. Reassessing race disparities in maryland capital cases. Criminology, 46(4):971–1008, 2008. [Google Scholar]
Picard-Fritsche Sarah, Rempel Michael, Tallon Jennifer A., Adler Julian, and Reyes Natalie. Demystifying risk assessment, key principles and controversies. Technical report, 2017. Available at https://www.courtinnovation.org/publications/demystifying-risk-assessment-key-principles-and-controversies.
Pólik Imre and Terlaky Tamás. A survey of the s-lemma. SIAM review, 49(3):371–418, 2007. [Google Scholar]
Ramchand Rajeev, Pacula Rosalie Liccardo, and Iguchi Martin Y. Racial differences in marijuana-users’ risk of arrest in the united states. Drug and alcohol dependence, 84(3): 264–272, 2006. [DOI] [PubMed] [Google Scholar]
Recknagel Friedrich, French Mark, Harkonen Pia, and Yabunaka Ken-Ichi. Artificial neural network approach for modelling and prediction of algal blooms. Ecological Modelling, 96 (1–3):11–28, 1997. [Google Scholar]
Roth Wendy D and Mehta Jal D. The Rashomon effect: combining positivist and interpretivist approaches in the analysis of contested events. Sociological Methods & Research, 31(2):131–173, 2002. [Google Scholar]
Rudin Cynthia. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1:206–215, May 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rudin Cynthia, Wang Caroline, and Coker Beau. The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2019. accepted. [Google Scholar]
Scardi Michele and Harding Lawrence W. Developing an empirical model of phytoplankton primary production: a neural network case study. Ecological modelling, 120(2):213–223, 1999. [Google Scholar]
Semenova Lesia and Rudin Cynthia. A study in rashomon curves and volumes: a new perspective on generalization and model simplicity in machine learning. arXiv preprint arXiv:1908.01755, 2019. [Google Scholar]
Serfling Robert J. Approximation theorems of mathematical statistics. John Wiley & Sons, 1980. [Google Scholar]
Spohn Cassia. Thirty years of sentencing reform: the quest for a racially neutral sentencing process. Criminal justice, 3:427–501, 2000. [Google Scholar]
Statnikov Alexander, Lytkin Nikita I, Lemeire Jan, and Aliferis Constantin F. Algorithms for discovery of multiple markov boundaries. Journal of Machine Learning Research, 14 (Feb):499–566, 2013. [PMC free article] [PubMed] [Google Scholar]
Strobl Carolin, Boulesteix Anne-Laure, Zeileis Achim, and Hothorn Torsten. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC bioinformatics, 8(1):25, 2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
Strobl Carolin, Boulesteix Anne-Laure, Kneib Thomas, Augustin Thomas, and Zeileis Achim. Conditional variable importance for random forests. BMC bioinformatics, 9 (1):307, 2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stuart Elizabeth A. Matching methods for causal inference: a review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics, 25(1):1, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
Toloşi Laura and Lengauer Thomas. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14):1986–1994, 2011. [DOI] [PubMed] [Google Scholar]
Tulabandhula Theja and Rudin Cynthia. Robust optimization using machine learning for uncertainty sets. arXiv preprint arXiv:1407.1097, 2014. [Google Scholar]
U.S. Department of Justice - Civil Rights Devision. Investigation of the Baltimore City Police Department, August 2016. Available at https://www.justice.gov/crt/file/883296/download.
van der Laan Mark J. Statistical inference for variable importance. The International Journal of Biostatistics, 2(1), 2006. [Google Scholar]
Ver Hoef Jay M. Who invented the delta method? The American Statistician, 66(2): 124–127, 2012. [Google Scholar]
Wang Huazhen, Yang Fan, and Luo Zhiyuan. An experimental study of the intrinsic stability of random forest variable importance measures. BMC bioinformatics, 17(1):60, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
Williamson Brian D, Gilbert Peter B, Simon Noah, and Carone Marco. Nonparametric variable importance assessment using machine learning techniques. bepress (unpublished preprint), 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yao Jingtao, Teng Nicholas, Poh Hean-Lee, and Tan Chew Lim. Forecasting and analysis of marketing data using neural networks. J. Inf. Sci. Eng, 14(4):843–862, 1998. [Google Scholar]
Zhu Ruoqing, Zeng Donglin, and Kosorok Michael R. Reinforcement learning trees. Journal of the American Statistical Association, 110(512):1770–1784, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R1] Altmann André, Toloşi Laura, Sander Oliver, and Lengauer Thomas. Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10):1340–1347, 2010. [DOI] [PubMed] [Google Scholar]

[R2] Archer Kellie J and Kimes Ryan V. Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4):2249–2260, 2008. [Google Scholar]

[R3] Azen Razia, Budescu David V, and Reiser Benjamin. Criticality of predictors in multiple regression. British Journal of Mathematical and Statistical Psychology, 54(2):201–225, 2001. [DOI] [PubMed] [Google Scholar]

[R4] Beckett Katherine, Nyrop Kris, and Pfingst Lori. Race, drugs, and policing: understanding disparities in drug delivery arrests. Criminology, 44(1):105–137, 2006. [Google Scholar]

[R5] Blair Irene V, Judd Charles M, and Chapleau Kristine M. The influence of afrocentric facial features in criminal sentencing. Psychological science, 15(10):674–679, 2004. [DOI] [PubMed] [Google Scholar]

[R6] Boyd Stephen and Vandenberghe Lieven. Convex Optimization. Cambridge university press, 2004. [Google Scholar]

[R7] Breiman Leo. Random forests. Machine learning, 45(1):5–32, 2001. [Google Scholar]

[R8] Breiman Leo et al. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical science, 16(3):199–231, 2001. [Google Scholar]

[R9] Calle M Luz and Urrea Víctor. Letter to the editor: stability of random forest importance measures. Briefings in bioinformatics, 12(1):86–89, 2010. [DOI] [PubMed] [Google Scholar]

[R10] Chipman Hugh A, George Edward I, McCulloch Robert E, et al. Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010. [Google Scholar]

[R11] Chouldechova Alexandra. Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017. [DOI] [PubMed] [Google Scholar]

[R12] Coker Beau, Rudin Cynthia, and King Gary. A theory of statistical inference for ensuring the robustness of scientific results. arXiv preprint arXiv:1804.08646, 2018. [Google Scholar]

[R13] Corbett-Davies Sam, Pierson Emma, Feller Avi, and Goel Sharad. A computer program used for bail and sentencing decisions was labeled biased against blacks. it’s actually not that clear. The Washington Post, October 2016. URL https://www.washingtonpost.com/news/monkey-cage/wp/2016/10/17/can-an-algorithm-be-racist-our-analysis-is-more-cautious-than-propublicas/?utm_term=.e896ff1e4107. [Google Scholar]

[R14] Corbett-Davies Sam, Pierson Emma, Feller Avi, Goel Sharad, and Huq Aziz. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017. [Google Scholar]

[R15] Datta Anupam, Sen Shayak, and Zick Yair. Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016. [Google Scholar]

[R16] DeLong Elizabeth R, DeLong David M, and Clarke-Pearson Daniel L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3):837–845, 1988. [PubMed] [Google Scholar]

[R17] Demler Olga V, Pencina Michael J, and D’Agostino Ralph B Sr. Misuse of delong test to compare aucs for nested models. Statistics in medicine, 31(23):2577–2587, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] Díaz Iván, Hubbard Alan, Decker Anna, and Cohen Mitchell. Variable importance and prediction methods for longitudinal problems with missing variables. PloS one, 10(3): e0120031, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Dinkelbach Werner. On nonlinear fractional programming. Management science, 13(7): 492–498, 1967. [Google Scholar]

[R20] Dong Jiayun and Rudin Cynthia. Variable importance clouds: A way to explore variable importance for the set of good models. arXiv preprint arXiv:1901.03209, 2019. [Google Scholar]

[R21] Dorfman R. A note on the delta-method for finding variance formulae. The Biometric Bulletin, 1(129–137):92, 1938. [Google Scholar]

[R22] Dwork Cynthia, Hardt Moritz, Pitassi Toniann, Reingold Omer, and Zemel Richard. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012. [Google Scholar]

[R23] Gevrey Muriel, Dimopoulos Ioannis, and Lek Sovan. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological modelling, 160(3):249–264, 2003. [Google Scholar]

[R24] Gregorutti Baptiste, Michel Bertrand, and Saint-Pierre Philippe. Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics & Data Analysis, 90:15–35, 2015. [Google Scholar]

[R25] Gregorutti Baptiste, Michel Bertrand, and Saint-Pierre Philippe. Correlation and variable importance in random forests. Statistics and Computing, 27(3):659–678, 2017. [Google Scholar]

[R26] Hapfelmeier Alexander, Hothorn Torsten, Ulm Kurt, and Strobl Carolin. A new variable importance measure for random forests with missing data. Statistics and Computing, 24 (1):21–34, 2014. [Google Scholar]

[R27] Hastie T, Tibshirani R, and Friedman J. The elements of statistical learning 2nd edition. New York: Springer, 2009. [Google Scholar]

[R28] Heider Karl G. The Rashomon effect: when ethnographers disagree. American Anthropologist, 90(1):73–81, 1988. [Google Scholar]

[R29] Hoeffding Wassily. A class of statistics with asymptotically normal distribution. The annals of mathematical statistics, pages 293–325, 1948. [Google Scholar]

[R30] Hoeffding Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. doi: 10.1080/01621459.1963.10500830. URL https://amstat.tandfonline.com/doi/abs/10.1080/01621459.1963.10500830. [DOI] [Google Scholar]

[R31] Hooker Giles. Generalized functional anova diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics, 16(3):709–732, 2007. [Google Scholar]

[R32] Horst Reiner and Thoai Nguyen V. Dc programming: overview. Journal of Optimization Theory and Applications, 103(1):1–43, 1999. [Google Scholar]

[R33] Kamiran Faisal, Žliobaitė Indrė, and Calders Toon. Quantifying explainable discrimination and removing illegal discrimination in automated decision making. Knowledge and information systems, 35(3):613–644, 2013. [Google Scholar]

[R34] Kazemitabar Jalil, Amini Arash, Bloniarz Adam, and Talwalkar Ameet S. Variable importance using decision trees. In Advances in Neural Information Processing Systems, pages 425–434, 2017. [Google Scholar]

[R35] Kleinberg Jon, Mullainathan Sendhil, and Raghavan Manish. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017. [Google Scholar]

[R36] Larson Jeff, Mattu Surya, Kirchner Lauren, and Angwin Julia. How we analyzed the compas recidivism algorithm. ProPublica, May 2016. URL https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.

[R37] Lecué Guillaume. Interplay between concentration, complexity and geometry in learning theory with applications to high dimensional data analysis. PhD thesis, Université Paris-Est, 2011. [Google Scholar]

[R38] Lehmann Erich L and Casella George. Theory of point estimation. Springer Science & Business Media, 2006. [Google Scholar]

[R39] Letham Benjamin, Letham Portia A, Rudin Cynthia, and Browne Edward P. Prediction uncertainty and optimal experimental design for learning dynamical systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, 26(6):063110, 2016. [DOI] [PubMed] [Google Scholar]

[R40] Louppe Gilles, Wehenkel Louis, Sutera Antonio, and Geurts Pierre. Understanding variable importances in forests of randomized trees. In Advances in neural information processing systems, pages 431–439, 2013. [Google Scholar]

[R41] Lum Kristian and Isaac William. To predict and serve? Significance, 13(5):14–19, 2016. [Google Scholar]

[R42] Meinshausen Nicolai and Bühlmann Peter. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010. [Google Scholar]

[R43] Mentch Lucas and Hooker Giles. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881, 2016. [Google Scholar]

[R44] Monahan John and Skeem Jennifer L. Risk assessment in criminal sentencing. Annual review of clinical psychology, 12:489–513, 2016. [DOI] [PubMed] [Google Scholar]

[R45] Nabi Razieh and Shpitser Ilya. Fair inference on outcomes. In Proceedings of the… AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, volume 2018, page 1931. NIH Public Access, 2018. [PMC free article] [PubMed] [Google Scholar]

[R46] Nevo Daniel and Ritov Ya’acov. Identifying a minimal class of models for high-dimensional data. The Journal of Machine Learning Research, 18(1):797–825, 2017. [Google Scholar]

[R47] Olden Julian D, Joy Michael K, and Death Russell G. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling, 178(3):389–397, 2004. [Google Scholar]

[R48] Park Jaehyun and Boyd Stephen. General heuristics for nonconvex quadratically constrained quadratic programming. arXiv preprint arXiv:1703.07870, 2017. [Google Scholar]

[R49] Paternoster Raymond and Brame Robert. Reassessing race disparities in maryland capital cases. Criminology, 46(4):971–1008, 2008. [Google Scholar]

[R50] Picard-Fritsche Sarah, Rempel Michael, Tallon Jennifer A., Adler Julian, and Reyes Natalie. Demystifying risk assessment, key principles and controversies. Technical report, 2017. Available at https://www.courtinnovation.org/publications/demystifying-risk-assessment-key-principles-and-controversies.

[R51] Pólik Imre and Terlaky Tamás. A survey of the s-lemma. SIAM review, 49(3):371–418, 2007. [Google Scholar]

[R52] Ramchand Rajeev, Pacula Rosalie Liccardo, and Iguchi Martin Y. Racial differences in marijuana-users’ risk of arrest in the united states. Drug and alcohol dependence, 84(3): 264–272, 2006. [DOI] [PubMed] [Google Scholar]

[R53] Recknagel Friedrich, French Mark, Harkonen Pia, and Yabunaka Ken-Ichi. Artificial neural network approach for modelling and prediction of algal blooms. Ecological Modelling, 96 (1–3):11–28, 1997. [Google Scholar]

[R54] Roth Wendy D and Mehta Jal D. The Rashomon effect: combining positivist and interpretivist approaches in the analysis of contested events. Sociological Methods & Research, 31(2):131–173, 2002. [Google Scholar]

[R55] Rudin Cynthia. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1:206–215, May 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R56] Rudin Cynthia, Wang Caroline, and Coker Beau. The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2019. accepted. [Google Scholar]

[R57] Scardi Michele and Harding Lawrence W. Developing an empirical model of phytoplankton primary production: a neural network case study. Ecological modelling, 120(2):213–223, 1999. [Google Scholar]

[R58] Semenova Lesia and Rudin Cynthia. A study in rashomon curves and volumes: a new perspective on generalization and model simplicity in machine learning. arXiv preprint arXiv:1908.01755, 2019. [Google Scholar]

[R59] Serfling Robert J. Approximation theorems of mathematical statistics. John Wiley & Sons, 1980. [Google Scholar]

[R60] Spohn Cassia. Thirty years of sentencing reform: the quest for a racially neutral sentencing process. Criminal justice, 3:427–501, 2000. [Google Scholar]

[R61] Statnikov Alexander, Lytkin Nikita I, Lemeire Jan, and Aliferis Constantin F. Algorithms for discovery of multiple markov boundaries. Journal of Machine Learning Research, 14 (Feb):499–566, 2013. [PMC free article] [PubMed] [Google Scholar]

[R62] Strobl Carolin, Boulesteix Anne-Laure, Zeileis Achim, and Hothorn Torsten. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC bioinformatics, 8(1):25, 2007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R63] Strobl Carolin, Boulesteix Anne-Laure, Kneib Thomas, Augustin Thomas, and Zeileis Achim. Conditional variable importance for random forests. BMC bioinformatics, 9 (1):307, 2008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R64] Stuart Elizabeth A. Matching methods for causal inference: a review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics, 25(1):1, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R65] Toloşi Laura and Lengauer Thomas. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14):1986–1994, 2011. [DOI] [PubMed] [Google Scholar]

[R66] Tulabandhula Theja and Rudin Cynthia. Robust optimization using machine learning for uncertainty sets. arXiv preprint arXiv:1407.1097, 2014. [Google Scholar]

[R67] U.S. Department of Justice - Civil Rights Devision. Investigation of the Baltimore City Police Department, August 2016. Available at https://www.justice.gov/crt/file/883296/download.

[R68] van der Laan Mark J. Statistical inference for variable importance. The International Journal of Biostatistics, 2(1), 2006. [Google Scholar]

[R69] Ver Hoef Jay M. Who invented the delta method? The American Statistician, 66(2): 124–127, 2012. [Google Scholar]

[R70] Wang Huazhen, Yang Fan, and Luo Zhiyuan. An experimental study of the intrinsic stability of random forest variable importance measures. BMC bioinformatics, 17(1):60, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R71] Williamson Brian D, Gilbert Peter B, Simon Noah, and Carone Marco. Nonparametric variable importance assessment using machine learning techniques. bepress (unpublished preprint), 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R72] Yao Jingtao, Teng Nicholas, Poh Hean-Lee, and Tan Chew Lim. Forecasting and analysis of marketing data using neural networks. J. Inf. Sci. Eng, 14(4):843–862, 1998. [Google Scholar]

[R73] Zhu Ruoqing, Zeng Donglin, and Kosorok Michael R. Reinforcement learning trees. Journal of the American Statistical Association, 110(512):1770–1784, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously

Aaron Fisher

Cynthia Rudin

Francesca Dominici

Abstract

1. Introduction

2. Notation & Technical Summary

2.1. Summary of Rashomon Sets & Model Class Reliance

Figure 1:

Figure 2:

3. Model Reliance

3.1. Estimating Model Reliance with U-statistics, and Connections to Permutation-based Variable Importance

3.2. Limitations of Existing Variable Importance Methods

4. Model Class Reliance

4.1. Motivating Empirical Estimators of MCR by Deriving Finite-sample Bounds

Figure 3:

5. Extensions of Rashomon Sets Beyond Variable Importance

5.1. Finite-sample Confidence Intervals from Rashomon Sets

5.2. Related Literature on the Rashomon Effect

6. Calculating Empirical Estimates of Model Class Reliance

Table 1:

Figure 8:

6.1. Binary Search for Empirical MR Lower Bound

Figure 4:

Figure 5:

6.2. Binary Search for Empirical MR Upper Bound

6.3. Convex Models

7. MR & MCR for Linear Models, Additive Models, and Regression Models in a Reproducing Kernel Hilbert Space

7.1. Interpreting and Computing MR for Linear or Additive Models

7.2. Computing Empirical MCR for Linear Models

7.3. Regularized Linear Models

7.3.1. Calculating MCR

7.3.2. Upper Bounding the Loss

7.4. Regression Models in a Reproducing Kernel Hilbert Space (RKHS)

7.4.1. Calculating MCR

7.4.2. Upper Bounding the Loss

8. Connections Between MR and Causality

8.1. Model Reliance and Causal Effects

8.2. Conditional Importance: Adjusting for Dependence Between X1 and X2

8.2.1. Estimation of CMR by Weighting, Matching, or Imputation

9. Simulations

9.1. Illustrative Toy Example with Simulated Data

Figure 6:

9.2. Simulations of Bootstrap Confidence Intervals

9.2.1. Results

Figure 7:

10. Data Analysis: Reliance of Criminal Recidivism Prediction Models on Race and Sex

10.1. Results

10.2. Discussion & Limitations

11. Conclusion

Acknowledgments

Appendix A. Miscellaneous Supplemental Sections

A.1. Code

A.2. Model Reliance Less than 1

A.3. Relating e^switch(f) to All Possible Permutations of the Sample

A.4. Bound for MR of the Best-in-class Prediction Model

A.5. Ratios versus Differences in MR Definition

A.6. Rashomon Sets and Profile Likelihood Intervals

A.7. Unbiased Estimates of CMR

Appendix B. Proofs for Statistical Results

B.1. Lemma Relating Empirical and Population Rashomon Sets

B.2. Lemma to Transform Between Bounds

B.3. Proof of Theorem 4

B.3.1. Step 1: Show that P[MR^(f+,ϵ)≤MCR^+(ϵOUT)]≥1−δ3.

B.3.2. Step 2: Conditional on MR^(f+,ϵ)≤MCR^+(ϵout), Upper Bound MR(f+,ϵ) by MCR^+(ϵout) Added to an Error Term.

B.3.3. Step 3: Probabilistically Bound the Error Term from Step 2.

B.3.4. Step 4: Combine Results to Show Eq 4.2

B.4. Proof of Corollary 22

B.5. Proof of Theorems 5 & 6

B.5.1. Proof of Theorem 5, and Other Limits on Estimation Error, Based on Covering Number

B.5.2. Proof of Eq B.22

B.5.3. Proof of Eq B.23

B.5.4. Proof of Eq B.24

B.5.5. Proof for Eq B.25

B.5.6. Proof for Eq B.26

B.5.7. Implementing Theorem 25 to Show Theorem 6

B.6. Proof of Proposition 7, and Corollary for a Unique Best-in-class Model.

B.6.1. Proof of Proposition 7

B.6.2. Corollary for a Unique Best-in-Class Model

8.2. Conditional Importance: Adjusting for Dependence Between X₁ and X₂

A.3. Relating ${\hat{e}}_{switch} (f)$ to All Possible Permutations of the Sample

B.3.1. Step 1: Show that $P [\hat{M R} (f_{+, ϵ}) \leq {\hat{M C R}}_{+} (ϵ_{OUT})] \geq 1 - \frac{δ}{3} .$

B.3.2. Step 2: Conditional on $\hat{M R} (f_{+, ϵ}) \leq {\hat{M C R}}_{+} (ϵ_{out})$ , Upper Bound MR(f_+,ϵ) by ${\hat{M C R}}_{+} (ϵ_{out})$ Added to an Error Term.

C.1.1. Part 1: Showing Eq 6.1 Holds for All $f \in F$ Satisfying ê_orig(f) ≤ ϵ_abs.

C.1.2. Part 2: Showing that, if $f = {\hat{g}}_{-, γ}$ , and at Least One of the Inequalities in Condition 8 Holds with Equality, then Eq 6.1 Holds with Equality.

C.2.1. Part 1: ${\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})$ is Monotonically Increasing in γ.

C.2.2. Part 2: ${\hat{e}}_{ORIG} ({\hat{g}}_{-, γ})$ is Monotonically Decreasing in γ.

C.2.3. Part 3: ${\frac{{\hat{h}}_{-, γ} ({\hat{g}}_{-, γ})}{ϵ_{ABS}} - γ}$ is Monotonically Decreasing in γ in the Range Where ${\hat{e}}_{ORIG} ({\hat{g}}_{-, γ}) \leq ϵ_{ABS}$ , and Increasing Otherwise.

C.4.1. Part 1: Showing Eq 6.4 Holds for All $f \in F$ Satisfying ${\hat{e}}_{ORIG} (f) \leq ϵ_{ABS}$

C.4.2. Part 2: Showing that if f = ĝ_+,γ, and at Least One of the Enequalities in Condition 12 Holds with Equality, then Eq 6.4 Holds with Equality.

C.5.1. Part 1: ${\hat{h}}_{+, γ} ({\hat{g}}_{+, γ})$ is Monotonically Increasing in γ.

C.5.2. Part 2: ${\hat{e}}_{ORIG} ({\hat{g}}_{+, γ_{2}})$ is Monotonically Decreasing in γ for γ ≤ 0, and Condition 12 Holds for γ = 0 and $ϵ_{ABS} \geq \min_{f \in F} {\hat{e}}_{ORIG} (f)$ .