A Path to Simpler Models Starts With Noise

Lesia Semenova; Harry Chen; Ronald Parr; Cynthia Rudin

. Author manuscript; available in PMC: 2024 Apr 4.

Published in final edited form as: Adv Neural Inf Process Syst. 2023 Dec;36:3362–3401.

A Path to Simpler Models Starts With Noise

Lesia Semenova ¹, Harry Chen ¹, Ronald Parr ¹, Cynthia Rudin ¹

PMCID: PMC10993912 NIHMSID: NIHMS1981181 PMID: 38577617

Abstract

The Rashomon set is the set of models that perform approximately equally well on a given dataset, and the Rashomon ratio is the fraction of all models in a given hypothesis space that are in the Rashomon set. Rashomon ratios are often large for tabular datasets in criminal justice, healthcare, lending, education, and in other areas, which has practical implications about whether simpler models can attain the same level of accuracy as more complex models. An open question is why Rashomon ratios often tend to be large. In this work, we propose and study a mechanism of the data generation process, coupled with choices usually made by the analyst during the learning process, that determines the size of the Rashomon ratio. Specifically, we demonstrate that noisier datasets lead to larger Rashomon ratios through the way that practitioners train models. Additionally, we introduce a measure called pattern diversity, which captures the average difference in predictions between distinct classification patterns in the Rashomon set, and motivate why it tends to increase with label noise. Our results explain a key aspect of why simpler models often tend to perform as well as black box models on complex, noisier datasets.

1. Introduction

It is possible that for many datasets, a simple predictive model can perform as well as the best black box model – we simply have not found the simple model yet. Interestingly, however, we may be able to infer whether such a simple model exists without finding it first, and we may be able to determine conditions under which such simple models are likely to exist.

We already know that many datasets exhibit the “Rashomon effect” [7], which is that for many real-world tabular datasets, many models can describe the data equally well. If there are many good models, it is more likely that at least one of them is simple (e.g., sparse) [39]. Thus, a key question in determining the existence of simpler models is to understand why and when the Rashomon effect happens. This is a difficult question, and there has been little study of it. The literature on the Rashomon effect has generally been more practical, showing either that the Rashomon effect often exists in practice [39, 13, 45], showing how to compute or visualize the set of good models for a given dataset [52, 14, 15, 1, 32, 53, 54, 48], or trying to reduce underspecification by learning a diverse ensemble of models [25, 37]. However, no prior works have focused on understanding what causes this phenomenon in the first place.

Our thesis is that noise is both a theoretical and practical motivator for the adoption of simpler models. Specifically, in this work, we refer to noise in the generation process that determines the labels. In noisy problems, the label is more difficult to predict. Data about humans, such as medical data or criminal justice data, are often noisy because many things worth predicting (such as whether someone will commit a crime within 2 years of release from prison, or whether someone will experience a medical condition within the next year) have inherent randomness that is tied to random processes in the world (Will the person get a new job? How will their genetics interact with their diet?). It might sound intuitive that noisy data would lead to simpler models being useful, but this is not something most machine learning practitioners have internalized – even on noisy datasets, they often use complicated, black box models, to which post-hoc explanations are added. Our work shows how practitioners who understand the bias-variance trade-off naturally gravitate towards more interpretable modes in the presence of noise.

We propose a path which begins with noise, is followed by decisions made by human analysts to compensate for that noise, and that ultimately leads to simpler models. In more detail, our path follows these steps: 1) Noise in the world leads to increased variance of the labels. 2) Higher label variance leads to worse generalization (larger differences between training and test/validation performance). 3) Poor generalization from the training set to the validation set is detected by analysts on the dataset using techniques such as cross-validation. As a result, the analyst compensates for anticipated poor test performance in a way that follows statistical learning theory. Specifically, they choose a simpler hypothesis space, either through soft constraints (i.e., increasing regulation), hard constraints (explicit limits on model complexity, or model sparsification), or by switching to a simpler function class. Here, the analyst may lose performance on the training set but gain validation and test performance. 4) After reducing the complexity of the hypothesis space, the analyst’s new hypothesis space has a larger Rashomon ratio than their original hypothesis space. The Rashomon ratio is the fraction of models in the function class that perform close to the empirical risk minimizer. It is the fraction of functions that performs approximately-equally-well to the best one. This set of “good” functions is called the Rashomon set, and the Rashomon ratio measures the size of the Rashomon set relative to the function class. This argument (that lower complexity function classes leads to larger Rashomon ratios) is not necessarily intuitive, but we show it empirically for 19 datasets. Additionally, we prove this holds for decision trees of various depths under natural assumptions. The argument boils down to showing that the set of non-Rashomon set models grows exponentially faster than the set of models inside the Rashomon set. As a result, since the analyst’s hypothesis space now has a large Rashomon ratio, a relatively large fraction of models that are left in the simpler hypothesis are good, meaning they perform approximately as well as the best models in that hypothesis space. From that large set, the analyst may be able to find even a simpler model from a smaller space that also performs well, following the argument of Semenova et al. [39]. As a reminder, in Step 3 the analysts discovered that using a simpler model class improves test performance. This means that these simple models attain test performance that is at least that of the more complex (often black box) models from the larger function class they used initially.

In this work, we provide the mathematics and empirical evidence needed to establish this path, focusing on Steps 1, 2, and 4 because Step 3 follows directly (however, we provide empirical evidence for Step 3 as well). Moreover, for the case of ridge regression with additive attribute noise, we prove directly that adding noise to the dataset results in an increased Rashomon ratio. Specifically, the additive noise acts as $ℓ_{2}$ -regularization, thus it reduces the complexity of the hypothesis space (Step 3) and causes the Rashomon ratio to grow (Step 4).

Even if the analyst does not reduce the hypothesis space in Step 3, noise still gives us larger Rashomon sets. We show this by introducing pattern diversity, the average Hamming distance between all classification patterns produced by models in the Rashomon set. We show that under increased label noise, the pattern diversity tends to increase, which implies that when there is more noise, there are more differences in model predictions, and thus, there could be more models in the Rashomon set. Hence, a much shorter version of the path also works: Noise in the world causes an increase in pattern diversity, which means there are more diverse models in the Rashomon set, including simple ones.

It is becoming increasingly common to demand interpretable models for high-stakes decision domains (criminal justice, healthcare, etc.) for policy reasons such as fairness or transparency. Our work is possibly the first to show that the noise inherent in many such domains leads to technical justifications for demanding such models.

2. Related Work

Rashomon set.

The Rashomon set, named after the Rashomon effect coined by Leo Breiman [7], is based on the observation that often there are many equally good explanations of the data. When these are contradictory, the Rashomon effect gives rise to predictive multiplicity [30, 6, 18]. Rashomon sets have been used to study variable importance [15, 14, 42], for characterizing fairness [40, 9, 2], to improve robustness and generalization, especially under distributional shifts [37, 25], to study connections between multiplicity and counterfactual explanations [36, 53, 8], and to help in robust decision making [46]. Some works focused on trying to compute the Rashomon set for specific hypothesis spaces, such as sparse decision trees [52], generalized additive models [54], and decision lists [32]. Other works focus on near-optimality to find a diverse set of solutions to mixed integer problems [1], a set of targeted predictions under a Bayesian model [23], or estimate the Rashomon volume via approximating model in Reproducing Kernel Hilbert Space [31]. Black et al. [6] shows that the predictive multiplicity metric defined as expected pairwise disagreement increases with expected variance over the models in the Rashomon set. On the contrary, we focus on probabilistic variance in labels in the presence of noise.

Metrics of the Rashomon set.

To characterize the Rashomon set, multiple metrics have been proposed [39, 38, 30, 49, 18, 6]. The Rashomon ratio [39], and the pattern Rashomon ratio [38] measure the Rashomon set as a fraction of models or predictions within the hypothesis space; ambiguity and discrepancy [30, 49] indicate the number of samples that received conflicting estimates from models in the Rashomon set; Rashomon capacity [18] measures the Rashomon set for probabilistic outputs. Here, we focus on the Rashomon ratio and pattern Rashomon ratio. We also introduce pattern diversity. Pattern diversity is close to expected pairwise disagreement (as in Black et al. [6]), however, it uses unique classification patterns (see Appendix G).

Learning with noise.

Learning with noisy labels has been extensively studied [34], especially for linear regression [5] and, more recently, for neural networks [44] to understand and model effects of noise. Stochastic gradient descent with label noise acts as an implicit regularizer [12] and noise has been added to hidden units [35], labels [41], or covariances [50] to prevent overfitting in deep learning. When the labels are noisy, constructing robust loss [16], adding a slack variable for each training sample [19], or early stopping [27] also helps to improve generalization. In this work, we study why simpler models are often suitable for noisier datasets from the perspective of the Rashomon effect.

3. Notation and Definitions

Consider a training set of $n$ data points $S = \{z_{1}, z_{2}, \dots, z_{n}\}$ , such that each $z_{i} = (x_{i}, y_{i})$ is drawn i.i.d. from an unknown distribution $𝒟$ , where $𝒳 \subset R^{m}$ , and we have binary labels $𝒴 \in {- 1, 1}$ . Denote $ℱ$ as a hypothesis space, where $f \in ℱ$ obeys $f : 𝒳 \to 𝒴$ . Let $ϕ : 𝒴 \times 𝒴 \to R^{+}$ be a 0–1 loss function, where for point $z = (x, y)$ and hypothesis $f$ , the loss function is $ϕ (f (x), y) = 1_{[f (x) \neq y]}$ . Finally, let $\hat{L} (f)$ be an empirical risk $\hat{L} (f) = \frac{1}{n} \sum_{i = 1}^{n} ϕ (f (x_{i}), y_{i})$ , and let $\hat{f}$ be an empirical risk minimizer: $\hat{f} \in a r g {m i n}_{f \in ℱ} \hat{L} (f)$ . If we want to specify the dataset on which $\hat{f}$ was computed, we will indicate it by an index, ${\hat{f}}_{S}$ .

The Rashomon set contains all models that achieve near-optimal performance and can be defined as:

Definition 1 (Rashomon set). For dataset $S$ , a hypothesis space $ℱ$ , and a loss function $ϕ$ , given $θ \geq 0$ , the Rashomon set ${\hat{R}}_{set} (ℱ, θ)$ is:

{\hat{R}}_{set} (ℱ, θ) : = \{f \in ℱ : \hat{L} (f) \leq \hat{L} (\hat{f}) + θ\},

where $\hat{f}$ is an empirical risk minimizer for the training data $S$ with respect to loss function $ϕ : \hat{f} \in a r g {m i n}_{f \in ℱ} \hat{L} (f)$ , and $θ$ is the Rashomon parameter.

Rashomon parameter $θ$ determines the risk threshold $(\hat{L} (\hat{f}) + θ)$ , such that all models with risk lower than this threshold are inside the set. For instance, if we stay within 1% of the accuracy of the best model, then $θ = 0.01$ . Given parameter $γ > 0$ , we extend the definition to the true risk by defining the true Rashomon set, containing all models with a bound on true risk $R_{set} (ℱ, γ) = {f \in ℱ : L (f) \leq L (f^{*}) + γ\}$ , where $f^{*}$ is optimal model, $f^{*} = a r g {m i n}_{f \in ℱ} L (f)$ .

In this work, we study how noise influences the Rashomon set and choices that practitioners make in the presence of noise. We measure the Rashomon set in different ways, including the Rashomon ratio, pattern Rashomon ratio, and pattern diversity (defined in Section 6). For a discrete hypothesis space, the Rashomon ratio [39] is the ratio of the number of models in the Rashomon set compared to the hypothesis space, ${\hat{R}}_{ratio} (ℱ, θ) = \frac{|{\hat{R}}_{set} (ℱ, θ)|}{| ℱ |}$ , where |·| denotes cardinality. It is possible to weight the hypothesis space by a prior to define a weighted Rashomon ratio if desired.

Given a hypothesis $f$ and a dataset $S$ , a predictive pattern (or pattern) $p$ is the collection of outcomes from applying $f$ to each sample from $S : p^{f} = [f (x_{1}), \dots, f (x_{i}), \dots, f (x_{n})]$ . We say that pattern $p$ is achievable on the Rashomon set if there exists $f \in {\hat{R}}_{set} (ℱ, θ)$ such that $p^{f} = p$ . Let pattern Rashomon set $π (ℱ, θ) = \{p^{f} : f \in {\hat{R}}_{set} (ℱ, θ), p^{f} = {[f (x_{i})]}_{i = 1}^{n}\}$ be all unique patterns achievable by functions from ${\hat{R}}_{set} (ℱ, θ)$ on dataset $S$ . Finally, let $Ψ (ℱ)$ be the pattern hypothesis set, meaning that it contains all patterns achievable by models in the hypothesis space, $Ψ (ℱ) = \{p^{f} : f \in ℱ, p^{f} = {[f (x_{i})]}_{i = 1}^{n}\}$ . The pattern Rashomon ratio is the ratio of patterns in the pattern Rashomon set to the pattern hypothesis set: ${\hat{R}}_{ratio}^{pat} (ℱ, θ) = \frac{| π (ℱ, θ) |}{| Ψ (ℱ) |}$ .

In the following sections, we walk along the steps of our proposed path. Rather than trying to prove these points for every possible situation (which would be volumes beyond what we can handle here), we aim to find at least some way to illustrate that each step is reasonable in a natural setting.

4. Increase in Variance due to Noise Leads to Larger Rashomon Ratios

4.1. Step 1. Noise Increases Variance

One would think that something as simple as uniform label noise would not really affect anything in the learning process. In fact, we would expect that adding such noise would just uniformly increase the losses of all functions, and the Rashomon set would stay the same. However, this conclusion is (surprisingly) not true. Instead, noise adds variance to the loss, which, in turn, prevents us from generalizing.

For infinite data distribution $𝒟$ consider uniform label noise, where each label is flipped independently with probability $ρ < \frac{1}{2}$ . If $\tilde{y}$ is a flipped label, $P (\tilde{y} \neq y) = ρ$ . If the empirical risk of $f$ is over $\frac{1}{2}$ after adding noise, we transform $f$ to $- f$ . For a given model $f \in ℱ$ let $σ^{2} (f, 𝒟)$ be the variance of the loss, meaning that $σ^{2} (f, 𝒟) = {V a r}_{z ~ 𝒟} l (f, z)$ . We show in the following theorem that, for a given $f \in ℱ$ , label noise increases the variance of the loss.

Theorem 2 (Variance increases with label noise). Consider infinite true data distribution $𝒟$ , and uniform label noise, where each label is flipped independently with probability $ρ$ . Let $𝒟_{ρ}$ denote the noisy version of $𝒟$ . Consider 0–1 loss $l$ , and assume that there exists at least one function $\overline{f} \in ℱ$ such that $L_{𝒟} (\overline{f}) < \frac{1}{2} - γ$ . For a fixed $f \in ℱ$ , let $σ^{2} (f, 𝒟_{ρ})$ be the variance of the loss, $σ^{2} (f, 𝒟_{ρ}) = {V a r}_{z ~ 𝒟_{ρ}} l (f, z)$ on data distribution $𝒟_{ρ}$ . For any $0 < ρ_{1} < ρ_{2} < \frac{1}{2}$ ,

σ^{2} (f, 𝒟_{ρ_{1}}) < σ^{2} (f, 𝒟_{ρ_{2}}) .

The proof of Theorem 2 is in Appendix A This covers the uniform noise case, but variance increases more generally, and we prove this for several other common cases in Appendix A. More specifically, we show that the variance increases with other types of label noise, such as non-uniform label noise (see Theorem 12 in Appendix A) and margin noise (see Theorem 15 in Appendix A). For non-uniform label noise, for a sample $z = (x, y)$ , each label $y$ is flipped independently with probability $ρ_{x}$ , meaning that noise can depend on $x$ . This noise model is more realistic than uniform label noise and allows modeling of cases when one sub-population has much more noise than another. We model margin noise such as that which arises from high-dimensional Gaussians. Because of the central limit theorem, data often follow Gaussian distributions, therefore this noise is realistic and models mistakes near decision boundary. Label noise in datasets is common. In fact, real-world datasets reportedly have between 8.0% and 38.5% label noise [44, 43, 24, 28, 51]. We hypothesize that a significant amount of label noise in real-world datasets is a combination of Gaussian (due to the central limit theorem) and random noise (for example, because of clerical errors causing label noise).

For the true Rashomon set $R_{set} (ℱ, γ)$ , we consider the maximum variance for all models in the true set: $σ^{2} = {s u p}_{f \in R_{s e t} (ℱ, γ)} {V a r}_{z ~ 𝒟} l (f, z)$ . Then, from Theorem 2 we have that maximum expected variance over the Rashomon set increases with noise.

Corollary 3 (Maximum variance increases with label noise). Under the same assumptions as in Theorem 2, we have that

\underset{f \in R_{{set}_{𝒟_{ρ_{1}}}} (ℱ, γ)}{s u p} σ^{2} (f, 𝒟_{ρ_{1}}) < \underset{f \in R_{{set}_{𝒟_{ρ_{2}}}} (ℱ, γ)}{s u p} σ^{2} (f, 𝒟_{ρ_{2}}) .

The next step is to show that this increased maximum variance leads to worse generalization.

4.2. Step 2. Higher Variance Leads to Worse Generalization

Here we use an argument based on generalization bounds. Generalization bounds have been the key theoretical motivation for much of machine learning, including support vector machines (SVMs), because the margin that SVMs optimize appears in a bound. While bounds themselves are not directly used in practice, the terms in the bounds tend to be important quantities in practice. Our bound cannot be calculated in practice because it uses population information on the right side, but it still provides insight and motivation.

Unlike standard bounds, we will use the fact that the user is using empirical risk minimization, and cross-validation to assess overfitting. Thus, for $\hat{f}$ , there are two possibilities: $\hat{f}$ is in the true Rashomon set, or it is not. If it is not, then for the empirical risk minimizer $\hat{f}$ , the difference between the true and the empirical risk must be at least $γ$ , which will be detected with high probability in cross-validation [20, 33]. In that case, the user will reduce their hypothesis space and we move to Step 3. If $\hat{f}$ is in the true Rashomon set, it obeys the following bound.

Theorem 4 (Variance-based “generalization bound”). Consider dataset $S$ , 0–1 loss $l$ , and finite hypothesis space $ℱ$ . With probability at least $1 - δ$ , we have that for every $f \in R_{set} (ℱ, γ)$ :

L (f) - \hat{L} (f) \leq \frac{2}{3 n} log (\frac{|R_{set} (ℱ, γ)|}{δ}) + \sqrt{\frac{2 σ^{2}}{n} log (\frac{|R_{set} (ℱ, γ)|}{δ})},

(1)

where $σ^{2} = {s u p}_{f \in R_{s e t} (ℱ, γ)} {V a r}_{z ~ 𝒟} l (f, z)$ , and $n$ is number of samples in $S = {\{z_{i}\}}_{i = 1}^{n} ~ 𝒟$ .

The proof of Theorem 4 is in Appendix C. Note that generalization bounds are usually based on Hoeffding’s inequality (see Lemma 17), which is a special case of Bernstein’s inequality (see Lemma 16). In fact, we show in Appendix B that Bernstein’s inequality, which we used to prove Theorem 4, can be sharper than Hoeffding’s when the variance is less than $σ_{f}^{2} < \frac{1}{12}$ for a given $f$ . Theorem 4 is easily generalized to continuous hypothesis spaces through a covering argument over the true Rashomon set (as an example, see Theorem 20), where the complexity is measured as the size of the cover over the true Rashomon set instead of the number of models in the true Rashomon set.

Let $c (ℱ, n) = \frac{2}{3 n} log (\frac{|R_{set} (ℱ, γ)|}{δ})$ , which is the first term in the bound (1) in Theorem 4. According to Theorem 8 in Semenova et al. [39] under random label noise, the true Rashomon set does not decrease in size. Therefore, $c (ℱ, n)$ at least does not decrease with more noise as it depends only on complexity. However, the second term $\sqrt{3 σ^{2} c (ℱ, n)}$ depends on the maximum loss variance, which increases with label noise, as motivated in the previous section. This means with more noise in the labels, we would expect worse generalization, which would generally lead practitioners who are using a validation set to reduce the complexity of the hypothesis space.

As discussed earlier, we use cross-validation to assess whether $\hat{f}$ overfits. From Proposition 5 (proved in Appendix D) we infer that if the empirical risk minimizer (ERM) does not highly overfit, it has a high chance to be in the true Rashomon set and thus Theorem 4 applies.

Proposition 5 (ERM can be close to the true Rashomon set). Assume that through the cross-validation process, we can assess $ξ$ such that $L (\hat{f}) - \hat{L} (\hat{f}) \leq ξ$ with high probability (at least $1 - ϵ_{ξ}$ ) with respect to the random draw of data. Then, for any $ϵ > 0$ , with probability at least $1 - e^{- 2 n ϵ^{2}} - ϵ_{ξ}$ with respect to the random draw of training data, when $ξ + ϵ \leq γ$ , then $\hat{f} \in R_{set} (ℱ, γ)$ .

4.3. Step 3. Practitioner Chooses a Simpler Hypothesis Space

We have shown earlier that noisier datasets lead to higher variance and worse generalization. The question we consider here is whether one can see the results of these bounds in practice and would actually reduce the hypothesis space. For example, consider four real-world datasets and the hypothesis space of decision trees of various depths. In Figure 1 (a) we show that as label noise increases, so does the gap between risks (1– accuracy) evaluated on the training set and on a hold out dataset for a fixed depth tree, and thus, as a result, during the validation process, the smaller depth would be chosen by a reasonable analyst. We simulate this by using cross validation to select the optimal tree depth in CART, as shown in Figure 1 (b). As predicted, the optimal tree depth decreases as noise increases. We also show that a similar trend happens for the gradient boosted trees in Figure 7 in Appendix K.2. More specifically, with more noise, the best number of estimators, as chosen based on cross-validation, decreases. We describe the setup in detail in Appendix K.2.

Figure 1: — Practitioner’s validation process in the presence of noise for CART. For a fixed tree depth, as we add noise, the gap between training and validation accuracy increases (Subfigure a). As we use cross validation to select tree depth, the best tree depth decreases with noise (Subfigure b).

4.4. Step 4. Rashomon Ratio is Larger for Simpler Spaces

For simpler hypothesis spaces, it may not be immediately obvious that the Rashomon ratio is larger, i.e., a larger fraction of a simpler model class consists of “good” models. Intuitively, this is because the denominator of the ratio (the total number of models) increases faster than the numerator (the number of good models) as the complexity of the model class increases. As we will see, this is because the good models in the simpler model class tend to give rise to more bad models in the more complex class, and the bad models do not tend to give rise to good models as much. We explore two popular model classes: decision trees and linear models. The results thus extend to forests (collections of trees) and generalized additive models (which are linear models in enhanced feature space).

Consider data that live on a hypercube (e.g., the data have been binarized, which is a common pre-processing step [29, 3, 52, 47]) and a hypothesis space of fully grown trees (complete trees with the last level also filled) of a given depth $d$ . Denote this hypothesis space as $ℱ_{d}$ . For example, a depth 1 tree has 1 node and 2 leaves. Under natural assumptions on the quality of features of classifiers and the purity of the leaves of trees in the Rashomon set, we show that the Rashomon ratio is larger for hypothesis spaces of smaller-depth trees.

Proposition 6 (Rashomon ratio is larger for decision trees of smaller depth). For a dataset $S = X \times Y$ with binary feature matrix $X \in {0, 1}^{n \times m}$ , consider a hypothesis space $ℱ_{d}$ of fully grown trees of depth $d$ . Let the number of dimensions $m < 2^{2^{d}}$ . Assume: (Leaves are correct) all leaves in all trees in the Rashomon set have at least $⌈ θ n ⌉$ more correctly classified points than incorrectly classified points; (Bad features) there is a set of $m_{b a d} \geq d$ “bad” features such that the empirical risk minimizer of models using only the bad features is not in the Rashomon set. Then ${\hat{R}}_{ratio} (ℱ_{d + 1}, θ) < {\hat{R}}_{ratio} (ℱ_{d}, θ)$ .

The proof of Proposition 6 is in Appendix E. Both assumptions are typically satisfied in practice.

To demonstrate our point, we computed the Rashomon ratio and pattern Rashomon ratio for 19 different datasets for hypothesis spaces of decision trees and linear models of different complexity (see Figure 2). As the complexity of the hypothesis space increases, we see an obvious decrease in the Rashomon ratio and pattern Rashomon ratio.

Figure 2: — Calculation showing that the Rashomon ratio (a) and pattern Rashomon ratio (b) decrease for the hypothesis space of decision trees of fixed depth from 1 to 7 for 14 different datasets (a) and for the hypothesis space of linear models of sparsity from 1 to 4 for 6 different datasets (b). Each line represents a different dataset, each dot represents the log of the Rashomon ratio or pattern Rashomon ratio. Both ratios decrease as we move to a more complex hypothesis space.

To compute the Rashomon ratio, for trees, we use TreeFARMS [52], which allows us to enumerate the whole Rashomon set for sparse trees. For linear models, we design a two-step approach that allows us to compute all patterns in the Rashomon set. First, for every sample $z_{i}$ we solve a simple optimization problem that checks if there exists a model in the Rashomon set that misclassifies this sample. If there is no such model, changing the label of $z_{i}$ will not add more patterns to the Rashomon set, therefore, we can ignore $z_{i}$ . In the second step, we grow a search tree and bound all paths that will lead to patterns outside of the Rashomon set. We describe our approach for pattern computation in detail in Appendix K.3 and discuss the experimental setup in Appendix K.4.

Completion of the Path.

After reducing the complexity of the hypothesis space in Step 3, the practitioner finds a larger Rashomon ratio for the newly chosen lower complexity hypothesis space according to Step 4. As a reminder of the thesis of Semenova et al. [39], with large Rashomon ratios, there are many good models, among which may exist even simpler models that perform well. Thus, the path, starting from noise, is a powerful way to explain what we see in practice, which is that simple models often perform well [17].

Note that to follow the identified path, the machine learning practitioner does not need to know the exact amount of noise. As long as they suspect some noise present in the dataset, the results of this paper apply, and the practitioner would expect a good performance from simpler models.

5. Rashomon Ratio for Ridge Regression Increases under Additive Attribute Noise

For linear regression, adding multiplicative or additive noise to the training data is known to be equivalent to regularizing the model parameters [5]. Moreover, the more noise is added, the stronger this regularization is. Thus, noise leads directly to Step 3 and a choice of a smaller (simpler) hypothesis space. Also, for ridge regression, the Rashomon volume (the numerator of the Rashomon ratio) can be computed directly [39] and depends on the regularization parameter. Building upon these results, we prove that noise leads to an increase of the Rashomon ratio (as in Step 4 of the path).

Given dataset $S$ , the ridge regression model is learned by minimizing the penalized sum of squared errors: $\hat{L} (ω) = {\hat{L}}_{L S} (ω) + C ω^{T} ω$ , where ${\hat{L}}_{L S} (ω) = \frac{1}{n} \sum_{i = 1}^{n} {(x_{i}^{T} ω - y_{i})}^{2}$ is the least squares loss, $C$ is a regularization parameter, and $ω \in R^{m}$ is a parameter vector for a linear model $f = ω^{T} x$ .

We will assume that there is a maximum loss value ${\hat{L}}_{max}$ , such that any linear model that has higher regularized loss than ${\hat{L}}_{max}$ is not being considered within the hypothesis space. For instance, an upper bound for ${\hat{L}}_{max}$ is the value of the loss at the model that is identically $\bar{0}$ , namely $\hat{L} (\bar{0}) = \frac{1}{n} \sum_{i} y_{i}^{2}$ . Thus, for every reasonable model $f = ω^{T} x, f \in ℱ \equiv \hat{L} (ω) \leq {\hat{L}}_{max}$ . On the other hand, the best possible value of the least squares loss is ${\hat{L}}_{L S} = 0$ , and therefore we get that $C w^{T} w \leq {\hat{L}}_{max}$ , or alternatively, $w^{T} w \leq {\hat{L}}_{max} / C$ . This defines the hypothesis space as an $ℓ_{2}$ -norm ball in $m$ -dimensional space, the volume of which we can compute.

To measure the numerator of the Rashomon ratio we will use the Rashomon volume $𝒱 ({\hat{R}}_{set} (ℱ, θ))$ , as defined in Semenova et al. [39]. In the case of ridge regression, the Rashomon set is an ellipsoid in $m$ -dimensions, thus the Rashomon volume can be computed directly. Therefore, we have the Rashomon ratio as the ratio of the Rashomon volume to the volume of the $ℓ_{2}$ -norm ball that defines the hypothesis space.

Next, we show that under additive attribute noise, the Rashomon ratio increases:

Theorem 7 (Rashomon ratio increases with noise for ridge regression). Consider dataset $S = X \times Y, X$ is a non-zero matrix, and a hypothesis space of linear models $ℱ = \{f = ω^{T} x, ω \in R^{m}, ω^{T} ω \leq {\hat{L}}_{max} / C\}$ . Let $ϵ_{i}$ , such that $ϵ_{i} ~ 𝒩 (\bar{0}, λ I)$ ( $λ > 0$ , I is identity matrix), be i.i.d. noise vectors added to every sample: $x_{i}^{'} = x_{i} + ϵ_{i}$ . Consider options $λ_{1} > 0$ and $λ_{2} > 0$ that control how much noise we add to the dataset. For ridge regression, if $λ_{1} < λ_{2}$ , then the Rashomon ratios obey ${\hat{R}}_{{ratio}_{λ_{1}}} (ℱ, θ)) < {\hat{R}}_{{ratio}_{λ_{2}}} (ℱ, θ))$ .

The proof of Theorem 7 is in Appendix F. Note that while in Theorem 7 we directly show that adding noise to the training data is equivalent to stronger regularization leading us directly to Step 3, Steps 1 and 2 of the path identified in the previous section are automatically satisfied. We formally prove that additive noise still leads to an increase of the variance of losses for least squares loss (similar to Theorems 2, 12, and 15) in Theorem 19 in Appendix F, and we show that an increase in the maximum variance of losses leads to worse generalization bound (similar to Theorem 4) for the squared loss in Theorem 20 in Appendix F.

Returning to the Path.

For ridge regression, we have now built a direct noise-to-Rashomon-ratio argument showing that, in the presence of noise, the Rashomon ratios are larger. As before, for larger Rashomon ratios, there are multiple good models, including simpler ones that are easier to find.

6. Rashomon Set Characteristics in the Presence of Noise

Now we discuss a different mechanism for obtaining larger Rashomon sets. Suppose the practitioner knows the data are noisy. They would then expect a large Rashomon set, which we speculate in this section is explained by noise in the data. To show this, we define pattern diversity as a characteristic of the Rashomon set and show that it is likely to increase with label noise. Pattern diversity is an empirical measure of differences in patterns on the dataset. It computes the average distance between the patterns, which allows us not only to assess how large the Rashomon set is but also how diverse it is. Once the practitioner knows they have a large Rashomon set, they could hypothesize from the reasoning in Semenova et al. [39] that a simple model might perform well for their dataset.

6.1. Pattern Diversity: Definition, Properties and Upper Bound

Recall that $π (ℱ, θ)$ is the set of unique classification patterns produced by the Rashomon set of $ℱ$ with the Rashomon parameter $θ$ .

Definition 8 (Pattern diversity). For Rashomon set ${\hat{R}}_{set} (ℱ, θ)$ the pattern diversity div $({\hat{R}}_{set} (ℱ, θ))$ is defined as:

div ({\hat{R}}_{s e t} (ℱ, θ)) = \frac{1}{n Π Π} \sum_{\begin{matrix} j \leq Π \\ p_{j} ~ π (ℱ, θ) \end{matrix}} \sum_{\begin{matrix} k \leq Π \\ p_{k} ~ π (ℱ, θ) \end{matrix}} H (p_{j}, p_{k}),

where $n = | S |$ , $Π = | π (ℱ, θ) |$ , and $H (p_{j}, p_{k}) = \sum_{i = 1}^{n} 1_{[p_{j}^{i} \neq p_{k}^{i}]}$ is the Hamming distance (in our case it computes the number of samples at which predictions are different), and |·| denotes cardinality.

Pattern diversity measures pairwise differences between patterns of functions in the pattern Rashomon set. Pattern diversity is in the range [0, 1), where it is 0 if the pattern set contains one pattern or no patterns. Among different measures of the Rashomon set, the pattern diversity is the closest to the pattern Rashomon ratio [38] and expected pairwise disagreement (as in [6]). In Appendix G, we discuss similarities and differences between pattern diversity and these measures.

Given a sample $z_{i}$ , let $a_{i} = \frac{1}{Π} \sum_{k = 1}^{Π} 1_{[p_{k}^{i} = y_{i}]}$ , where $p_{k}^{i}$ is $i^{th}$ index of the $k^{th}$ pattern, denote the probability with which patterns from the pattern Rashomon set classify $z_{i}$ correctly. We will call $a_{i}$ sample agreement over the pattern Rashomon set. When $a_{i} = 1$ , then all patterns agreed and correctly classified $z_{i}$ . If $a_{i} = \frac{1}{2}$ , only half of the models were able to correctly predict the label. As we will show, when more samples have sample agreement near $\frac{1}{2}$ , we have higher pattern diversity. We can compute pattern diversity using average sample agreements instead of the Hamming distance according to the theorem below.

Theorem 9 (Pattern diversity via sample agreement). For 0–1 loss, dataset $S$ , and pattern Rashomon set $π (ℱ, θ)$ , pattern diversity can be computed as $div ({\hat{R}}_{set} (ℱ, θ)) = \frac{2}{n} \sum_{i = 1}^{n} a_{i} (1 - a_{i})$ , where $a_{i} = \frac{1}{Π} \sum_{k = 1}^{Π} 1_{[p_{k}^{i} = y_{i}]}$ is sample agreement over the pattern Rashomon set.

The proof of Theorem 9 is in Appendix H. In Theorem 23 in Appendix I, we show that average sample agreement (over all samples $z_{i}$ ) is inversely proportional to the average loss of the patterns from the pattern Rashomon set. Using this intuition, we can upper bound the pattern diversity by the empirical risk of the empirical risk minimizer and the Rashomon parameter $θ$ .

Theorem 10 (Upper bound on pattern diversity). Consider hypothesis space $ℱ$ , 0–1 loss, and empirical risk minimizer $\hat{f}$ . For any $θ \geq 0$ , pattern diversity can be upper bounded by

d i v ({\hat{R}}_{set} (ℱ, θ)) \leq 2 (\hat{L} (\hat{f}) + θ) (1 - (\hat{L} (\hat{f}) + θ)) + 2 θ .

(2)

The proof of Theorem 10 is in Appendix I. The bound (2) emphasizes how important the performance of the empirical risk minimizer is for understanding pattern diversity. If dataset is well separated so that the empirical risk is small, then pattern diversity will also be small, as there are not many different ways to misclassify points and stay within the Rashomon set. As dataset becomes noisier, in expectation, we expect the empirical risk to increase and thus pattern diversity as well. We will show theoretically and experimentally that pattern diversity is likely to increase under label noise.

6.2. Label Noise is Likely to Increase Pattern Diversity

Let $S_{ρ}$ be a version of $S$ with uniformly random label noise, creating randomly perturbed labels $\tilde{y}$ with probability $0 < ρ < \frac{1}{2} : P ({\tilde{y}}_{i} \neq y_{i}) = ρ$ . Let $Ω (S_{ρ})$ be the uniform distribution over all $S_{ρ}$ . From Theorem 10 denote the upper bound on pattern diversity as $U_{div} ({\hat{R}}_{set} (ℱ, θ)) = 2 (\hat{L} (\hat{f}) + θ) (1 - (\hat{L} (\hat{f}) + θ)) + 2 θ$ . We show that it increases with uniform label noise.

Theorem 11 (Upper bound on pattern diversity increases with label noise). Consider a hypothesis space $ℱ$ , 0–1 loss, and a dataset $S$ . Let $ρ \in (0, \frac{1}{2})$ be the probability with which each label $y_{i}$ is flipped independently, and let $S_{ρ} ~ Ω (S_{ρ})$ denote a noisy version of $S$ . For the Rashomon parameter $θ \geq 0$ , if ${i n f}_{f \in ℱ} E_{S_{ρ}} {\hat{L}}_{S_{ρ}} (f) < \frac{1}{2} - θ$ and ${\hat{L}}_{S} ({\hat{f}}_{S}) < \frac{1}{2}$ , then adding noise to the dataset increases the upper bound on pattern diversity of the expected Rashomon set:

U_{d i v} ({\hat{R}}_{{s e t}_{S}} (ℱ, θ)) < U_{d i v} ({\hat{R}}_{{set}_{E_{S_{ρ}} ~ Ω (S_{ρ})} S_{ρ}} (ℱ, θ)) .

The proof of Theorem 11 is in Appendix J. In the general case, it is challenging to find a closed-form formula for pattern diversity or design a lower bound without strong assumptions about the data distribution or the hypothesis space. Therefore, we empirically examine the behavior of pattern diversity alongside other characteristics of the Rashomon set for different datasets and show these characteristics tend to increase with more noise.

6.3. Experiment for Pattern Diversity and Label Noise

We expect many different datasets to have larger Rashomon set measurements as data become more noisy. As before, we consider uniform label noise, where each label is flipped independently with probability $ρ$ . For different noise levels, we compute diversity, and the number of patterns in the Rashomon set for 12 different datasets for the hypothesis space of sparse decision trees of depth 4 (see Figure 3 (a)–(b)). For every dataset, we introduce up to 25% label noise $(ρ \in [0, 0.25])$ , or Accuracy $(\hat{f}) - 50 %$ , whichever is smaller. This means that data with lower accuracy (before adding noise) will have shorter plots, since noise is along the horizontal axis of our plots in Figure 3. For each noise level $ρ$ , we do 25 draws of $S_{ρ}$ , and for each draw $S_{ρ}$ we recompute the Rashomon set. For decision trees we use TreeFARMS [52], which allows us to compute the number of trees in the Rashomon set. We discuss results for the hypothesis space of linear models in Appendix K.5.

Some key observations from Figure 3: First, the number of trees and patterns in the Rashomon set on average increases with noise. This means that the Rashomon ratio and pattern Rashomon ratio increase with noise as well since the hypothesis space stays the same. Second, noisier datasets, that initially had lower empirical risk (e.g., Compas, Coffee House, FICO) have more models in the Rashomon set (and thus higher Rashomon ratios) as compared to datasets with lower empirical risk (Car Evaluation, Monks1, Monks3). Finally, pattern diversity, on average, tends to increase with noise for the majority of the datasets, except FICO and Bar, where it stays about the same (most likely due to larger inherent noise in data before we added more noise).

Returning to the Path.

If the practitioner observes more noise in the data, it could be already the case that the Rashomon set is large. Then, there are many good models, among which simpler or interpretable model are likely to exist [39].

7. Limitations

While we believe our results have illuminated that the Rashomon effect often exists when data are noisy, the connection between noise and increased Rashomon metrics is mostly supported by experiments, rather than a tight set of theoretical bounds. Specifically, we do not have a lower bound on diversity nor a correlation between diversity and the pattern Rashomon ratio (or Rashomon ratio) yet. This connection deserves further exploration and strengthening.

In previous sections, we showed that we expect an increase in pattern diversity while adding more label noise. In Section 6.3, we also observe an increase in the number of trees and patterns in the Rashomon set under label noise. In both cases, we used 0–1 loss. While in Section 5 we studied ridge regression (and thus used squared loss), ways to extend our results (both experimental and theoretical) to different distance-based classification loss functions are not yet fully clear. In the case of classification with distance-based losses, such as exponential loss, the effect of label noise can be difficult to model without taking properties of the data distribution into account. For example, regardless of the amount of noise in the dataset, a single misclassified outlier could essentially define the Rashomon set. The exponential loss could be very sensitive to the position of this outlier, leading to a small Rashomon set. This does not apply to the analysis in this paper, since the 0–1 loss we used is robust to outliers. Perhaps techniques that visualize the loss landscape [26] can be helpful in characterizing the Rashomon set for distance-based classification losses.

Conclusion

Our results have profound policy implications, as they underscore the critical need to prioritize interpretable models in high-stakes decision-making. In a world where black box models are often used for high-stakes decisions, and yet the data generation processes are known to be noisy, our work sheds light on the false premise of this dangerous practice – that black box models are likely to be more accurate. Our findings have particular relevance for critical domains such as criminal justice, healthcare, and loan decisions, where individuals are subjected to the outputs of these models. The use of interpretable models in these areas can safeguard the rights and well-being of these individuals and ensure that decision-making processes are transparent, fair, and accountable.

Supplementary Material

NIHMS1981181-supplement-1.pdf^{(1.2MB, pdf)}

Acknowledgments

We gratefully acknowledge support from grants DOE DE-SC0023194, NIH/NIDA R01 DA054994, and NSF IIS-2130250.

References

[1].Ahanor Izuwa, Medal Hugh, and Trapp Andrew C. DiversiTree: A new method to efficiently compute diverse sets of near-optimal solutions to mixed-integer optimization problems. IN-FORMS Journal on Computing, 2023. [Google Scholar]
[2].Aïvodji Ulrich, Arai Hiromi, Gambs Sébastien, and Hara Satoshi. Characterizing the risk of fairwashing. In Advances in Neural Information Processing Systems, volume 34, pages 14822–14834, 2021. [Google Scholar]
[3].Angelino Elaine, Larus-Stone Nicholas, Alabi Daniel, Seltzer Margo, and Rudin Cynthia. Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 35–44, 2017. [Google Scholar]
[4].Bartlett Peter, Freund Yoav, Lee Wee Sun, and Schapire Robert E. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998. [Google Scholar]
[5].Bishop Chris M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995. [Google Scholar]
[6].Black Emily, Raghavan Manish, and Barocas Solon. Model multiplicity: Opportunities, concerns, and solutions. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 850–863, 2022. [Google Scholar]
[7].Breiman Leo. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231, 2001. [Google Scholar]
[8].Brunet Marc-Etienne, Anderson Ashton, and Zemel Richard. Implications of model indeterminacy for explanations of automated decisions. In Advances in Neural Information Processing Systems, volume 35, pages 7810–7823, 2022. [Google Scholar]
[9].Coston Amanda, Rambachan Ashesh, and Chouldechova Alexandra. Characterizing fairness over the set of good models under selective labels. In International Conference on Machine Learning, pages 2144–2155. PMLR, 2021. [Google Scholar]
[10].Cover Thomas M.. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput, 14:326–334, 1965. [Google Scholar]
[11].Cucker Felipe and Smale Steve. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1–49, 2002. [Google Scholar]
[12].Damian Alex, Ma Tengyu, and Lee Jason D. Label noise sgd provably prefers flat global minimizers. In Advances in Neural Information Processing Systems, volume 34, pages 27449–27461, 2021. [Google Scholar]
[13].D’Amour Alexander, Heller Katherine, Moldovan Dan, Adlam Ben, Alipanahi Babak, Beutel Alex, Chen Christina, Deaton Jonathan, Eisenstein Jacob, Hoffman Matthew D, et al. Under-specification presents challenges for credibility in modern machine learning. The Journal of Machine Learning Research, 23(1):10237–10297, 2022. [Google Scholar]
[14].Dong Jiayun and Rudin Cynthia. Exploring the cloud of variable importance for the set of all good models. Nature Machine Intelligence, 2(12):810–824, 2020. [Google Scholar]
[15].Fisher Aaron, Rudin Cynthia, and Dominici Francesca. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research, 20(177):1–81, 2019. [PMC free article] [PubMed] [Google Scholar]
[16].Ghosh Aritra, Kumar Himanshu, and Sastry P Shanti. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. [Google Scholar]
[17].Holte Robert C.. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–91, 1993. [Google Scholar]
[18].Hsu Hsiang and Calmon Flavio. Rashomon capacity: A metric for predictive multiplicity in classification. In Advances in Neural Information Processing Systems, volume 35, pages 28988–29000, 2022. [Google Scholar]
[19].Hu Wei, Li Zhiyuan, and Yu Dingli. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020. [Google Scholar]
[20].Kearns Michael. A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split. In Advances in Neural Information Processing Systems, volume 8, 1995. [Google Scholar]
[21].Khozeimeh Fahime, Alizadehsani Roohallah, Roshanzamir Mohamad, Khosravi Abbas, Layegh Pouran, and Nahavandi Saeid. An expert system for selecting wart treatment method. Computers in biology and medicine, 81:167–175, 2017. [DOI] [PubMed] [Google Scholar]
[22].Khozeimeh Fahime, Jabbari Azad Farahzad, Mahboubi Oskouei Yaghoub, Jafari Majid, Tehranian Shahrzad, Alizadehsani Roohallah, and Layegh Pouran. Intralesional immunotherapy compared to cryotherapy in the treatment of warts. International journal of dermatology, 56(4): 474–478, 2017. [DOI] [PubMed] [Google Scholar]
[23].Kowal Daniel R. Fast, optimal, and targeted predictions using parameterized decision analysis. Journal of the American Statistical Association, 117(540):1875–1886, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
[24].Lee Kuang-Huei, He Xiaodong, Zhang Lei, and Yang Linjun. Cleannet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5447–5456, 2018. [Google Scholar]
[25].Lee Yoonho, Yao Huaxiu, and Finn Chelsea. Diversify and disambiguate: Out-of-distribution robustness via disagreement. In The Eleventh International Conference on Learning Representations, 2023. [Google Scholar]
[26].Li Hao, Xu Zheng, Taylor Gavin, Studer Christoph, and Goldstein Tom. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, volume 31, 2018. [Google Scholar]
[27].Li Mingchen, Soltanolkotabi Mahdi, and Oymak Samet. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International conference on artificial intelligence and statistics, pages 4313–4324. PMLR, 2020. [Google Scholar]
[28].Li Wen, Wang Limin, Li Wei, Agustsson Eirikur, and Van Gool Luc. Webvision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862, 2017. [Google Scholar]
[29].Lin Jimmy, Zhong Chudi, Hu Diane, Rudin Cynthia, and Seltzer Margo. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pages 6150–6160. PMLR, 2020. [Google Scholar]
[30].Marx Charles, Calmon Flavio, and Ustun Berk. Predictive multiplicity in classification. In International Conference on Machine Learning, pages 6765–6774. PMLR, 2020. [Google Scholar]
[31].Mason Blake, Jain Lalit, Mukherjee Subhojyoti, Camilleri Romain, Jamieson Kevin, and Nowak Robert. Nearly optimal algorithms for level set estimation. In International Conference on Artificial Intelligence and Statistics, pages 7625–7658. PMLR, 2022. [Google Scholar]
[32].Mata Kota, Kanamori Kentaro, and Arimura Hiroki. Computing the collection of good models for rule lists. arXiv preprint arXiv:2204.11285, 2022. [Google Scholar]
[33].Mukherjee Sayan, Niyogi Partha, Poggio Tomaso, and Rifkin Ryan. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25:161–193, 2006. [Google Scholar]
[34].Natarajan Nagarajan, Dhillon Inderjit S, Ravikumar Pradeep K, and Tewari Ambuj. Learning with noisy labels. In Advances in Neural Information Processing Systems, volume 26, 2013. [Google Scholar]
[35].Noh Hyeonwoo, You Tackgeun, Mun Jonghwan, and Han Bohyung. Regularizing deep neural networks by noise: Its interpretation and optimization. In Advances in Neural Information Processing Systems, volume 30, 2017. [Google Scholar]
[36].Pawelczyk Martin, Broelemann Klaus, and Kasneci Gjergji. On counterfactual explanations under predictive multiplicity. In Conference on Uncertainty in Artificial Intelligence, pages 809–818. PMLR, 2020. [Google Scholar]
[37].Ross Andrew, Pan Weiwei, Celi Leo, and Doshi-Velez Finale. Ensembles of locally independent prediction models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5527–5536, 2020. [Google Scholar]
[38].Rudin Cynthia, Chen Chaofan, Chen Zhi, Huang Haiyang, Semenova Lesia, and Zhong Chudi. Interpretable machine learning: Fundamental principles and 10 grand challenges. Statistics Surveys, 16:1–85, 2022. [Google Scholar]
[39].Semenova Lesia, Rudin Cynthia, and Parr Ronald. On the existence of simpler machine learning models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1827–1858, 2022. [Google Scholar]
[40].Shahin Shamsabadi Ali, Yaghini Mohammad, Dullerud Natalie, Wyllie Sierra, Aïvodji Ulrich, Alaagib Aisha, Gambs Sébastien, and Papernot Nicolas. Washing the unwashable : On the (im)possibility of fairwashing detection. In Advances in Neural Information Processing Systems, volume 35, pages 14170–14182, 2022. [Google Scholar]
[41].Shallue Christopher J, Lee Jaehoon, Antognini Joseph, Sohl-Dickstein Jascha, Frostig Roy, and Dahl George E. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018. [Google Scholar]
[42].Smith Gavin, Mansilla Roberto, and Goulding James. Model class reliance for random forests. In Advances in Neural Information Processing Systems, volume 33, pages 22305–22315, 2020. [Google Scholar]
[43].Song Hwanjun, Kim Minseok, and Lee Jae-Gil. Selfie: Refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pages 5907–5915. PMLR, 2019. [Google Scholar]
[44].Song Hwanjun, Kim Minseok, Park Dongmin, Shin Yooju, and Lee Jae-Gil. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022. [DOI] [PubMed] [Google Scholar]
[45].Teney Damien, Peyrard Maxime, and Abbasnejad Ehsan. Predicting is not understanding: Recognizing and addressing underspecification in machine learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIII, pages 458–476. Springer, 2022. [Google Scholar]
[46].Tulabandhula Theja and Rudin Cynthia. Robust optimization using machine learning for uncertainty sets. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2014. [Google Scholar]
[47].Verwer Sicco and Zhang Yingqian. Learning decision trees with flexible constraints and objectives using integer optimization. In Integration of AI and OR Techniques in Constraint Programming: 14th International Conference, CPAIOR 2017, Padua, Italy, June 5–8, 2017, Proceedings 14, pages 94–103. Springer, 2017. [Google Scholar]
[48].Wang Zijie J, Zhong Chudi, Xin Rui, Takagi Takuya, Chen Zhi, Chau Duen Horng, Rudin Cynthia, and Seltzer Margo. TimberTrek: Exploring and curating sparse decision trees with interactive visualization. In 2022 IEEE Visualization and Visual Analytics (VIS), pages 60–64. IEEE, 2022. [Google Scholar]
[49].Watson-Daniels Jamelle, Parkes David C, and Ustun Berk. Predictive multiplicity in probabilistic classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10306–10314, 2023. [Google Scholar]
[50].Wen Yeming, Luk Kevin, Gazeau Maxime, Zhang Guodong, Chan Harris, and Ba Jimmy. An empirical study of large-batch stochastic gradient descent with structured covariance noise. arXiv preprint arXiv:1902.08234, 2019. [Google Scholar]
[51].Xiao Tong, Xia Tian, Yang Yi, Huang Chang, and Wang Xiaogang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015. [Google Scholar]
[52].Xin Rui, Zhong Chudi, Chen Zhi, Takagi Takuya, Seltzer Margo, and Rudin Cynthia. Exploring the whole rashomon set of sparse decision trees. In Advances in Neural Information Processing Systems, volume 35, pages 14071–14084, 2022. [PMC free article] [PubMed] [Google Scholar]
[53].Yan Tom and Zhang Chicheng. Margin-distancing for safe model explanation. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 5104–5134. PMLR, 28–30 Mar 2022. [Google Scholar]
[54].Zhong Chudi, Chen Zhi, Liu Jiachang, Seltzer Margo, and Rudin Cynthia. Exploring and interacting with the set of good sparse generalized additive models. Advances in Neural Information Processing Systems, 2023. [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS1981181-supplement-1.pdf^{(1.2MB, pdf)}

[R1] [1].Ahanor Izuwa, Medal Hugh, and Trapp Andrew C. DiversiTree: A new method to efficiently compute diverse sets of near-optimal solutions to mixed-integer optimization problems. IN-FORMS Journal on Computing, 2023. [Google Scholar]

[R2] [2].Aïvodji Ulrich, Arai Hiromi, Gambs Sébastien, and Hara Satoshi. Characterizing the risk of fairwashing. In Advances in Neural Information Processing Systems, volume 34, pages 14822–14834, 2021. [Google Scholar]

[R3] [3].Angelino Elaine, Larus-Stone Nicholas, Alabi Daniel, Seltzer Margo, and Rudin Cynthia. Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 35–44, 2017. [Google Scholar]

[R4] [4].Bartlett Peter, Freund Yoav, Lee Wee Sun, and Schapire Robert E. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998. [Google Scholar]

[R5] [5].Bishop Chris M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995. [Google Scholar]

[R6] [6].Black Emily, Raghavan Manish, and Barocas Solon. Model multiplicity: Opportunities, concerns, and solutions. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 850–863, 2022. [Google Scholar]

[R7] [7].Breiman Leo. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231, 2001. [Google Scholar]

[R8] [8].Brunet Marc-Etienne, Anderson Ashton, and Zemel Richard. Implications of model indeterminacy for explanations of automated decisions. In Advances in Neural Information Processing Systems, volume 35, pages 7810–7823, 2022. [Google Scholar]

[R9] [9].Coston Amanda, Rambachan Ashesh, and Chouldechova Alexandra. Characterizing fairness over the set of good models under selective labels. In International Conference on Machine Learning, pages 2144–2155. PMLR, 2021. [Google Scholar]

[R10] [10].Cover Thomas M.. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput, 14:326–334, 1965. [Google Scholar]

[R11] [11].Cucker Felipe and Smale Steve. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1–49, 2002. [Google Scholar]

[R12] [12].Damian Alex, Ma Tengyu, and Lee Jason D. Label noise sgd provably prefers flat global minimizers. In Advances in Neural Information Processing Systems, volume 34, pages 27449–27461, 2021. [Google Scholar]

[R13] [13].D’Amour Alexander, Heller Katherine, Moldovan Dan, Adlam Ben, Alipanahi Babak, Beutel Alex, Chen Christina, Deaton Jonathan, Eisenstein Jacob, Hoffman Matthew D, et al. Under-specification presents challenges for credibility in modern machine learning. The Journal of Machine Learning Research, 23(1):10237–10297, 2022. [Google Scholar]

[R14] [14].Dong Jiayun and Rudin Cynthia. Exploring the cloud of variable importance for the set of all good models. Nature Machine Intelligence, 2(12):810–824, 2020. [Google Scholar]

[R15] [15].Fisher Aaron, Rudin Cynthia, and Dominici Francesca. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research, 20(177):1–81, 2019. [PMC free article] [PubMed] [Google Scholar]

[R16] [16].Ghosh Aritra, Kumar Himanshu, and Sastry P Shanti. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. [Google Scholar]

[R17] [17].Holte Robert C.. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–91, 1993. [Google Scholar]

[R18] [18].Hsu Hsiang and Calmon Flavio. Rashomon capacity: A metric for predictive multiplicity in classification. In Advances in Neural Information Processing Systems, volume 35, pages 28988–29000, 2022. [Google Scholar]

[R19] [19].Hu Wei, Li Zhiyuan, and Yu Dingli. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020. [Google Scholar]

[R20] [20].Kearns Michael. A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split. In Advances in Neural Information Processing Systems, volume 8, 1995. [Google Scholar]

[R21] [21].Khozeimeh Fahime, Alizadehsani Roohallah, Roshanzamir Mohamad, Khosravi Abbas, Layegh Pouran, and Nahavandi Saeid. An expert system for selecting wart treatment method. Computers in biology and medicine, 81:167–175, 2017. [DOI] [PubMed] [Google Scholar]

[R22] [22].Khozeimeh Fahime, Jabbari Azad Farahzad, Mahboubi Oskouei Yaghoub, Jafari Majid, Tehranian Shahrzad, Alizadehsani Roohallah, and Layegh Pouran. Intralesional immunotherapy compared to cryotherapy in the treatment of warts. International journal of dermatology, 56(4): 474–478, 2017. [DOI] [PubMed] [Google Scholar]

[R23] [23].Kowal Daniel R. Fast, optimal, and targeted predictions using parameterized decision analysis. Journal of the American Statistical Association, 117(540):1875–1886, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] [24].Lee Kuang-Huei, He Xiaodong, Zhang Lei, and Yang Linjun. Cleannet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5447–5456, 2018. [Google Scholar]

[R25] [25].Lee Yoonho, Yao Huaxiu, and Finn Chelsea. Diversify and disambiguate: Out-of-distribution robustness via disagreement. In The Eleventh International Conference on Learning Representations, 2023. [Google Scholar]

[R26] [26].Li Hao, Xu Zheng, Taylor Gavin, Studer Christoph, and Goldstein Tom. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, volume 31, 2018. [Google Scholar]

[R27] [27].Li Mingchen, Soltanolkotabi Mahdi, and Oymak Samet. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International conference on artificial intelligence and statistics, pages 4313–4324. PMLR, 2020. [Google Scholar]

[R28] [28].Li Wen, Wang Limin, Li Wei, Agustsson Eirikur, and Van Gool Luc. Webvision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862, 2017. [Google Scholar]

[R29] [29].Lin Jimmy, Zhong Chudi, Hu Diane, Rudin Cynthia, and Seltzer Margo. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pages 6150–6160. PMLR, 2020. [Google Scholar]

[R30] [30].Marx Charles, Calmon Flavio, and Ustun Berk. Predictive multiplicity in classification. In International Conference on Machine Learning, pages 6765–6774. PMLR, 2020. [Google Scholar]

[R31] [31].Mason Blake, Jain Lalit, Mukherjee Subhojyoti, Camilleri Romain, Jamieson Kevin, and Nowak Robert. Nearly optimal algorithms for level set estimation. In International Conference on Artificial Intelligence and Statistics, pages 7625–7658. PMLR, 2022. [Google Scholar]

[R32] [32].Mata Kota, Kanamori Kentaro, and Arimura Hiroki. Computing the collection of good models for rule lists. arXiv preprint arXiv:2204.11285, 2022. [Google Scholar]

[R33] [33].Mukherjee Sayan, Niyogi Partha, Poggio Tomaso, and Rifkin Ryan. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25:161–193, 2006. [Google Scholar]

[R34] [34].Natarajan Nagarajan, Dhillon Inderjit S, Ravikumar Pradeep K, and Tewari Ambuj. Learning with noisy labels. In Advances in Neural Information Processing Systems, volume 26, 2013. [Google Scholar]

[R35] [35].Noh Hyeonwoo, You Tackgeun, Mun Jonghwan, and Han Bohyung. Regularizing deep neural networks by noise: Its interpretation and optimization. In Advances in Neural Information Processing Systems, volume 30, 2017. [Google Scholar]

[R36] [36].Pawelczyk Martin, Broelemann Klaus, and Kasneci Gjergji. On counterfactual explanations under predictive multiplicity. In Conference on Uncertainty in Artificial Intelligence, pages 809–818. PMLR, 2020. [Google Scholar]

[R37] [37].Ross Andrew, Pan Weiwei, Celi Leo, and Doshi-Velez Finale. Ensembles of locally independent prediction models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5527–5536, 2020. [Google Scholar]

[R38] [38].Rudin Cynthia, Chen Chaofan, Chen Zhi, Huang Haiyang, Semenova Lesia, and Zhong Chudi. Interpretable machine learning: Fundamental principles and 10 grand challenges. Statistics Surveys, 16:1–85, 2022. [Google Scholar]

[R39] [39].Semenova Lesia, Rudin Cynthia, and Parr Ronald. On the existence of simpler machine learning models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1827–1858, 2022. [Google Scholar]

[R40] [40].Shahin Shamsabadi Ali, Yaghini Mohammad, Dullerud Natalie, Wyllie Sierra, Aïvodji Ulrich, Alaagib Aisha, Gambs Sébastien, and Papernot Nicolas. Washing the unwashable : On the (im)possibility of fairwashing detection. In Advances in Neural Information Processing Systems, volume 35, pages 14170–14182, 2022. [Google Scholar]

[R41] [41].Shallue Christopher J, Lee Jaehoon, Antognini Joseph, Sohl-Dickstein Jascha, Frostig Roy, and Dahl George E. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018. [Google Scholar]

[R42] [42].Smith Gavin, Mansilla Roberto, and Goulding James. Model class reliance for random forests. In Advances in Neural Information Processing Systems, volume 33, pages 22305–22315, 2020. [Google Scholar]

[R43] [43].Song Hwanjun, Kim Minseok, and Lee Jae-Gil. Selfie: Refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pages 5907–5915. PMLR, 2019. [Google Scholar]

[R44] [44].Song Hwanjun, Kim Minseok, Park Dongmin, Shin Yooju, and Lee Jae-Gil. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022. [DOI] [PubMed] [Google Scholar]

[R45] [45].Teney Damien, Peyrard Maxime, and Abbasnejad Ehsan. Predicting is not understanding: Recognizing and addressing underspecification in machine learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIII, pages 458–476. Springer, 2022. [Google Scholar]

[R46] [46].Tulabandhula Theja and Rudin Cynthia. Robust optimization using machine learning for uncertainty sets. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2014. [Google Scholar]

[R47] [47].Verwer Sicco and Zhang Yingqian. Learning decision trees with flexible constraints and objectives using integer optimization. In Integration of AI and OR Techniques in Constraint Programming: 14th International Conference, CPAIOR 2017, Padua, Italy, June 5–8, 2017, Proceedings 14, pages 94–103. Springer, 2017. [Google Scholar]

[R48] [48].Wang Zijie J, Zhong Chudi, Xin Rui, Takagi Takuya, Chen Zhi, Chau Duen Horng, Rudin Cynthia, and Seltzer Margo. TimberTrek: Exploring and curating sparse decision trees with interactive visualization. In 2022 IEEE Visualization and Visual Analytics (VIS), pages 60–64. IEEE, 2022. [Google Scholar]

[R49] [49].Watson-Daniels Jamelle, Parkes David C, and Ustun Berk. Predictive multiplicity in probabilistic classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10306–10314, 2023. [Google Scholar]

[R50] [50].Wen Yeming, Luk Kevin, Gazeau Maxime, Zhang Guodong, Chan Harris, and Ba Jimmy. An empirical study of large-batch stochastic gradient descent with structured covariance noise. arXiv preprint arXiv:1902.08234, 2019. [Google Scholar]

[R51] [51].Xiao Tong, Xia Tian, Yang Yi, Huang Chang, and Wang Xiaogang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015. [Google Scholar]

[R52] [52].Xin Rui, Zhong Chudi, Chen Zhi, Takagi Takuya, Seltzer Margo, and Rudin Cynthia. Exploring the whole rashomon set of sparse decision trees. In Advances in Neural Information Processing Systems, volume 35, pages 14071–14084, 2022. [PMC free article] [PubMed] [Google Scholar]

[R53] [53].Yan Tom and Zhang Chicheng. Margin-distancing for safe model explanation. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 5104–5134. PMLR, 28–30 Mar 2022. [Google Scholar]

[R54] [54].Zhong Chudi, Chen Zhi, Liu Jiachang, Seltzer Margo, and Rudin Cynthia. Exploring and interacting with the set of good sparse generalized additive models. Advances in Neural Information Processing Systems, 2023. [PMC free article] [PubMed] [Google Scholar]

PERMALINK

A Path to Simpler Models Starts With Noise

Lesia Semenova

Harry Chen

Ronald Parr

Cynthia Rudin

Abstract

1. Introduction

2. Related Work

Rashomon set.

Metrics of the Rashomon set.

Learning with noise.

3. Notation and Definitions

4. Increase in Variance due to Noise Leads to Larger Rashomon Ratios

4.1. Step 1. Noise Increases Variance

4.2. Step 2. Higher Variance Leads to Worse Generalization

4.3. Step 3. Practitioner Chooses a Simpler Hypothesis Space

Figure 1:

4.4. Step 4. Rashomon Ratio is Larger for Simpler Spaces

Figure 2:

Completion of the Path.

5. Rashomon Ratio for Ridge Regression Increases under Additive Attribute Noise

Returning to the Path.

6. Rashomon Set Characteristics in the Presence of Noise

6.1. Pattern Diversity: Definition, Properties and Upper Bound

6.2. Label Noise is Likely to Increase Pattern Diversity

6.3. Experiment for Pattern Diversity and Label Noise

Figure 3:

Returning to the Path.

7. Limitations

Conclusion

Supplementary Material

Acknowledgments

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases