
Robustifying Likelihoods by Optimistically Re-weighting Data

Miheer Dewaskar, Christopher Tosh, Jeremias Knoblauch, and David B. Dunson

Abstract

Likelihood-based inferences have been remarkably successful in wide-spanning application areas. However, even after due diligence in selecting a good model for the data at hand, there is inevitably some amount of model misspecification: outliers, data contamination or inappropriate parametric assumptions such as Gaussianity mean that most models are at best rough approximations of reality. A significant practical concern is that for certain inferences, even small amounts of model misspecification may have a substantial impact; a problem we refer to as brittleness. This article attempts to address the brittleness problem in likelihood-based inferences by choosing the most model friendly data generating process in a distance-based neighborhood of the empirical measure. This leads to a new Optimistically Weighted Likelihood (OWL), which robustifies the original likelihood by formally accounting for a small amount of model misspecification. Focusing on total variation (TV) neighborhoods, we study theoretical properties, develop estimation algorithms and illustrate the methodology in applications to mixture models and regression.

Keywords: Coarsened Bayes, Data contamination, Mixture models, Model misspecification, Outliers, Robust inference, Total variation distance

1. Introduction

When the likelihood is correctly specified, there is arguably no substitute for likelihood-based statistical inferences (see e.g. Zellner, 1988; Royall, 2017). However, if the likelihood is misspecified, inferences may be flawed (Huber, 1964; Tsou and Royall, 1995; Huber and Ronchetti, 2009; Hampel et al., 2005; Miller and Dunson, 2019). This has motivated methods for model comparison and goodness-of-fit assessment (see e.g. Huber-Carol et al., 2012; Claeskens and Hjort, 2008). A key step is to verify that the assumed likelihood is consistent with the data at hand. Yet, even with substantial care in model assessment, some amount of model misspecification is inevitable.

Unfortunately, even slight model misspecification can have dire consequences in certain settings; a problem we refer to as brittleness. Brittleness can occur in various applications, including in high dimensional problems (e.g. Bradic et al., 2011; Bradic, 2016), in the presence of outliers and contaminating distributions (e.g. Huber, 1964; Huber and Ronchetti, 2009; Hampel et al., 2005), or for mixture models (e.g. Markatou, 2000; Chandra et al., 2023; Liu and Moitra, 2023; Cai et al., 2021). This article is focused on robustifying likelihoods to avert such brittleness.

Figure 1 illustrates brittleness and our proposed solution in the setting of model-based clustering with kernel mixture models. Here, most of the data are perfectly modeled by a mixture of two well-separated Gaussians. However, a small fraction of the data have been corrupted, and are instead drawn uniformly from an interval between the two modes. As the left panel demonstrates, the maximum likelihood estimate (MLE) is sensitive to the corrupted data. Our proposal is to instead maximize an optimistically weighted likelihood (OWL), which slightly perturbs the data to be more consistent with the assumed model. The left panel shows that maximizing the OWL reduces brittleness, while the right panel shows the corresponding optimistically re-weighted data.

Figure 1: An example illustrating failures of maximum likelihood estimation (MLE) for misspecified models. The data consist of $n=1000$ observations randomly generated from an equally-weighted mixture of two Gaussians with means $-2.5$ and $2.5$ and standard deviation $1/\sqrt{4}$; 5% of the data were corrupted and drawn i.i.d. from $\mathrm{Uniform}(-1,1)$. Left: The green dashed line denotes the MLE found using expectation maximization. The orange dashed line denotes the solution found by maximizing our optimistically weighted likelihood based on $\varepsilon_0=0.05$. Right: The optimistically re-weighted data that were used in place of the observed data to compute the MLE.

Although in principle one can address the issue of misspecification by designing more flexible models, e.g. via mixture or nonparametric models (Taplin, 1993; Beath, 2018), such flexibility often comes at the expense of interpretability, identifiability, and efficiency, both computational and statistical. Moreover, there is always the risk that one’s model is not flexible enough and is therefore subject to the brittleness concerns outlined above. These issues motivate the core approach of the field of robust estimation: rather than designing a complicated model that fits all aspects of the data generating process, one instead fits a model that is only approximately correct, albeit with a different methodology. At the heart of many robust estimation methods is the concept of the weighted likelihood.

For a likelihood function $f$, the weighted likelihood replaces $L(x_{1:n}\mid\theta)=\prod_{i=1}^{n} f(x_i\mid\theta)$ by the weighted counterpart

$$L_w(x_{1:n}\mid\theta)=\prod_{i=1}^{n} f(x_i\mid\theta)^{w_i} \qquad (1)$$

for a collection of weights $w=(w_i)_{i=1}^n$ that typically depend on $\theta$ and the data. Estimation of $\theta$ and $w$ often proceeds via iteratively re-weighted maximum likelihood estimation (IRW-MLE) by repeating the following steps until convergence: a) maximize the likelihood $\theta\mapsto L_w(x_{1:n}\mid\theta)$ based on the current $w$ and, b) update $w$ using the current $\theta$ (Avella Medina and Ronchetti, 2015; Maronna et al., 2019). A variety of classical robust methods (Huber and Ronchetti, 2009; Hampel et al., 2005) fall in this framework, providing weights measuring the influence of each observation (e.g., Figure 8 in Mancini et al., 2005). Typical practice assumes $w_i=w(x_i,\theta)$ with $w(\cdot)$ chosen so that $w_i\approx0$ for observations having very low likelihood under the presumed model; e.g., $w(x,\theta)$ is a function of the density $f(x\mid\theta)$ (Windham, 1995) or its cumulative distribution function (Field and Smith, 1994; Dupuis and Morgenthaler, 2002). See Supplement S1 for more discussion of these methods.
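To fix ideas, below is a minimal sketch of the IRW-MLE template in Python. The callables `weighted_mle` and `weight_fn` are hypothetical placeholders for a concrete model and weighting scheme, not functions from any particular package.

```python
import numpy as np

def irw_mle(x, weighted_mle, weight_fn, theta0, iters=100, tol=1e-8):
    """Iteratively re-weighted MLE template: alternate (a) weighted MLE
    and (b) weight updates until the parameter estimate stabilizes.

    weighted_mle(x, w)  -> theta maximizing the w-weighted likelihood
    weight_fn(x, theta) -> weights w_i = w(x_i, theta)
    """
    theta = np.asarray(theta0, dtype=float)
    w = weight_fn(x, theta)
    for _ in range(iters):
        theta_new = np.asarray(weighted_mle(x, w), dtype=float)  # step (a)
        w = weight_fn(x, theta_new)                              # step (b)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new, w
        theta = theta_new
    return theta, w
```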

Typical robust estimation methods focus on sensitivity to outliers, while our interest is in broader types of misspecification. In Figure S1 from Supplement S2, we illustrate how misspecification in the shape of the kernel can lead to errors in selecting the number of components in mixture models. We aim to develop a general weighted likelihood methodology that allows for model misspecification beyond outliers. We focus on robustness to a small amount of misspecification with respect to an appropriate metric d in the space of probability distributions.

A well-studied approach to achieving robustness is minimum distance estimation (Wolfowitz, 1957; William, 1981), which attempts to find a model that minimizes the distance to the generating distribution with respect to the metric $d$. While this method can be shown to be robust to misspecification in $d$ (Donoho and Liu, 1988), it is often challenging to compute (e.g. Yatracos, 1985) beyond suitable $\phi$-divergences (Lindsay, 1994). In continuous spaces, when $d$ is a $\phi$-divergence with smooth $\phi$, minimum distance estimation can be computed using weighted likelihood equations with $w(x,\theta)=\{A(\delta(x\mid\theta))-A(-1)\}/\{\delta(x\mid\theta)+1\}$, where $A$ is the residual adjustment function corresponding to $\phi$ and $\delta(x\mid\theta)=\hat p(x)/f(x\mid\theta)-1$ is the Pearson residual based on a density estimate $\hat p$ of the observed data (Basu and Lindsay, 1994; Basu et al., 1998; Markatou et al., 1998; Greco and Agostinelli, 2020). Unfortunately, most $\phi$-divergences outside the Hellinger and total-variation (TV) cases are not metrics, making interpretation difficult. Moreover, the important case of TV distance is excluded by this work since the corresponding $\phi$ is not smooth. Recent work (Cherief-Abdellatif and Alquier, 2020; Briol et al., 2019; Bernton et al., 2019) has provided tools for computing minimum distance estimators when $d$ is the maximum mean discrepancy (MMD) or Wasserstein distance; however, this approach relies on generative models and does not make use of the likelihood.

Here, we develop a weighted likelihood methodology that can perform approximate minimum distance estimation, providing a recipe for robustifying Bayesian and frequentist approaches by re-weighting the likelihood (see Supplement S2). For non-negative weights $w=(w_i)_{i=1}^n$ that sum to $n$, our key observation is that the weighted likelihood (1) is an ordinary (i.e. un-weighted) likelihood for perturbed data represented by the weighted empirical distribution $\hat Q_w=n^{-1}\sum_{i=1}^n w_i\delta_{x_i}$, where $\delta_x$ denotes the Dirac measure at $x$. Thus IRW-MLE simultaneously finds: i) an optimal data perturbation $\hat Q_{w^*}$, and ii) the ordinary maximum likelihood estimate $\hat\theta$ based on the perturbed data $\hat Q_{w^*}$. For approximate minimum distance estimation, we require the perturbations $\hat Q_w$ to lie in a $(d,\epsilon)$ neighborhood around the empirical distribution of the observed data $\hat P=n^{-1}\sum_{i=1}^n\delta_{x_i}$ for a user-specified choice of $\epsilon>0$.

Our optimal weights $w^*$ are determined using the principle of optimism: in the absence of information other than the constraint $d(\hat Q_w,\hat P)\le\epsilon$, we choose the perturbation $\hat Q_w$ closest to the model family in the Kullback-Leibler (KL) sense. If $\{P_\theta\}_{\theta\in\Theta}$ denotes our model family and $\mathrm{KL}$ is an estimator for the KL divergence, we adapt the IRW-MLE procedure (Algorithm 1) to solve the following Optimistic Kullback-Leibler (OKL) minimization:

$$\min_{\theta\in\Theta}\ \min_{w\,:\,d(\hat Q_w,\hat P)\le\epsilon}\ \mathrm{KL}(\hat Q_w\,\|\,P_\theta). \qquad (2)$$

When $\epsilon=0$, OKL minimization is equivalent to the usual MLE, and the choice $\epsilon>0$ reflects the degree of misspecification. If $P_0$ is the true distribution generating $x_1,\dots,x_n$, and $\epsilon$ is greater than the degree of misspecification $\varepsilon_0\equiv\min_{\theta\in\Theta}d(P_0,P_\theta)$, then any global minimizer $\hat\theta$ of (2) can be shown to asymptotically lie in the set $\{\theta : d(P_0,P_\theta)\le\epsilon\}$ (Proposition 1). Thus when $\epsilon$ is tuned to approximate $\varepsilon_0$ (Section 2.2.3), $\hat\theta$ will be a minimum distance estimator.

In this work, we show that the OKL is intimately related to the concept of the coarsened likelihood (Miller and Dunson, 2019) – a genuine likelihood that acknowledges small misspecification in terms of d. While even evaluating the coarsened likelihood requires the computation of a high-dimensional integral, we use techniques from large deviation theory (Dembo and Zeitouni, 2010) to show that the coarsened likelihood is asymptotically equivalent to the OKL. Thus, the OWL methodology can be seen as a bridge, connecting minimum distance estimation and weighted likelihood methods on one side and the coarsened likelihood approach on the other.

When $d(Q,P)=\int\log\frac{dQ}{dP}\,dQ$ is the KL divergence, the Lagrangian formulation of our OKL minimization problem corresponds to minimizing a Cressie-Read power-divergence between the observations and the model family (Ghosh and Basu, 2018). Since different computational strategies may be required across metrics, we focus on the total variation (TV) distance for $d$, which is easy to interpret, is a proper metric, and is robust to outliers and contamination (Donoho and Liu, 1988). Further, TV-based estimators have been shown to attain minimax estimation rates under Huber's contamination model (Chen et al., 2016).

Coarsened likelihoods (Miller and Dunson, 2019) in general, and the principle of optimism used here in particular, are related to statistical learning from fuzzy or imprecise data (Hüllermeier and Cheng, 2015; Hüllermeier, 2014); the key idea is to treat observed data as imprecise to facilitate robust learning (Lienen and Hüllermeier, 2021). Among various approaches to robust inference in regression (e.g. She and Owen, 2011; Rousseeuw and Leroy, 2005; Avella-Medina and Ronchetti, 2018), Bondell and Stefanski (2013) implement weighted least squares with weights learned within an empirical likelihood framework subject to a target weighted squared residual error. While OKL minimization (2) finds weights in a fashion similar to this and other empirical likelihood approaches (e.g. Choi et al., 2000), our optimization problem arises from large deviation formulas and leads to weights that are constrained to lie within a small total-variation ($\ell_1$) distance of the uniform weight vector.

The remainder of this paper is structured as follows: Section 2 presents the OWL methodology and the associated alternating optimization scheme, and discusses some formal robustness guarantees. Section 3 demonstrates the asymptotic connection between OWL and coarsened inference. Section 4 presents a suite of simulation experiments for the OWL methodology in both regression and clustering tasks. Section 5 uses OWL to infer the average intention-to-treat effect in a micro-credit study (Angelucci et al., 2015), whose inference using ordinary least squares (OLS) was shown to be brittle to the removal of a handful of observations (Broderick et al., 2023). Finally, we provide a robust clustering application for single-cell RNAseq data in Supplement S12.

2. Optimistically Weighted Likelihoods

As Figure 1 shows, maximum likelihood estimation can be brittle when the data generating distribution $P_0$ lies outside of, albeit suitably close to, the model family $\{P_\theta\}_{\theta\in\Theta}$. To mitigate brittleness under misspecification, we propose an Optimistically Weighted Likelihood (OWL) approach that iterates between (1) an optimistic (or model-aligned) re-weighting of the observed data points and (2) updating the parameter estimate by maximizing a weighted likelihood based on the current data weights.

In Section 2.1, we study this parameter inference methodology at the population level, where we formally allow $P_0$ to be misspecified as long as it is close to the model family in the total-variation (TV) distance. Here we introduce the population level Optimistic Kullback-Leibler (OKL) function with parameter $\epsilon\in[0,1]$ (Definition 2.1), and show that its minimizer will be a parameter $\theta$ for which $P_\theta$ is $\epsilon$-close to $P_0$ in TV distance. We provide formal robustness guarantees for these minimizers by analyzing a bias-distortion curve.

Motivated by this population analysis, in Section 2.2 we derive the OWL-based parameter estimation methodology when only samples $x_1,\dots,x_n$ from $P_0$ are available.

2.1. Population level Optimistic Kullback Leibler minimization

Let $\mathcal P(\mathcal X)$ denote the space of probability distributions on the data space $\mathcal X$. In principle, our methodology can accommodate misspecification in terms of a variety of probability metrics (e.g. MMD or Wasserstein) on $\mathcal P(\mathcal X)$, but here we mainly focus on the total variation (TV) distance for concreteness and interpretability. Let $d_{TV}(P,Q)=\sup_{A\subseteq\mathcal X}|P(A)-Q(A)|$ denote the TV metric between two probability distributions $P,Q\in\mathcal P(\mathcal X)$, where the supremum ranges over all measurable subsets $A$ of $\mathcal X$. Given a model family $\{P_\theta\}_{\theta\in\Theta}\subseteq\mathcal P(\mathcal X)$, we assume that the data generating distribution $P_0\in\mathcal P(\mathcal X)$ for the data population in question satisfies $d_{TV}(P_0,P_{\theta^*})\le\epsilon$ for a known value $\epsilon\ge0$ and some unknown $\theta^*\in\Theta$. In other words, we make the following assumption (satisfied under Huber's $\epsilon$-contamination model (Huber, 1964) given by $P_0=(1-\epsilon)P_{\theta^*}+\epsilon C$ for any $C\in\mathcal P(\mathcal X)$).

Assumption 2.1.

Given $\epsilon\ge0$ and the true data distribution $P_0$, the set of parameters $\Theta_I=\{\theta\in\Theta : d_{TV}(P_0,P_\theta)\le\epsilon\}$ is non-empty.

In the terminology of Liu and Lindsay (2009), Assumption 2.1 requires that our model be adequate for the data at level $\epsilon$ in terms of the TV neighborhood. While Liu and Lindsay (2009) develop a likelihood-ratio test for this assumption, here we focus on estimating parameters by maximizing a natural likelihood (Section 3), leading to a different optimization problem.

In general, under Assumption 2.1, it may only be possible to identify the set $\Theta_I$, rather than any particular $\theta^*\in\Theta$. Although such indeterminacy may be inherent, it is practically insignificant whenever $\epsilon$ is sufficiently small so that the distinction between two elements of $\Theta_I$ is irrelevant (Huber, 1964). In line with this insight, the goal throughout the rest of the paper will be to identify some parameter in $\Theta_I$. Indeed, selecting $\epsilon$ to approximate $\min_\theta d_{TV}(P_0,P_\theta)$ (see Section 2.2.3) will recover a minimizer of the map $\theta\mapsto d_{TV}(P_0,P_\theta)$.

At the population level, maximum likelihood parameter estimation amounts to minimizing the Kullback-Leibler (KL) function $\theta\mapsto\mathrm{KL}(P_0\,\|\,P_\theta)$ on the parameter space $\Theta$. Even under small amounts of misspecification, KL minimizers are very brittle, since any minimizer of the KL function must place sufficient probability mass wherever $P_0$ does, including on outliers. In contrast, TV distance is far less sensitive to outliers and slight shifts in the shape of the distribution. Hence one may minimize $\theta\mapsto d_{TV}(P_0,P_\theta)$ as a robust alternative, particularly under Assumption 2.1. However, direct minimization of TV distance over the parameter space $\Theta$ is difficult to implement in practice due to the lack of suitable optimization primitives (e.g. maximum likelihood estimators) and the non-convex and non-smooth nature of the optimization problem (see e.g. Yatracos, 1985).

Instead, we modify the KL loss by minimizing its first argument over the ϵ-neighborhood of P0 in the TV distance. The resulting function, which we term the Optimistic Kullback Leibler (OKL), is defined as follows.

Definition 2.1.

(Optimistic Kullback-Leibler) Given $P_0$ and $\epsilon>0$, the OKL function $I_\epsilon:\Theta\to[0,\infty]$ is defined as:

$$I_\epsilon(\theta)=\inf_{Q\in\mathcal B_\epsilon(P_0)}\mathrm{KL}(Q\,\|\,P_\theta), \qquad (3)$$

where $\mathcal B_\epsilon(P_0)=\{Q\in\mathcal P(\mathcal X) : d_{TV}(P_0,Q)\le\epsilon\}$ is the TV ball of radius $\epsilon$ around $P_0$. If $I_\epsilon(\theta)<\infty$, the underlying optimization over $\mathcal B_\epsilon(P_0)$ has a unique minimizer $Q_\theta$ called the I-projection (Csiszár, 1975).

The OKL function $\theta\mapsto I_\epsilon(\theta)$ measures the fit of a model $P_\theta$ to the data $P_0$, allowing for a degree $\epsilon$ of data re-interpretation in the TV distance before assessing model fit. More precisely, since $I_\epsilon(\theta)=\mathrm{KL}(Q_\theta\,\|\,P_\theta)$, the OKL is equal to the KL divergence between the optimistic (i.e. most model-aligned) distribution $Q_\theta\in\mathcal B_\epsilon(P_0)$ in the neighborhood of $P_0$ and the model $P_\theta$. Here, $\epsilon\ge0$ regulates the permitted degree of re-interpreting the data by controlling the neighborhood size.

The OKL function can be used to find a parameter from the set $\Theta_I$. Indeed, as illustrated in Figure 2 (right), the minimum value of zero for the OKL function is attained exactly on the set $\Theta_I$, since $Q_\theta=P_\theta$ if and only if $\theta\in\Theta_I$. However, the OKL can be non-convex as a function of $\theta$, so calculating the global minimizer of OKL may not be straightforward. Fortunately, the OKL lends itself to a feasible alternating optimization scheme (Figure 2; left) that is guaranteed to decrease the OKL objective or reach a saddle point under regularity conditions.

Figure 2: Population level description of OWL (left) and the OKL function (right). The point $P_\theta$ is labeled using $\theta\in\Theta$. The inference problem is to find a point in $\Theta_I\subseteq\Theta$ where the model family intersects the $\epsilon$-neighborhood $\mathcal B_\epsilon$ of the data distribution $P_0$. The set $\Theta_I$ is the set of minimizers of the OKL function $\theta\mapsto I_\epsilon(\theta)$ (right). Starting from an initial point $\theta_1\in\Theta$, the OWL procedure finds a saddle point of the OKL function by iterating $\theta_{t+1}=\arg\min_{\theta\in\Theta}\mathrm{KL}(Q_{\theta_t}\,\|\,P_\theta)$ for $t=1,2,\dots$ until convergence, where $Q_\theta=\arg\min_{Q\in\mathcal B_\epsilon}\mathrm{KL}(Q\,\|\,P_\theta)$ denotes the information projection (Amari, 2016) of the model $P_\theta$ onto the total variation (TV) neighborhood $\mathcal B_\epsilon$ of $P_0$. Iterations alternate between I-projection and weighted likelihood estimation steps, illustrated via solid and dashed lines.

We now summarize formal robustness guarantees for the global minimizers of the OKL function against perturbations of $P_0$ in small TV-neighborhoods around the model family; details can be found in Supplement S4. We call the mapping from $P_0$ to the global minimizers of the OKL function the OWL functional $\mathcal T_\epsilon:\mathcal P(\mathcal X)\to2^\Theta$, defined as $\mathcal T_\epsilon(P_0)\equiv\arg\min_{\theta\in\Theta}I_\epsilon(\theta)$, where $2^\Theta$ denotes the power-set of $\Theta$. Motivated by the robustness theory for minimum distance functionals (Donoho and Liu, 1988), when $P_0=P_{\theta^*}$ lies on the model family, we bound the growth of a bias-distortion (BD) curve $\delta\mapsto\mathrm{BD}_\epsilon(\delta\mid P_0)$ defined as $\mathrm{BD}_\epsilon(\delta\mid P_0)\equiv\sup\{\mathrm{Haus}(\mathcal T_\epsilon(P),\mathcal T_\epsilon(P_0)) : P\in\mathcal P(\mathcal X),\ d_{TV}(P,P_0)\le\delta\}$, where $\mathrm{Haus}$ denotes the Hausdorff distance on $2^\Theta$. Indeed, classically important robustness indicators like sensitivity and qualitative robustness are properties of the growth of a suitable BD curve when $\delta$ is close to zero, and the breakdown point is the point at which the BD curve has a vertical asymptote (Donoho and Liu, 1988, Figure 1). Given this setup, under suitable regularity conditions, our two main results in Supplement S4 show that the OWL functional is qualitatively robust with finite sensitivity (corollary of Lemma S8) and has a breakdown point of at least $\epsilon$ whenever $\epsilon$ is less than half of the best breakdown point for the model family among all Fisher consistent functionals (Lemma S8).

2.2. Optimistically Weighted Likelihood (OWL) estimation

Here we extend the population level methodology from Section 2.1 to handle the practical case when samples $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}P_0$ are available, providing a computable approximation to the alternating procedure in Figure 2. In particular, the I-projection step is now approximated by a convex optimization problem over weight vectors constrained to lie within the intersection of the $n$-dimensional probability simplex and the $\ell_1$ ball of radius $2\epsilon$ around the uniform probability vector; these optimal weights can be interpreted as an optimistic re-weighting of the original data points $x_1,\dots,x_n$ to match the current model estimate. Given these weights, a new parameter estimate is then found by maximizing the weighted likelihood. We call this the Optimistically Weighted Likelihood (OWL) method.

2.2.1. Approximating OKL by a finite dimensional optimization problem

Given observed data $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}P_0$, we start by approximating the OKL function $I_\epsilon(\theta)$ in terms of a finite dimensional optimization problem. Henceforth, we assume that the model family $\{P_\theta\}_{\theta\in\Theta}$ and the measure $P_0$ have densities $\{p_\theta\}_{\theta\in\Theta}$ and $p_0$ with respect to a common measure $\lambda$. We will focus on two cases of interest: when $\mathcal X$ is a discrete space and $\lambda$ is the counting measure, and when $\mathcal X=\mathbb R^d$ and $\lambda$ is the Lebesgue measure.

When $\mathcal X$ is discrete, we look to solve the optimization problem in eq. (3) over data re-weightings $Q=\sum_{i=1}^n w_i\delta_{x_i}$ as the weight vector $w=(w_1,\dots,w_n)$ varies over the $n$-dimensional probability simplex $\Delta_n$ and satisfies the TV constraint $\frac12\|w-o\|_1\le\epsilon$, where $o=(1/n,\dots,1/n)\in\Delta_n$. Formally, our finite space OKL approximation is given by

$$\hat I_{\epsilon,\mathrm{fin}}(\theta)=\inf_{w\in\Delta_n\,:\,\frac12\|w-o\|_1\le\epsilon}\ \sum_{i=1}^n w_i\log\frac{n\,w_i\,\hat p_{\mathrm{fin}}(x_i)}{p_\theta(x_i)}, \qquad (4)$$

where $\hat p_{\mathrm{fin}}(y)=|\{i\in[n] : x_i=y\}|/n$ is the histogram estimator for $p_0$ based on the observed data. An application of the log sum inequality (Cover and Thomas, 2006, Theorem 2.7.1) shows that the weights that solve eq. (4) have the appealing and natural property that $w_i=w_j$ whenever $x_i=x_j$. Moreover, when the support of $p_0$ is finite and contains the support of $p_\theta$, $\hat I_{\epsilon,\mathrm{fin}}(\theta)$ converges to $I_\epsilon(\theta)$ at rate $n^{-1/2}$, as demonstrated by the following result (Supplement S5). A modified proof can establish the weaker result of consistency of $\hat I_{\epsilon,\mathrm{fin}}(\theta)$ even when the support of $p_0$ is (countably) infinite, provided $\mathrm{supp}(p_\theta)\subseteq\mathrm{supp}(p_0)$. The latter condition is mild and holds whenever the model family has a fixed support (e.g. an exponential family) and $p_0$ is a Huber contamination of the model.

Theorem 1.

Suppose that $I_{\epsilon_0}(\theta)<\infty$ for some $\epsilon_0>0$, and pick $\delta>0$ and $\epsilon>\epsilon_0$. If $\mathrm{supp}(p_\theta)\subseteq\mathrm{supp}(p_0)$, $\mathrm{supp}(p_0)$ is finite, and $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}p_0$, then with probability at least $1-\delta$,

$$|I_\epsilon(\theta)-\hat I_{\epsilon,\mathrm{fin}}(\theta)|\le O\!\left(\frac{|\mathrm{supp}(p_0)|}{\epsilon-\epsilon_0}\sqrt{\frac1n\log\frac{|\mathrm{supp}(p_0)|}{\delta}}\right),$$

where $\mathrm{supp}(p)=\{x\in\mathcal X : p(x)>0\}$.
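To make eq. (4) concrete, the sketch below solves its inner optimization with the off-the-shelf convex solver cvxpy; this is an illustrative alternative to the consensus ADMM routine of Supplement S8, and the function name and interface are our own. The objective rewrites as $\sum_i \mathrm{rel\_entr}(w_i,q_i)$ with $q_i=p_\theta(x_i)/(n\,\hat p_{\mathrm{fin}}(x_i))$.

```python
import numpy as np
import cvxpy as cp

def okl_weights_finite(log_ptheta, phat_fin, eps):
    """Inner minimization of eq. (4): optimize w over the probability
    simplex intersected with the TV ball (1/2)||w - o||_1 <= eps.

    log_ptheta : log p_theta(x_i) for each observation
    phat_fin   : histogram estimate phat_fin(x_i) for each observation
    """
    n = len(log_ptheta)
    # sum_i w_i log(n w_i phat_i / ptheta_i) = sum_i rel_entr(w_i, q_i)
    q = np.exp(np.asarray(log_ptheta)) / (n * np.asarray(phat_fin))
    w = cp.Variable(n, nonneg=True)
    problem = cp.Problem(
        cp.Minimize(cp.sum(cp.rel_entr(w, q))),
        [cp.sum(w) == 1, cp.norm1(w - np.full(n, 1.0 / n)) <= 2 * eps],
    )
    problem.solve()
    return w.value, problem.value  # optimal weights and the value of (4)
```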

When $\mathcal X=\mathbb R^d$, the above approximation strategy needs to be modified, since $\mathrm{KL}(Q\,\|\,P_\theta)$ is unbounded whenever $P_\theta$ is a continuous distribution and $Q=\sum_{i=1}^n w_i\delta_{x_i}$ is a discrete distribution. However, using a suitable kernel $\kappa:\mathcal X\times\mathcal X\to[0,\infty)$ (e.g. $\kappa_h(x,y)=\frac{1}{(\sqrt{2\pi}\,h)^d}e^{-\|x-y\|^2/(2h^2)}$) one can establish the continuous space approximation (Supplement S6)

$$\hat I_\epsilon(\theta)=\inf_{w\in\hat\Delta_n\,:\,\frac12\|w-o\|_1\le\epsilon}\ \sum_{i=1}^n w_i\log\frac{n\,w_i\,\hat p(x_i)}{p_\theta(x_i)}, \qquad (5)$$

where $\hat p$ is a density estimator for $p_0$ based on $x_1,\dots,x_n$, $A$ is an $n\times n$ matrix with entries $A_{ij}=\frac{\kappa(x_i,x_j)}{n\,\hat p(x_i)}$, and $\hat\Delta_n=A\Delta_n$ is the image of the $n$-dimensional probability simplex under the linear operator $A$. We will typically take $\hat p(\cdot)=\frac1n\sum_{j=1}^n\kappa(\cdot,x_j)$ to be the kernel density estimate based on the same kernel $\kappa$, in which case $A$ is the stochastic matrix obtained by normalizing the rows of the kernel matrix $K=(\kappa(x_i,x_j))_{i,j\in[n]}$ to sum to one.

The continuous space approximation in eq. (5) yields the finite space approximation in eq. (4) as a special case when $\kappa(x,y)=\mathbb I\{x=y\}$ is taken to be the indicator kernel. The weight vectors in $\hat\Delta_n$ are always non-negative and approximately sum to one for large values of $n$, since $\sum_{i=1}^n A_{ij}=\frac1n\sum_{i=1}^n\frac{\kappa(x_i,x_j)}{\hat p(x_i)}\approx\int\frac{\kappa(x,x_j)}{\hat p(x)}p_0(x)\,dx\approx\int\kappa(x,x_j)\,dx=1$.
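As a small sketch of this construction (helper name ours), the function below builds $A$ for the Gaussian kernel. Because the kernel's normalizing constant cancels in $A_{ij}=\kappa(x_i,x_j)/(n\,\hat p(x_i))$ when $\hat p$ is the kernel density estimate with the same kernel, $A$ is simply the row-normalized kernel matrix.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_transport_matrix(x, h):
    """Row-normalized Gaussian kernel matrix A of Section 2.2.1:
    A_ij = kappa(x_i, x_j) / (n * phat(x_i)), where phat is the kernel
    density estimate built from the same kernel."""
    K = np.exp(-cdist(x, x, "sqeuclidean") / (2.0 * h**2))
    return K / K.sum(axis=1, keepdims=True)
```

The kernelized w-step of eq. (5) then optimizes over $w=Av$ with $v\in\Delta_n$.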

We now briefly describe the large sample convergence of $\hat I_\epsilon(\theta)$ to $I_\epsilon(\theta)$ shown in Supplement S6. Proving this exact statement is technically challenging, as it requires proving uniform convergence of the objective in eq. (5) when the weights are allowed to vary over the entire range $\hat\Delta_n$. To get around this, we restrict the optimization domain to $\hat\Delta_n^\beta=A\Delta_n^\beta$, where $\Delta_n^\beta=\{(v_1,\dots,v_n)\in\Delta_n : v_i\in[\frac{\beta}{n},\frac{1}{n\beta}]\}$ for a suitably small constant $\beta>0$. Thus, the estimator that we theoretically study is given by

$$\hat I_{\epsilon,\beta}(\theta)=\inf_{w\in\hat\Delta_n^\beta\,:\,\frac12\|w-o\|_1\le\epsilon}\ \sum_{i=1}^n w_i\log\frac{n\,w_i\,\hat p(x_i)}{p_\theta(x_i)}. \qquad (6)$$

Given this change, we can show the following result.

Theorem 2.

Suppose $\mathrm{supp}(p_0)=\mathrm{supp}(p_\theta)=\mathcal X$ is a compact subset of $\mathbb R^d$ and there exists a constant $\gamma>0$ such that $p_0(x),p_\theta(x)\in[\gamma,1/\gamma]$ for all $x\in\mathcal X$. Suppose that we use the probability kernel $\kappa(x,y)=\frac{1}{h^d}\phi(\|x-y\|/h)$ with bandwidth $h>0$, where $\kappa$ is positive semi-definite and $\phi$ is a bounded function with exponentially decaying tails. Assume that $p_0$ and $\log p_\theta$ are $\alpha$-Hölder smooth over $\mathcal X$, and suppose that we use the clipped density estimator $\hat p(x)=\min\{\max\{\frac1n\sum_{i=1}^n\kappa(x_i,x),\,\gamma\},\,1/\gamma\}$. Then for any constant $0<\beta\le\gamma^2/4$,

$$\left|\hat I_{\epsilon,\beta}(\theta)-I_\epsilon(\theta)\right|\le\tilde O\!\left(n^{-\frac12}h^{-d}+h^{\frac\alpha2}+\psi(h)\right),$$

with probability at least $1-1/n$, where $\psi(r)=\lambda(\mathcal X\setminus\mathcal X_{-r})/\lambda(\mathcal X)$, $\lambda(\cdot)$ is the $d$-dimensional Lebesgue measure, $\mathcal X_{-r}=\{x\in\mathcal X : B(x,r)\subseteq\mathcal X\}$, and $\tilde O(\cdot)$ hides constants and logarithmic factors.

Here $\psi(r)$ measures the fraction of the volume of $\mathcal X$ that is contained in the envelope of width $r$ closest to the boundary. For well-behaved sets, we expect $\psi(r)$ to decrease to $0$ as $r\to0$. For example, if $\mathcal X$ is a $d$-dimensional ball of radius $r_0$, then $\psi(r)=1-(1-\frac{r}{r_0})^d$.

We prove our theory with the truncated estimator $\hat I_{\epsilon,\beta}(\theta)$ instead of $\hat I_\epsilon(\theta)$ for technical reasons. By a suitable version of the sandwiching lemma (Supplement S3.2), under the assumptions of Theorem 2, we conjecture that the optimal weights in (5) lie in the set $\hat\Delta_n^\beta$ for a small enough constant $\beta>0$, in which case we will have $\hat I_{\epsilon,\beta}(\theta)=\hat I_\epsilon(\theta)$.

Given the large sample convergence of $\hat I_\epsilon(\theta)$ to $I_\epsilon(\theta)$ for every $\theta\in\Theta$, one expects that any estimator $\hat\theta\in\Theta$ that is a global minimizer of $\hat I_\epsilon$ will approach the set $\Theta_I$ (Assumption 2.1) as $n\to\infty$. A qualitative version of this result, proved in Supplement S7, is shown next.

Proposition 1.

Suppose that (i) $\mathcal X$ and $\Theta$ are compact metric spaces, and $(x,\theta)\mapsto\log p_\theta(x)$ is a continuous function from $\mathcal X\times\Theta\to\mathbb R$. Let $\epsilon\ge0$ be such that (ii) Assumption 2.1 is satisfied. Let $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}p_0$, and suppose that (iii) for any fixed $\theta\in\Theta$, we have $\hat I_\epsilon(\theta)\overset{P}{\to}I_\epsilon(\theta)$ as $n\to\infty$. If an estimator $\hat\theta\in\Theta$ satisfies $\hat I_\epsilon(\hat\theta)-\min_{\theta\in\Theta}\hat I_\epsilon(\theta)\overset{P}{\to}0$, then for every $\delta>0$,

$$\lim_{n\to\infty}P\left(d_{TV}(P_{\hat\theta},P_0)>\epsilon+\delta\right)=0.$$

Theorem 1 and Theorem 2 (with the conjectured $\beta=0$) provide sufficient conditions for condition (iii) in Proposition 1 to hold. Condition (i) is primarily used to show that the convergence in (iii) is uniform, i.e. $\sup_{\theta\in\Theta}|\hat I_\epsilon(\theta)-I_\epsilon(\theta)|\overset{P}{\to}0$. Finally, condition (ii) is satisfied when the data distribution $P_0=(1-\epsilon)P_{\theta^*}+\epsilon C$ is a contamination. Proposition 1 and the triangle inequality then bound the model estimation error: $d_{TV}(P_{\hat\theta},P_{\theta^*})\le2\epsilon+o_P(1)$.

2.2.2. OWL Methodology

Based on the computable approximation $\hat I_\epsilon$ (or $\hat I_{\epsilon,\mathrm{fin}}$) to the OKL function, Algorithm 1 uses an alternating optimization to minimize $\hat I_\epsilon(\theta)$ over $\theta$. The procedure simultaneously estimates a minimizer $\hat\theta$ and optimistic weights $w_1,\dots,w_n\ge0$ that sum to $n$.

In Supplement S8, we expand further on the computational details for the θ-step and w-step in Algorithm 1. In particular, the w-step is a convex optimization problem which we can solve by using the consensus ADMM algorithm (Boyd et al., 2011; Parikh and Boyd, 2014) based on proximal operators that can be efficiently computed. Further, the θ-step corresponds to maximizing a weighted likelihood, which can be performed for many models through simple modifications of procedures for the corresponding maximum likelihood estimation. As shown in Supplement S8, the latter computation is easy to perform exactly for exponential families, while for mixture models we can use a weighted variant of the ‘hard EM’ algorithm (Samdani et al., 2012).
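For illustration, here is a minimal sketch of one weighted 'hard EM' update for a spherical Gaussian mixture with equal variances and mixing weights; the full weighted hard-EM variant in Supplement S8 also updates variances and mixing proportions, which we omit here.

```python
import numpy as np

def weighted_hard_em_step(x, w, mus):
    """One theta-step update sketch: hard-assign each point to its nearest
    mean (the most likely component under equal spherical variances and
    mixing weights), then update each mean by the w-weighted average of
    its assigned points."""
    d2 = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    z = d2.argmin(axis=1)                                       # hard labels
    new_mus = mus.copy()
    for k in range(mus.shape[0]):
        mask = z == k
        if w[mask].sum() > 0:
            new_mus[k] = (w[mask, None] * x[mask]).sum(axis=0) / w[mask].sum()
    return new_mus, z
```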

In Supplement S10, we show that the iterates $\{\theta_t\}_{t\ge1}$ of Algorithm 1 decrease the objective value $\theta\mapsto\hat I_\epsilon(\theta)$ at each step, and that Algorithm 1 is an instance of the well-studied Majorization-Minimization (MM) (Hunter and Lange, 2004) and Difference of Convex (DC) (Le Thi and Pham Dinh, 2018) classes of algorithms. We use this to analyze the convergence of $\{\theta_t\}_{t\ge1}$ when $\{p_\theta\}_{\theta\in\Theta}$ is an exponential family (see Remark S10.1).

Algorithm 1.

OWL Methodology

Input: Model $\{p_\theta\}_{\theta\in\Theta}$, radius $\epsilon\ge0$, kernel $\kappa$, initial $\theta_1\in\Theta$, and iteration limit $T$.

for $t=1,\dots,T$ do

w-step: Find $w_t=(w_{t,1},\dots,w_{t,n})\in\mathbb R_{\ge0}^n$ that minimizes eq. (5) for $\theta=\theta_t$.

θ-step: Find $\theta_{t+1}$ that maximizes the weighted likelihood $\theta\mapsto\sum_{i=1}^n w_{t,i}\log p_\theta(x_i)$.

end for

Output: The robust parameter estimate $\theta_T$ and the data weights $n\,w_T$.

In practice, when the data lie in a continuous space, we often avoid using the kernel-based estimator eq. (5) to determine the weights in the w-step of Algorithm 1 because it greatly slows down the computation (see Supplement S8.1.3). Instead, we often perform the w-step by solving the unkernelized optimization problem using $\kappa(x,y)=\mathbb I\{x=y\}$. We demonstrate via simulations (Section 4) that the unkernelized version of the OWL procedure has equally good performance compared to the kernelized version with a suitably tuned bandwidth. The theoretical impact of this approximation can be explored by studying the limiting behavior of the OKL estimator for suitably fixed choices of $\kappa$ as $n\to\infty$.
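Putting the pieces together, the following is a minimal end-to-end sketch of the unkernelized Algorithm 1 for a spherical Gaussian location model, with the w-step delegated to the generic solver cvxpy rather than the ADMM routine of Supplement S8; the function name and defaults are our own.

```python
import numpy as np
import cvxpy as cp
from scipy.stats import multivariate_normal

def owl_gaussian_location(x, eps, iters=20):
    """Unkernelized OWL (Algorithm 1 with kappa(x, y) = 1{x = y}) for the
    model p_theta = N(theta, I_p). With distinct data points,
    phat_fin(x_i) = 1/n, so the w-step objective in eq. (4) reduces to
    sum_i w_i log(w_i / p_theta(x_i))."""
    n, p = x.shape
    o = np.full(n, 1.0 / n)
    theta, w_val = x.mean(axis=0), o.copy()   # initialize at the MLE
    for _ in range(iters):
        # w-step: optimistic re-weighting toward the current model fit
        logp = multivariate_normal(theta, np.eye(p)).logpdf(x)
        q = np.exp(np.clip(logp, -300, None))  # floor to avoid exact zeros
        w = cp.Variable(n, nonneg=True)
        cp.Problem(
            cp.Minimize(cp.sum(cp.rel_entr(w, q))),
            [cp.sum(w) == 1, cp.norm1(w - o) <= 2 * eps],
        ).solve()
        w_val = np.clip(w.value, 0.0, None)
        # theta-step: weighted MLE of the mean, i.e. a weighted average
        theta = (w_val[:, None] * x).sum(axis=0) / w_val.sum()
    return theta, n * w_val  # estimate and weights rescaled to sum to n
```

On data like that of Figure 1, the returned weights should be near zero on corrupted points and roughly uniform elsewhere, mirroring the re-weighting shown in the right panel there.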

2.2.3. Setting the corruption fraction ϵ

So far, we have assumed that the parameter $\epsilon\in(0,1)$, which can be interpreted as the fraction of corrupted samples in the population distribution, is fixed at a known value that satisfies Assumption 2.1. Now let us see how the population level analysis (Section 2.1) can inform our choice of $\epsilon$. Assumption 2.1 is satisfied as long as $\epsilon\ge\varepsilon_0$, where

$$\varepsilon_0=\min_{\theta\in\Theta}d_{TV}(P_0,P_\theta)=\min\left\{\epsilon\in[0,1] : \min_{\theta\in\Theta}I_\epsilon(\theta)=0\right\}.$$

Hence, in principle, we could set $\epsilon=\varepsilon_0$ to use OWL to perform minimum-TV estimation (Yatracos, 1985), which has the following advantages: (1) while directly minimizing TV distance is computationally intractable, the OWL methodology decomposes this problem into alternating convex optimization and weighted MLE steps, both of which are standard problems that often tend to be well-behaved, and (2) the OWL methodology provides us with weight vectors that can indicate outlying observations and relates minimum-TV estimation to likelihood based inference.

In order to choose $\epsilon\approx\varepsilon_0$ in practice, we define the function $\hat g(\epsilon)=\hat I_\epsilon(\hat\theta_\epsilon)$, where $\hat\theta_\epsilon$ is the parameter estimate computed by the OWL procedure for a given $\epsilon$. At the population level, the corresponding function $g(\epsilon)=\min_{\theta\in\Theta}I_\epsilon(\theta)$ is monotonically decreasing in $\epsilon$ until $\epsilon=\varepsilon_0$, at which point it remains at $0$. This introduces a kink, or elbow, at $\varepsilon_0$ that we hope to identify in the sample estimate $\hat g$. Thus, our $\epsilon$ search procedure is to compute $\hat g$ over a fixed grid of $\epsilon$-values, smooth the resulting curve, and then select amongst the points of largest curvature (computed numerically), where the curvature of a twice-differentiable function $f$ at a point $x$ is given by $f''(x)/(1+f'(x)^2)^{1.5}$ (Satopaa et al., 2011). Despite the various approximations involved, our simulation results (Section 4) show that the OWL procedure with such a tuned value of $\epsilon$ provides almost identical performance when compared with the OWL procedure with the true value of $\epsilon$.
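The elbow search itself is straightforward to sketch; the grid, smoother, and window size below are practitioner choices rather than prescriptions from the paper.

```python
import numpy as np

def select_epsilon(eps_grid, g_hat, window=5):
    """Elbow-finding sketch for Section 2.2.3: smooth the sampled curve
    ghat(eps) over the grid, then return the grid point of largest
    numerical curvature |f''| / (1 + f'^2)^1.5 (Satopaa et al., 2011)."""
    g = np.convolve(g_hat, np.ones(window) / window, mode="same")
    d1 = np.gradient(g, eps_grid)
    d2 = np.gradient(d1, eps_grid)
    curvature = np.abs(d2) / (1.0 + d1**2) ** 1.5
    return eps_grid[np.argmax(curvature)]
```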

2.3. OWL extension to non-identically-distributed data

While the population level analysis and theoretical results for the OKL estimator were derived under the assumption that data are generated i.i.d. from a distribution P0, the OWL procedure can be adapted to robustify likelihood based inference in the setting where the data are conditionally independent, but not necessarily identically distributed.

Suppose data $z_1,\dots,z_n\in\mathcal Z$ are conditionally independent, with the likelihood having the product form $p_\theta(z_{1:n})=\prod_{i=1}^n p_{\theta,i}(z_i)$ for known functions $\{p_{\theta,i}\}_{i=1}^n$. For example, if $z_i=(y_i,x_i)\in\mathbb R\times\mathcal X$ for $i=1,\dots,n$, this includes the case of regression models $\{q_\theta(y\mid x)\}_{\theta\in\Theta}$ under the setup $p_{\theta,i}(z_i)=q_\theta(y_i\mid x_i)$. Another example of this setup includes mixture models if we expand the parameter space to also include cluster assignments (see Supplement S8.2).

To robustify inference based on the product likelihood $p_\theta(z_{1:n})=\prod_{i=1}^n p_{\theta,i}(z_i)$, we can replace the w- and θ-steps in Algorithm 1 by analogous steps for the product likelihood. In particular, the modified w-step is given by

$$w_t=\underset{w\in\Delta_n\,:\,\frac12\|w-o\|_1\le\epsilon}{\arg\min}\left\{-\sum_{i=1}^n w_i\log p_{\theta_t,i}(z_i)+\sum_{i=1}^n w_i\log w_i\right\}$$

and the modified θ-step is given by

$$\theta_{t+1}=\underset{\theta\in\Theta}{\arg\max}\ \sum_{i=1}^n w_{t,i}\log p_{\theta,i}(z_i).$$

Despite our lack of theory in the non-identically-distributed case, we continue to see good empirical performance of OWL (see Section 4, Section 5, and Supplement S12).
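As a concrete instance of this extension, the sketch below implements OWL for the linear regression model used later in Sections 4 and 5, treating the residual scale as known for simplicity; the interface and defaults are our own.

```python
import numpy as np
import cvxpy as cp
from scipy.stats import norm

def owl_linear_regression(X, y, eps, sigma=1.0, iters=20):
    """OWL for conditionally independent data (Section 2.3) with
    p_{theta,i}(z_i) = N(y_i; x_i' beta, sigma^2). Alternates the
    modified w-step with a weighted least-squares theta-step."""
    n = X.shape[0]
    o = np.full(n, 1.0 / n)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # initialize at OLS
    w_val = o.copy()
    for _ in range(iters):
        # modified w-step: min_w -sum_i w_i log p_i + sum_i w_i log w_i
        logp = norm.logpdf(y, loc=X @ beta, scale=sigma)
        q = np.exp(np.clip(logp, -300, None))
        w = cp.Variable(n, nonneg=True)
        cp.Problem(
            cp.Minimize(cp.sum(cp.rel_entr(w, q))),
            [cp.sum(w) == 1, cp.norm1(w - o) <= 2 * eps],
        ).solve()
        w_val = np.clip(w.value, 0.0, None)
        # modified theta-step: weighted least squares via sqrt-w scaling
        s = np.sqrt(w_val)
        beta = np.linalg.lstsq(s[:, None] * X, s * y, rcond=None)[0]
    return beta, n * w_val
```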

3. Asymptotic connection to coarsened inference

The development of the OWL methodology in Section 2 followed from a presumed form of misspecification given by Assumption 2.1. An alternative way to frame and address such misspecification in a probabilistic framework was proposed by Miller and Dunson (2019), who introduced a Bayesian methodology centered around the concept of a coarsened likelihood defined as

$$L_\epsilon(\theta\mid x_{1:n})=P_\theta\left(d(\hat P_{Z_{1:n}},\hat P_{x_{1:n}})\le\epsilon\right), \qquad (7)$$

where $d$ is a suitably chosen discrepancy between empirical probability measures. Here, $\hat P_{x_{1:n}}=n^{-1}\sum_{i=1}^n\delta_{x_i}$ denotes the empirical distribution of the data $x_{1:n}$, and the probability is computed under $P_\theta$, the distribution underlying the artificial data $Z_1,\dots,Z_n\overset{\text{i.i.d.}}{\sim}P_\theta$ from which the random measure $\hat P_{Z_{1:n}}=n^{-1}\sum_{i=1}^n\delta_{Z_i}$ is constructed. The coarsened likelihood implicitly captures the likelihood of a probabilistic procedure in which idealized data are first generated by some model $P_\theta$ in the model class under consideration, but are then corrupted in such a way that the discrepancy between the empirical measures of the idealized data and the observed data is bounded by $\epsilon$.

When $d$ is an estimator for the KL divergence and an exponential prior is placed on $\epsilon$, Miller and Dunson (2019) showed that the Bayes posterior based on $L_\epsilon(\theta\mid x_{1:n})$ could be approximated by raising the likelihood to a power less than one in the formula for the standard posterior. However, to obtain a robustified alternative to maximum likelihood estimation, one may wish to maximize $\theta\mapsto L_\epsilon(\theta\mid x_{1:n})$ directly for a choice of $d$ that guarantees robustness (e.g. maximum mean discrepancy or the TV distance). Such an approach would in general be quite challenging, since evaluating eq. (7) corresponds to computing a high-dimensional integral.
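To see the difficulty concretely, a naive Monte Carlo evaluation of eq. (7) on a finite space might look as follows; this is an illustration of the cost, not a practical estimator, and the function name and interface are our own.

```python
import numpy as np

def coarsened_likelihood_mc(p_theta, obs_counts, eps, reps=100_000, seed=0):
    """Naive Monte Carlo estimate of the coarsened likelihood (7) on a
    finite space, with d the TV distance between empirical pmfs. The
    acceptance probability decays exponentially in n (cf. Theorem 3),
    so the number of replicates needed grows astronomically with n.

    p_theta    : model pmf over the finite support (sums to 1)
    obs_counts : observed count at each support point (sums to n)
    """
    rng = np.random.default_rng(seed)
    n = obs_counts.sum()
    p_obs = obs_counts / n
    z_counts = rng.multinomial(n, p_theta, size=reps)     # artificial data
    tv = 0.5 * np.abs(z_counts / n - p_obs).sum(axis=1)   # d(Phat_Z, Phat_x)
    return (tv <= eps).mean()
```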

In this section, we show that for large $n$, the coarsened likelihood can be approximately maximized via the OWL methodology when $d$ is an estimator for the TV distance. Specifically, if the observed data $x_1,\dots,x_n$ are generated i.i.d. from some distribution $P_0$ and $d$ satisfies some regularity conditions, then the quantity $-\frac1n\log L_\epsilon(\theta\mid x_{1:n})$ converges as $n\to\infty$ to a variant of $I_\epsilon(\theta)$ based on $d$. Hence, the OWL methodology asymptotically maximizes the coarsened likelihood $\theta\mapsto L_\epsilon(\theta\mid x_{1:n})$.

We state this asymptotic connection first for finite spaces and then for continuous spaces. These results rely on Sanov’s theorem from large deviation theory (Dembo and Zeitouni, 2010) and are proved in Supplement S9.

3.1. Asymptotic connection in finite spaces

Let $\mathcal X$ be a finite set and denote the space of probability distributions on $\mathcal X$ by the simplex $\Delta_{\mathcal X}\equiv\{q\in[0,1]^{\mathcal X} : \sum_{x\in\mathcal X}q(x)=1\}$. Let $\{p_\theta\}_{\theta\in\Theta}\subseteq\Delta_{\mathcal X}$ denote the collection of model distributions, and $p_0\in\Delta_{\mathcal X}$ denote the true data generating distribution. To connect the OKL with the coarsened likelihood in eq. (7), we take $d(p,q)=\frac12\|p-q\|_1$.

In this setting, we can show that $-\frac1n\log L_\epsilon(\theta\mid x_{1:n})$ converges in probability to the OKL function $I_\epsilon(\theta)$ at a rate of $n^{-1/2}$, as demonstrated by the following theorem.

Theorem 3.

Suppose that $I_{\epsilon_0}(\theta)<\infty$ for some $\epsilon_0>0$ and let $\delta>0$. If $\epsilon>\epsilon_0$ and $x_1,x_2,\dots,x_n\overset{\text{i.i.d.}}{\sim}p_0$, then with probability at least $1-\delta$,

$$\left|I_\epsilon(\theta)+\frac1n\log L_\epsilon(\theta\mid x_{1:n})\right|\le O\!\left(\frac{|\mathcal X|}{\epsilon-\epsilon_0}\sqrt{\frac1n\log\frac1\delta}+\frac{|\mathcal X|}{n}\cdot\frac{|\mathcal X|+\log(n+1)}{\epsilon-\epsilon_0}\right).$$

Theorem 1 and Theorem 3 together show that, in the large sample limit, the OWL methodology and coarsened likelihood philosophy are two sides of the same coin: they both provide approximations of the OKL and, in turn, must approximate each other.

3.2. Asymptotic connection in continuous spaces

Suppose $\mathcal X$ is a Polish space (e.g. $\mathcal X=\mathbb R^d$). Similar to the finite case, we can establish the following asymptotics for the coarsened likelihood for a suitable class of discrepancies $d$, which includes the Wasserstein distance, maximum mean discrepancy with a suitable choice of kernel (Simon-Gabriel and Schölkopf, 2018), and the smoothed TV distance (Definition S9.1 in Supplement S9.3).

Theorem 4.

Suppose $I_{\epsilon_0}(\theta)<\infty$ for some $\epsilon_0>0$ and $d:\mathcal P(\mathcal X)\times\mathcal P(\mathcal X)\to[0,\infty)$ is a pseudometric that is convex in its arguments and continuous with respect to the weak convergence topology on $\mathcal P(\mathcal X)$. If $\epsilon>\epsilon_0$ and $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}P_0$, then

$$-\frac1n\log L_\epsilon(\theta\mid x_{1:n})\ \overset{P}{\longrightarrow}\ \inf_{Q\in\mathcal P(\mathcal X)\,:\,d(Q,P_0)\le\epsilon}\mathrm{KL}(Q\,\|\,P_\theta)\quad\text{as }n\to\infty.$$

Recall that the limiting expression in the above theorem has the same form as the OKL function given in eq. (3). However, in order to establish the connection between the OKL function and the coarsened likelihood, unlike in the finite case, we cannot merely take the discrepancy $d$ in the coarsened likelihood to be the TV distance, since the TV distance between the two empirical distributions in eq. (7) will almost surely be equal to one. Instead, we take $d$ to be a smoothed version of the TV distance, calculated by first convolving the empirical measures with a smooth kernel function $K_h:\mathcal X\times\mathcal X\to[0,\infty)$ indexed by a bandwidth parameter $h>0$. Further details can be found in Supplement S9.3.

4. Simulation Examples

We now demonstrate the OWL methodology in simulated examples with artificially injected corruptions. In each simulation, we considered two methods for choosing the points to corrupt: (i) max-likelihood corruption where we fit a maximum likelihood estimate to the uncorrupted data and select the points with the highest likelihood; and (ii) random corruption where we choose the points to corrupt uniformly at random. We ran each simulation with 50 initial seeds, plotting the mean performance and its 95% confidence band. For clarity and space, we present only the results for max-likelihood corruptions in this section, and we defer the results for randomly-selected corruptions to Supplement S11.

In all comparisons, OWL refers to our methodology with the data-based choice of the corruption fraction $\epsilon$ as described in Section 2.2.3, while OWL ($\epsilon$ known) refers to our methodology with $\epsilon$ equal to the true level of corruption in the data. MLE refers to the standard maximum likelihood estimate. The Pearson residuals baseline is the method from Markatou et al. (1998) based on the Hellinger residual adjustment function: a weighted-likelihood based method aimed at performing minimum Hellinger-distance estimation. For settings where the outcomes were continuous, the Pearson residual density estimate was computed via kernel density estimation, with the bandwidth parameter selected to minimize the empirical Hellinger distance between the final model and the density estimate. For the linear regression baselines, ridge regression was performed with the $L_2$ penalty chosen via cross validation, and Huber regression used the standard tuning constant of 1.345 (Huber, 1964). For the logistic regression baselines, $L_2$-regularized MLE selected the $L_2$ penalty via cross validation. For both regression settings, Random Sample Consensus (RANSAC) utilized the ground-truth corruption fraction.

Gaussian modeling.

Our first simulation fit a multivariate normal distribution to data generated from a $p$-dimensional spherical normal distribution. The ground truth mean was drawn uniformly from $[-10,10]^p$, and the corrupted data points were also drawn uniformly from $[-10,10]^p$. The total number of points was 200, and we measured the dimension-normalized mean-squared error (MSE) of ground-truth mean parameter recovery.

Figure 3a shows the results of the Gaussian simulation. In particular, we see that both the kernelized and the unkernelized versions of OWL had almost identical performance in this example, achieving substantial robustness to corruptions over the sample mean. In light of this similarity of performance, the rest of our simulation examples utilize the much more computationally efficient unkernelized version of OWL. We also observed that the kernelized version of OWL was not particularly sensitive to the choice of bandwidth parameter (Figure S3). We further observe that on this example, the Pearson residuals method outperforms OWL, although the difference in performance is small relative to the gains of all methods over MLE.

Figure 3: Simulation results for max-likelihood corruptions. (a) Mean parameter reconstruction for a multivariate normal model. (b) Test MSE for linear regression. (c) Test accuracy for logistic regression. (d) Mean parameter reconstruction for mixture models. In all figures, the dashed black line denotes median performance of MLE on the full uncorrupted training set, and the shaded regions denote bootstrapped 95% confidence intervals over 50 random seeds.

Linear regression.

We considered a homoscedastic linear model with normally distributed errors and two datasets. The first is a simulated dataset with 10-dimensional i.i.d. standard normal covariates. The ground-truth regression coefficients were drawn independently from $\mathcal N(0,4)$, the intercept was set to 0, and the residual standard deviation was $\sigma=1/4$. The training set consisted of 1,000 data points. For the test set, we drew 1,000 new data points and computed the MSE on the underlying mean response.

The second dataset was taken from a quantitative structure activity relationship (QSAR) dataset (Olier, 2020) compiled by Olier et al. (2018) from the ChEMBL database. It consists of 5012 chemical compounds whose activities were measured on the epidermal growth factor receptor protein erbB1. The activities were recorded as the negative log of the chemical concentration that inhibited 50% of the protein target, i.e. the pIC50 value. Each compound had 1024 associated binary covariates, corresponding to the 1024-dimensional FCFP4 fingerprint representation of the molecule (Rogers and Hahn, 2010). We used PCA to reduce the dimension to 50. For every random seed, we computed a random 80/20 train/test split. The test MSE on this dataset is the standard MSE over the test responses. In both datasets, for each data point selected to be corrupted, we corrupted the response by fitting a least squares solution and observing the residual: if the residual was positive, we set the response to $3v$, where $v$ is the largest absolute value observed in the training set responses, and otherwise we set it to $-3v$.

Figure 3b shows the results of the linear regression simulations for the max-likelihood corruptions. Across both datasets, we see that OWL is competitive with the best of the robust regression methods, whether that method is RANSAC or Huber regression.

Logistic regression.

For the logistic regression setting, we have parameters $b\in\mathbb R$, $w\in\mathbb R^p$ and observations $(x_i,y_i)\in\mathbb R^{p+1}$ assumed to follow the distribution

$$y_i\sim\mathrm{Bernoulli}\left(\frac{1}{1+\exp(-\langle x_i,w\rangle-b)}\right).$$

We considered three datasets. The first is a simulated dataset using the same parameters as the linear regression setting. The training labels are created according to the generative model. For test accuracy, we computed accuracy against the true sign-values, i.e. $\mathbb I\{\langle w,x\rangle\ge0\}$.

The second dataset is taken from the MNIST handwritten digit classification dataset (LeCun et al., 1998). We considered the problem of classifying the digit '1' vs. the digit '8', resulting in a dataset with 14702 data points and 784 covariates, representing pixel intensities. The third dataset is a collection of 5172 documents from the Enron spam classification dataset, preprocessed to contain 5116 covariates, representing word counts (Metsis et al., 2006). For both the MNIST and the Enron spam datasets, we reduced the dimensionality to 10 via PCA and used a random 80/20 train/test split.

Figure 3c shows the results of the logistic regression simulations. Across all datasets, OWL again outperforms the other approaches in the presence of corruption.

Gaussian mixture models.

Recall the standard Gaussian mixture modeling setup: there are a collection of means $\mu_1,\dots,\mu_K\in\mathbb R^p$, standard deviations $\sigma_1,\dots,\sigma_K>0$, and mixing weights $\pi\in\Delta_K$. Data points $x_i\in\mathbb R^p$ are drawn i.i.d. according to

$$X_i\sim\sum_{k=1}^K\pi_k\,\mathcal N(\mu_k,\sigma_k^2 I_p).$$

For our simulations, we generated a synthetic dataset of 1000 points in $\mathbb R^{10}$ by first drawing $K=3$ means $\mu_1,\dots,\mu_K$ whose coordinates are i.i.d. Gaussian with standard deviation 2. We set $\sigma_1=\sigma_2=\sigma_3=1/2$ and $\pi_1=\pi_2=\pi_3=1/3$. To corrupt a data point, we randomly selected half of its coordinates and set them randomly to either a large positive value or a large negative value (here, 5 and −5). As a metric, we measured the average mean squared Euclidean distance between the means of the fitted model and the ground truth model.

For all methods, we used random restarts, choosing the final model based on the method’s criterion: likelihood for MLE, the OKL estimate for OWL, and empirical Hellinger distance for Pearson residuals. We see that OWL remains robust against varying levels of corruptions, whereas both MLE and Pearson residuals perform significantly worse (left panel of Figure 3d).

Bernoulli product mixture models.

Consider the following model for $p$-dimensional binary data: there are a collection of probability vectors $\lambda_1,\dots,\lambda_K\in[0,1]^p$ and mixing weights $\pi\in\Delta_K$. Each data point $x_i$ is drawn i.i.d. according to the process

$$z_i\sim\mathrm{Categorical}(\pi),$$
$$x_{ij}\sim\mathrm{Bernoulli}(\lambda_{z_i,j})\quad\text{for }j=1,\dots,p.$$

For our simulations, we generated a synthetic dataset of 1000 points in $\{0,1\}^{100}$ by first drawing $K=3$ means $\lambda_1,\dots,\lambda_K$ whose coordinates are i.i.d. from a $\mathrm{Beta}(1/10,1/10)$ distribution. The mixing weights were chosen to be uniform over the components. To corrupt a data point, we flipped each zero coordinate with probability 1/2. As a metric, we measured the average $\ell_1$-distance between the $\lambda$ parameters of the fitted model and the ground truth model.

The right panel of Figure 3d shows the results of the Bernoulli mixture model simulations. We see that OWL remains robust against varying levels of corruptions, whereas both MLE and Pearson residuals perform significantly worse.

5. Application to micro-credit study

In this section we apply OWL to data from a micro-credit study for which maximum likelihood estimation (MLE) is shown to be brittle. In Angelucci et al. (2015), the authors worked with one of the largest micro-lenders in Mexico to randomize their credit rollout across 238 geographical regions in the Sonora state. Within 18–34 months after this rollout, the authors surveyed n=16,560 households for various outcome measures.

Following Broderick et al. (2023), here we focus on the Average Intention to Treat effect (AIT) of the rollout on household profits. For $i\in\{1,\dots,n\}$, let $Y_i$ denote the profit of the $i$th household during the last fortnight (in USD PPP units), and let $T_i\in\{0,1\}$ be a binary variable that is one if household $i$ falls in a geographical region where the rollout happened. The AIT on household profits is defined as the coefficient $\beta_1$ in the model:

$$Y_i=\beta_0+\beta_1 T_i+\varepsilon_i,\qquad\varepsilon_i\overset{\text{i.i.d.}}{\sim}\mathcal N(0,\sigma^2),\qquad i\in\{1,\dots,n\}. \qquad (8)$$

In Supplement S13.1, we reproduce the brittleness in estimating $\beta_1$ using the MLE, as demonstrated in Broderick et al. (2023), by removing a handful of observations.

Here we compare OWL to the above data deletion approach. We fit the model (8) to the full data set using 50 $\log_{10}$-spaced $\epsilon$-values between $10^{-4}$ and $10^{-1}$, and used the tuning procedure in Section 2.2.3 to select the value $\epsilon_0=0.005$, where the minimum-OKL versus epsilon plot (Supplement S13, Figure S9) has its most prominent kink. We also calculate the MLE, which corresponds to OWL with $\epsilon=0$. The AIT on household profit estimated by OWL as a function of $\epsilon$ can be seen in the left panel of Figure 4. For values of $\epsilon$ below $\epsilon_0$, the AIT estimates change rapidly as $\epsilon$ changes, while for values of $\epsilon$ above $\epsilon_0$, the AIT estimates are quite stable with changes in $\epsilon$. This is due to OWL automatically down-weighting the outlying observations, as seen in the right panel of Figure 4. We quantify uncertainty in our estimates by using a variant of the bootstrap (Supplement S13.2).

Figure 4: Estimating the Average Intention to Treat (AIT) effect on household profits in the micro-credit study of Angelucci et al. (2015) in the presence of outliers. Left: the AIT estimates using OWL for various values of $\epsilon$, along with 90% vertical bootstrap confidence bands. The vertical line is drawn at the value $\epsilon_0=0.005$ obtained by the tuning procedure in Section 2.2.3, and roughly coincides with the $\epsilon$ beyond which the AIT estimates stabilize and the size of the confidence bands shrinks (see Supplement S13). Right: the weights estimated by the OWL procedure at $\epsilon=\epsilon_0$ down-weight roughly 1% of the households, those having outlying profit values (for visual clarity, we omit a down-weighted household with profit less than −40K USD PPP); see also Figure S10 in Supplement S13.

In summary, the OWL procedure chose to down-weight roughly 1% of the households with extreme profit values and estimated an AIT of $\beta_1=0.6$ USD PPP per fortnight based on the selected value of $\epsilon_0=0.005$. The value $\epsilon=\epsilon_0$, tuned using the procedure in Section 2.2.3, roughly coincides with the point at which the AIT estimates become stable with respect to $\epsilon$, and also with the point at which the 90% confidence bands for the AIT become narrower, both suggesting that OWL with the choice $\epsilon=\epsilon_0$ has identified and down-weighted outliers that may be causing brittleness in estimating the AIT.

6. Discussion

In this paper, we introduced the optimistically weighted likelihood (OWL) methodology, motivated by brittleness issues arising from misspecification in statistical methodology based on standard likelihoods. On the theoretical side, we established the consistency and robustness of our approach and showed its asymptotic connection to the coarsened inference methodology. We also proposed a feasible alternating optimization scheme to implement the methodology and demonstrated its empirical utility on both simulated and real data.

The OWL methodology opens up several interesting future directions. One practical open problem is how to scale to larger datasets. As a weighted likelihood method, OWL requires solving for a weight vector whose dimension is the size of the dataset. While we can solve the resulting convex optimization problem for thousands of data points, the procedure becomes significantly more complicated when the size of the dataset exceeds computer memory. How do we maintain a feasible solution, i.e. one that lies in the intersection of the simplex and some probability ball, when the entire vector cannot fit in memory?

Another practical question is how to apply the OWL approach in more complex models; for example, involving dependent data. This may be relatively straightforward for models in which the likelihood can still be written in product form due to conditional independence given random effects. This would open up its application to nested, longitudinal, spatial and temporal data, as random effects models are routinely used in such settings.

Finally, it will be fruitful to implement the OWL method for a choice of d other than the TV distance. Indeed, as reflected in Assumption 2.1 and our robustness results, the choice of d controls the nature of robustness offered by OWL. An important choice for d may be the Wasserstein distance because it metrizes the weak convergence topology on bounded spaces (Huber and Ronchetti, 2009, Section 2.4) and, as argued in Chapter 1 of Huber and Ronchetti (2009), can capture robustness to both small changes in all observations (e.g. due to rounding or grouping) and large changes in a few observations (e.g. due to contamination or outliers). Indeed, while several of our theoretical results already hold for the Wasserstein metric, appropriate modifications to the OKL estimator and the OWL algorithm will be needed to make this approach feasible.

Supplementary Material


The accompanying supplementary materials contain a real data application to cluster single-cell RNAseq data, along with additional details (figures, lemmas, definitions, etc. with names starting with the letter ‘S’) referenced in the article.

Acknowledgement

The authors acknowledge funding from grants N00014–21-1–2510-P00001 from the Office of Naval Research (ONR) and R01 ES027498, R01 ES035625, U54 CA274492–01 and R37CA271186 from the National Institutes of Health, as well as helpful discussions with Sayan Mukherjee and Amarjit Budhiraja. The authors also wish to acknowledge the anonymous referees for pointing us to relevant literature.

Footnotes

The authors report that there are no competing interests to declare.

See https://github.com/cjtosh/owl for code to reproduce all analyses.

References

  1. Shun’ichi Amari. Information Geometry and its Applications. Springer, 2016. [Google Scholar]
  2. Angelucci Manuela, Karlan Dean, and Zinman Jonathan. Microcredit impacts: Evidence from a randomized microcredit program placement experiment by Compartamos Banco. American Economic Journal: Applied Economics, 7(1):151–182, 2015. [Google Scholar]
  3. Medina Marco Avella and Ronchetti Elvezio. Robust statistics: A selective overview and new directions. Wiley Interdisciplinary Reviews: Computational Statistics, 7(6):372–393, 2015. [Google Scholar]
4. Avella-Medina Marco and Ronchetti Elvezio. Robust and consistent variable selection in high-dimensional generalized linear models. Biometrika, 105(1):31–44, 2018.
5. Basu Ayanendranath and Lindsay Bruce G. Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics, 46:683–705, 1994.
6. Basu Ayanendranath, Harris Ian R, Hjort Nils L, and Jones MC. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
7. Beath Ken J. A mixture-based approach to robust analysis of generalised linear models. Journal of Applied Statistics, 45(12):2256–2268, 2018.
8. Bernton Espen, Jacob Pierre E, Gerber Mathieu, and Robert Christian P. On parameter estimation with the Wasserstein distance. Information and Inference: A Journal of the IMA, 8(4):657–676, 2019.
9. Bondell Howard D and Stefanski Leonard A. Efficient robust regression via two-stage generalized empirical likelihood. Journal of the American Statistical Association, 108(502):644–655, 2013.
10. Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, and Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
11. Bradic Jelena. Robustness in sparse high-dimensional linear models: Relative efficiency and robust approximate message passing. Electronic Journal of Statistics, 10(2):3894–3944, 2016.
12. Bradic Jelena, Fan Jianqing, and Wang Weiwei. Penalized composite quasi-likelihood for ultrahigh dimensional variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):325–349, 2011.
13. Briol Francois-Xavier, Barp Alessandro, Duncan Andrew B, and Girolami Mark. Statistical inference for generative models with maximum mean discrepancy. arXiv preprint arXiv:1906.05944, 2019.
14. Broderick Tamara, Giordano Ryan, and Meager Rachael. An automatic finite-sample robustness metric: When can dropping a little data make a big difference? arXiv preprint arXiv:2011.14999, 2023.
15. Cai Diana, Campbell Trevor, and Broderick Tamara. Finite mixture models do not reliably learn the number of components. In International Conference on Machine Learning, pages 1158–1169, 2021.
16. Chandra Noirrit Kiran, Canale Antonio, and Dunson David B. Escaping the curse of dimensionality in Bayesian model-based clustering. Journal of Machine Learning Research, 24(144):1–42, 2023.
17. Chen Mengjie, Gao Chao, and Ren Zhao. A general decision theory for Huber’s ϵ-contamination model. Electronic Journal of Statistics, 10:3752–3774, 2016.
18. Cherief-Abdellatif Badr-Eddine and Alquier Pierre. MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. In Proceedings of the 2nd Symposium on Advances in Approximate Bayesian Inference, pages 1–21, 2020.
19. Choi Edwin, Hall Peter, and Presnell Brett. Rendering parametric procedures more robust by empirically tilting the model. Biometrika, 87(2):453–465, 2000.
20. Claeskens Gerda and Hjort Nils Lid. Model Selection and Model Averaging. Cambridge University Press, 2008.
21. Cover Thomas M and Thomas Joy A. Elements of Information Theory. John Wiley & Sons, second edition, 2006.
22. Csiszár Imre. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975.
23. Dembo Amir and Zeitouni Ofer. Large Deviations Techniques and Applications. Springer Berlin Heidelberg, 2010.
24. Donoho David L and Liu Richard C. The “automatic” robustness of minimum distance functionals. The Annals of Statistics, 16(2):552–586, 1988.
25. Dupuis Debbie J and Morgenthaler Stephan. Robust weighted likelihood estimators with an application to bivariate extreme value problems. Canadian Journal of Statistics, 30(1):17–36, 2002.
26. Field C and Smith B. Robust estimation: A weighted maximum likelihood approach. International Statistical Review/Revue Internationale de Statistique, 62(3):405–424, 1994.
27. Ghosh Abhik and Basu Ayanendranath. A new family of divergences originating from model adequacy tests and application to robust statistical inference. IEEE Transactions on Information Theory, 64(8):5581–5591, 2018.
28. Greco Luca and Agostinelli Claudio. Weighted likelihood mixture modeling and model-based clustering. Statistics and Computing, 30(2):255–277, 2020.
29. Hampel Frank R, Ronchetti Elvezio M, Rousseeuw Peter J, and Stahel Werner A. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 2005.
30. Huber Peter J. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
31. Huber Peter J and Ronchetti Elvezio M. Robust Statistics. Wiley, 2009.
32. Huber-Carol Catherine, Balakrishnan Narayanaswamy, Nikulin Mikhail, and Mesbah Mounir. Goodness-of-Fit Tests and Model Validity. Springer Science & Business Media, 2012.
33. Hüllermeier Eyke. Learning from imprecise and fuzzy observations: Data disambiguation through generalized loss minimization. International Journal of Approximate Reasoning, 55(7):1519–1534, 2014.
34. Hüllermeier Eyke and Cheng Weiwei. Superset learning based on generalized loss minimization. In Proceedings of the 2015 European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Part II, pages 260–275, 2015.
35. Hunter David R and Lange Kenneth. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.
36. Le Thi Hoai An and Pham Dinh Tao. DC programming and DCA: thirty years of developments. Mathematical Programming, 169(1):5–68, 2018.
37. LeCun Yann, Bottou Léon, Bengio Yoshua, and Haffner Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
38. Lienen Julian and Hüllermeier Eyke. Instance weighting through data imprecisiation. International Journal of Approximate Reasoning, 134:1–14, 2021.
39. Lindsay Bruce G. Efficiency versus robustness: The case for minimum Hellinger distance and related methods. The Annals of Statistics, 22(2):1081–1114, 1994.
40. Liu Allen and Moitra Ankur. Robustly learning general mixtures of Gaussians. Journal of the ACM, 70(3):1–53, 2023.
41. Liu Jiawei and Lindsay Bruce G. Building and using semiparametric tolerance regions for parametric multinomial models. The Annals of Statistics, 37(6A):3644–3659, 2009.
42. Mancini Loriano, Ronchetti Elvezio, and Trojani Fabio. Optimal conditionally unbiased bounded-influence inference in dynamic location and scale models. Journal of the American Statistical Association, 100(470):628–641, 2005.
43. Markatou Marianthi. Mixture models, robustness, and the weighted likelihood methodology. Biometrics, 56(2):483–486, 2000.
44. Markatou Marianthi, Basu Ayanendranath, and Lindsay Bruce G. Weighted likelihood equations with bootstrap root search. Journal of the American Statistical Association, 93(442):740–750, 1998.
45. Maronna Ricardo A, Martin R Douglas, Yohai Victor J, and Salibián-Barrera Matías. Robust Statistics: Theory and Methods (with R). John Wiley & Sons, 2019.
46. Metsis Vangelis, Androutsopoulos Ion, and Paliouras Georgios. Spam filtering with naive Bayes – which naive Bayes? In CEAS 2006: The Third Conference on Email and Anti-Spam, Mountain View, California, 2006.
47. Miller Jeffrey W and Dunson David B. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.
48. Olier Ivan. QSAR datasets – Meta-QSAR, V1. Mendeley Data, 2020. doi:10.17632/spwgrcnjdg.1.
49. Olier Ivan, Sadawi Noureddin, Bickerton G Richard, Vanschoren Joaquin, Grosan Crina, Soldatova Larisa, and King Ross D. Meta-QSAR: A large-scale application of meta-learning to drug design and discovery. Machine Learning, 107(1):285–311, 2018.
50. Parikh Neal and Boyd Stephen. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
51. Rogers David and Hahn Mathew. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.
52. Rousseeuw Peter J and Leroy Annick M. Robust Regression and Outlier Detection. John Wiley & Sons, 2005.
53. Royall Richard. Statistical Evidence: A Likelihood Paradigm. Routledge, 2017.
54. Samdani Rajhans, Chang Ming-Wei, and Roth Dan. Unified expectation maximization. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 688–698, 2012.
55. Satopaa Ville, Albrecht Jeannie, Irwin David, and Raghavan Barath. Finding a “kneedle” in a haystack: Detecting knee points in system behavior. In 31st International Conference on Distributed Computing Systems Workshops, pages 166–171, 2011.
56. She Yiyuan and Owen Art B. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494):626–639, 2011.
57. Simon-Gabriel Carl-Johann and Schölkopf Bernhard. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. The Journal of Machine Learning Research, 19(1):1708–1736, 2018.
58. Taplin Ross H. Robust likelihood calculation for time series. Journal of the Royal Statistical Society: Series B (Methodological), 55(4):829–836, 1993.
59. Tsou Tsung-Shan and Royall Richard M. Robust likelihoods. Journal of the American Statistical Association, 90(429):316–320, 1995.
60. Parr William C. Minimum distance estimation: a bibliography. Communications in Statistics – Theory and Methods, 10(12):1205–1224, 1981.
61. Windham Michael P. Robustifying model fitting. Journal of the Royal Statistical Society: Series B (Methodological), 57(3):599–609, 1995.
62. Wolfowitz Jacob. The minimum distance method. The Annals of Mathematical Statistics, 28(1):75–88, 1957.
63. Yatracos Yannis G. Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. The Annals of Statistics, 13(2):768–774, 1985.
64. Zellner Arnold. Optimal information processing and Bayes’s theorem. The American Statistician, 42(4):278–280, 1988.

Supplementary Materials

Supp 1
