
Robustifying Likelihoods by Optimistically Re-weighting Data

Miheer Dewaskar, Christopher Tosh, Jeremias Knoblauch, and David B. Dunson

Abstract

Likelihood-based inferences have been remarkably successful in wide-spanning application areas. However, even after due diligence in selecting a good model for the data at hand, there is inevitably some amount of model misspecification: outliers, data contamination or inappropriate parametric assumptions such as Gaussianity mean that most models are at best rough approximations of reality. A significant practical concern is that for certain inferences, even small amounts of model misspecification may have a substantial impact; a problem we refer to as brittleness. This article attempts to address the brittleness problem in likelihood-based inferences by choosing the most model friendly data generating process in a distance-based neighborhood of the empirical measure. This leads to a new Optimistically Weighted Likelihood (OWL), which robustifies the original likelihood by formally accounting for a small amount of model misspecification. Focusing on total variation (TV) neighborhoods, we study theoretical properties, develop estimation algorithms and illustrate the methodology in applications to mixture models and regression.

Keywords: Coarsened Bayes, Data contamination, Mixture models, Model misspecification, Outliers, Robust inference, Total variation distance

1. Introduction

When the likelihood is correctly specified, there is arguably no substitute for likelihood-based statistical inferences (see e.g. Zellner, 1988; Royall, 2017). However, if the likelihood is misspecified, inferences may be flawed (Huber, 1964; Tsou and Royall, 1995; Huber and Ronchetti, 2009; Hampel et al., 2005; Miller and Dunson, 2019). This has motivated methods for model comparison and goodness-of-fit assessment (see e.g. Huber-Carol et al., 2012; Claeskens and Hjort, 2008). A key step is to verify that the assumed likelihood is consistent with the data at hand. Yet, even with substantial care in model assessment, some amount of model misspecification is inevitable.

Unfortunately, even slight model misspecification can have dire consequences in certain settings; a problem we refer to as brittleness. Brittleness can occur in various applications, including in high dimensional problems (e.g. Bradic et al., 2011; Bradic, 2016), in the presence of outliers and contaminating distributions (e.g. Huber, 1964; Huber and Ronchetti, 2009; Hampel et al., 2005), or for mixture models (e.g. Markatou, 2000; Chandra et al., 2023; Liu and Moitra, 2023; Cai et al., 2021). This article is focused on robustifying likelihoods to avert such brittleness.

Figure 1 illustrates brittleness and our proposed solution in the setting of model-based clustering with kernel mixture models. Here, most of the data are perfectly modeled by a mixture of two well-separated Gaussians. However, a small fraction of the data have been corrupted, and are instead drawn uniformly from an interval between the two modes. As the left panel demonstrates, the maximum likelihood estimate (MLE) is sensitive to the corrupted data. Our proposal is to instead maximize an optimistically weighted likelihood (OWL), which slightly perturbs the data to be more consistent with the assumed model. The left panel shows that maximizing the OWL reduces brittleness, while the right panel shows the corresponding optimistically re-weighted data.

Figure 1: An example illustrating failures of maximum likelihood estimation (MLE) for misspecified models. The data consist of $n=1000$ observations randomly generated from an equally-weighted mixture of two Gaussians with means $-2.5$ and $2.5$ and standard deviation $1/\sqrt{4}$; 5% of the data were corrupted and drawn i.i.d. from $\mathrm{Uniform}(-1,1)$. Left: The green dashed line denotes the MLE found using expectation maximization. The orange dashed line denotes the solution found by maximizing our optimistically weighted likelihood based on $\varepsilon_0=0.05$. Right: The optimistically re-weighted data that were used in place of the observed data to compute the MLE.

Although in principle one can address the issue of misspecification by designing more flexible models, e.g. via mixture or nonparametric models (Taplin, 1993; Beath, 2018), such flexibility often comes at the expense of interpretability, identifiability, and efficiency, both computational and statistical. Moreover, there is always the risk that one’s model is not flexible enough and is therefore subject to the brittleness concerns outlined above. These issues motivate the core approach of the field of robust estimation: rather than designing a complicated model that fits all aspects of the data generating process, one instead fits a model that is only approximately correct, albeit with a different methodology. At the heart of many robust estimation methods is the concept of the weighted likelihood.

For a likelihood function $f$, the weighted likelihood replaces $L(x_{1:n}\mid\theta)=\prod_{i=1}^{n} f(x_i\mid\theta)$ by the weighted counterpart

$$L_w(x_{1:n}\mid\theta)=\prod_{i=1}^{n} f(x_i\mid\theta)^{w_i} \qquad (1)$$

for a collection of weights $w=(w_i)_{i=1}^n$ that typically depend on $\theta$ and the data. Estimation of $\theta$ and $w$ often proceeds via iteratively re-weighted maximum likelihood estimation (IRW-MLE) by repeating the following steps until convergence: a) maximize the likelihood $\theta\mapsto L_w(x_{1:n}\mid\theta)$ based on the current $w$ and, b) update $w$ using the current $\theta$ (Avella Medina and Ronchetti, 2015; Maronna et al., 2019). A variety of classical robust methods (Huber and Ronchetti, 2009; Hampel et al., 2005) fall in this framework, providing weights measuring the influence of each observation (e.g., Figure 8 in Mancini et al., 2005). Typical practice assumes $w_i=w(x_i,\theta)$ with $w(\cdot)$ chosen so that $w_i\approx0$ for observations having very low likelihood under the presumed model; e.g., $w(x,\theta)$ is a function of the density $f(x\mid\theta)$ (Windham, 1995) or its cumulative distribution function (Field and Smith, 1994; Dupuis and Morgenthaler, 2002). See Supplement S1 for more discussion of these methods.
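To fix ideas, below is a minimal sketch of the IRW-MLE template in Python. The callables `weighted_mle` and `weight_fn` are hypothetical placeholders for a concrete model and weighting scheme, not functions from any particular package.

```python
import numpy as np

def irw_mle(x, weighted_mle, weight_fn, theta0, iters=100, tol=1e-8):
    """Iteratively re-weighted MLE template: alternate (a) weighted MLE
    and (b) weight updates until the parameter estimate stabilizes.

    weighted_mle(x, w)  -> theta maximizing the w-weighted likelihood
    weight_fn(x, theta) -> weights w_i = w(x_i, theta)
    """
    theta = np.asarray(theta0, dtype=float)
    w = weight_fn(x, theta)
    for _ in range(iters):
        theta_new = np.asarray(weighted_mle(x, w), dtype=float)  # step (a)
        w = weight_fn(x, theta_new)                              # step (b)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new, w
        theta = theta_new
    return theta, w
```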

Typical robust estimation methods focus on sensitivity to outliers, while our interest is in broader types of misspecification. In Figure S1 from Supplement S2, we illustrate how misspecification in the shape of the kernel can lead to errors in selecting the number of components in mixture models. We aim to develop a general weighted likelihood methodology that allows for model misspecification beyond outliers. We focus on robustness to a small amount of misspecification with respect to an appropriate metric d in the space of probability distributions.

A well-studied approach to achieving robustness is minimum distance estimation (Wolfowitz, 1957; William, 1981), which attempts to find a model that minimizes the distance to the generating distribution with respect to the metric $d$. While this method can be shown to be robust to misspecification in $d$ (Donoho and Liu, 1988), it is often challenging to compute (e.g. Yatracos, 1985) beyond suitable $\phi$-divergences (Lindsay, 1994). In continuous spaces, when $d$ is a $\phi$-divergence with smooth $\phi$, minimum distance estimation can be computed using weighted likelihood equations with $w(x,\theta)=\{A(\delta(x\mid\theta))-A(-1)\}/\{\delta(x\mid\theta)+1\}$, where $A$ is the residual adjustment function corresponding to $\phi$ and $\delta(x\mid\theta)=\hat p(x)/f(x\mid\theta)-1$ is the Pearson residual based on a density estimate $\hat p$ of the observed data (Basu and Lindsay, 1994; Basu et al., 1998; Markatou et al., 1998; Greco and Agostinelli, 2020). Unfortunately, most $\phi$-divergences outside the Hellinger and total-variation (TV) cases are not metrics, making interpretation difficult. Moreover, the important case of TV distance is excluded by this work since the corresponding $\phi$ is not smooth. Recent work (Cherief-Abdellatif and Alquier, 2020; Briol et al., 2019; Bernton et al., 2019) has provided tools for computing minimum distance estimators when $d$ is the maximum mean discrepancy (MMD) or Wasserstein distance; however, this approach relies on generative models and does not make use of the likelihood.

Here, we develop a weighted likelihood methodology that can perform approximate minimum distance estimation, providing a recipe for robustifying Bayesian and frequentist approaches by re-weighting the likelihood (see Supplement S2). For non-negative weights $w=(w_i)_{i=1}^n$ that sum to $n$, our key observation is that the weighted likelihood (1) is an ordinary (i.e. un-weighted) likelihood for perturbed data represented by the weighted empirical distribution $\hat Q_w=n^{-1}\sum_{i=1}^n w_i\delta_{x_i}$, where $\delta_x$ denotes the Dirac measure at $x$. Thus IRW-MLE simultaneously finds: i) an optimal data perturbation $\hat Q_{w^*}$, and ii) the ordinary maximum likelihood estimate $\hat\theta$ based on the perturbed data $\hat Q_{w^*}$. For approximate minimum distance estimation, we require the perturbations $\hat Q_w$ to lie in a $(d,\epsilon)$ neighborhood around the empirical distribution of the observed data $\hat P=n^{-1}\sum_{i=1}^n\delta_{x_i}$ for a user-specified choice of $\epsilon>0$.

Our optimal weights $w^*$ are determined using the principle of optimism: in the absence of information other than the constraint $d(\hat Q_w,\hat P)\le\epsilon$, we choose the perturbation $\hat Q_w$ closest to the model family in the Kullback-Leibler (KL) sense. If $\{P_\theta\}_{\theta\in\Theta}$ denotes our model family and $\mathrm{KL}$ is an estimator for the KL divergence, we adapt the IRW-MLE procedure (Algorithm 1) to solve the following Optimistic Kullback-Leibler (OKL) minimization:

$$\min_{\theta\in\Theta}\ \min_{w\,:\,d(\hat Q_w,\hat P)\le\epsilon}\ \mathrm{KL}(\hat Q_w\,\|\,P_\theta). \qquad (2)$$

When $\epsilon=0$, OKL minimization is equivalent to the usual MLE, and the choice $\epsilon>0$ reflects the degree of misspecification. If $P_0$ is the true distribution generating $x_1,\dots,x_n$, and $\epsilon$ is greater than the degree of misspecification $\varepsilon_0\equiv\min_{\theta\in\Theta}d(P_0,P_\theta)$, then any global minimizer $\hat\theta$ of (2) can be shown to asymptotically lie in the set $\{\theta : d(P_0,P_\theta)\le\epsilon\}$ (Proposition 1). Thus when $\epsilon$ is tuned to approximate $\varepsilon_0$ (Section 2.2.3), $\hat\theta$ will be a minimum distance estimator.

In this work, we show that the OKL is intimately related to the concept of the coarsened likelihood (Miller and Dunson, 2019) – a genuine likelihood that acknowledges small misspecification in terms of d. While even evaluating the coarsened likelihood requires the computation of a high-dimensional integral, we use techniques from large deviation theory (Dembo and Zeitouni, 2010) to show that the coarsened likelihood is asymptotically equivalent to the OKL. Thus, the OWL methodology can be seen as a bridge, connecting minimum distance estimation and weighted likelihood methods on one side and the coarsened likelihood approach on the other.

When $d(Q,P)=\int\log\frac{dQ}{dP}\,dQ$ is the KL divergence, the Lagrangian formulation of our OKL minimization problem corresponds to minimizing a Cressie-Read power-divergence between the observations and the model family (Ghosh and Basu, 2018). Since different computational strategies may be required across metrics, we focus on the total variation (TV) distance for $d$, which is easy to interpret, is a proper metric, and is robust to outliers and contamination (Donoho and Liu, 1988). Further, TV-based estimators have been shown to attain minimax estimation rates under Huber's contamination model (Chen et al., 2016).

Coarsened likelihoods (Miller and Dunson, 2019) in general, and the principle of optimism used here in particular, are related to statistical learning from fuzzy or imprecise data (Hüllermeier and Cheng, 2015; Hüllermeier, 2014); the key idea is to treat observed data as imprecise to facilitate robust learning (Lienen and Hüllermeier, 2021). Among various approaches to robust inference in regression (e.g. She and Owen, 2011; Rousseeuw and Leroy, 2005; Avella-Medina and Ronchetti, 2018), Bondell and Stefanski (2013) implement weighted least squares with weights learned within an empirical likelihood framework subject to a target weighted squared residual error. While OKL minimization (2) finds weights in a fashion similar to this and other empirical likelihood approaches (e.g. Choi et al., 2000), our optimization problem arises from large deviation formulas and leads to weights that are constrained to lie within a small total-variation ($\ell_1$) distance of the uniform weight vector.

The remainder of this paper is structured as follows: Section 2 presents the OWL methodology and the associated alternating optimization scheme, and discusses some formal robustness guarantees. Section 3 demonstrates the asymptotic connection between OWL and coarsened inference. Section 4 presents a suite of simulation experiments for the OWL methodology in both regression and clustering tasks. Section 5 uses OWL to infer the average intention-to-treat effect in a micro-credit study (Angelucci et al., 2015), whose inference using ordinary least squares (OLS) was shown to be brittle to the removal of a handful of observations (Broderick et al., 2023). Finally, we provide a robust clustering application for single-cell RNAseq data in Supplement S12.

2. Optimistically Weighted Likelihoods

As Figure 1 shows, maximum likelihood estimation can be brittle when the data generating distribution $P_0$ lies outside of, albeit suitably close to, the model family $\{P_\theta\}_{\theta\in\Theta}$. To mitigate brittleness under misspecification, we propose an Optimistically Weighted Likelihood (OWL) approach that iterates between (1) an optimistic (or model-aligned) re-weighting of the observed data points and (2) updating the parameter estimate by maximizing a weighted likelihood based on the current data weights.

In Section 2.1, we study this parameter inference methodology at the population level, where we formally allow $P_0$ to be misspecified as long as it is close to the model family in the total-variation (TV) distance. Here we introduce the population level Optimistic Kullback-Leibler (OKL) function with parameter $\epsilon\in[0,1]$ (Definition 2.1), and show that its minimizer will be a parameter $\theta$ for which $P_\theta$ is $\epsilon$-close to $P_0$ in TV distance. We provide formal robustness guarantees for these minimizers by analyzing a bias-distortion curve.

Motivated by this population analysis, in Section 2.2 we derive the OWL-based parameter estimation methodology when only samples $x_1,\dots,x_n$ from $P_0$ are available.

2.1. Population level Optimistic Kullback Leibler minimization

Let $\mathcal P(\mathcal X)$ denote the space of probability distributions on the data space $\mathcal X$. In principle, our methodology can accommodate misspecification in terms of a variety of probability metrics (e.g. MMD or Wasserstein) on $\mathcal P(\mathcal X)$, but here we mainly focus on the total variation (TV) distance for concreteness and interpretability. Let $d_{TV}(P,Q)=\sup_{A\subseteq\mathcal X}|P(A)-Q(A)|$ denote the TV metric between two probability distributions $P,Q\in\mathcal P(\mathcal X)$, where the supremum ranges over all measurable subsets $A$ of $\mathcal X$. Given a model family $\{P_\theta\}_{\theta\in\Theta}\subseteq\mathcal P(\mathcal X)$, we assume that the data generating distribution $P_0\in\mathcal P(\mathcal X)$ for the data population in question satisfies $d_{TV}(P_0,P_{\theta^*})\le\epsilon$ for a known value $\epsilon\ge0$ and some unknown $\theta^*\in\Theta$. In other words, we make the following assumption (satisfied under Huber's $\epsilon$-contamination model (Huber, 1964) given by $P_0=(1-\epsilon)P_{\theta^*}+\epsilon C$ for any $C\in\mathcal P(\mathcal X)$).

Assumption 2.1.

Given $\epsilon\ge0$ and the true data distribution $P_0$, the set of parameters $\Theta_I=\{\theta\in\Theta : d_{TV}(P_0,P_\theta)\le\epsilon\}$ is non-empty.

In the terminology of Liu and Lindsay (2009), Assumption 2.1 requires that our model be adequate for the data at level $\epsilon$ in terms of the TV neighborhood. While Liu and Lindsay (2009) develop a likelihood-ratio test for this assumption, here we focus on estimating parameters by maximizing a natural likelihood (Section 3), leading to a different optimization problem.

In general, under Assumption 2.1, it may only be possible to identify the set $\Theta_I$, rather than any particular $\theta^*\in\Theta$. Although such indeterminacy may be inherent, it is practically insignificant whenever $\epsilon$ is sufficiently small so that the distinction between two elements of $\Theta_I$ is irrelevant (Huber, 1964). In line with this insight, the goal throughout the rest of the paper will be to identify some parameter in $\Theta_I$. Indeed, selecting $\epsilon$ to approximate $\min_\theta d_{TV}(P_0,P_\theta)$ (see Section 2.2.3) will recover a minimizer of the map $\theta\mapsto d_{TV}(P_0,P_\theta)$.

At the population level, maximum likelihood parameter estimation amounts to minimizing the Kullback-Leibler (KL) function $\theta\mapsto\mathrm{KL}(P_0\,\|\,P_\theta)$ on the parameter space $\Theta$. Even under small amounts of misspecification, KL minimizers are very brittle, since any minimizer of the KL function must place sufficient probability mass wherever $P_0$ does, including on outliers. In contrast, TV distance is far less sensitive to outliers and slight shifts in the shape of the distribution. Hence one may minimize $\theta\mapsto d_{TV}(P_0,P_\theta)$ as a robust alternative, particularly under Assumption 2.1. However, direct minimization of TV distance over the parameter space $\Theta$ is difficult to implement in practice due to the lack of suitable optimization primitives (e.g. maximum likelihood estimators) and the non-convex and non-smooth nature of the optimization problem (see e.g. Yatracos, 1985).

Instead, we modify the KL loss by minimizing its first argument over the ϵ-neighborhood of P0 in the TV distance. The resulting function, which we term the Optimistic Kullback Leibler (OKL), is defined as follows.

Definition 2.1.

(Optimistic Kullback-Leibler) Given $P_0$ and $\epsilon>0$, the OKL function $I_\epsilon:\Theta\to[0,\infty]$ is defined as:

$$I_\epsilon(\theta)=\inf_{Q\in\mathcal B_\epsilon(P_0)}\mathrm{KL}(Q\,\|\,P_\theta), \qquad (3)$$

where $\mathcal B_\epsilon(P_0)=\{Q\in\mathcal P(\mathcal X) : d_{TV}(P_0,Q)\le\epsilon\}$ is the TV ball of radius $\epsilon$ around $P_0$. If $I_\epsilon(\theta)<\infty$, the underlying optimization over $\mathcal B_\epsilon(P_0)$ has a unique minimizer $Q_\theta$ called the I-projection (Csiszár, 1975).

The OKL function $\theta\mapsto I_\epsilon(\theta)$ measures the fit of a model $P_\theta$ to the data $P_0$, allowing for a degree $\epsilon$ of data re-interpretation in the TV distance before assessing model fit. More precisely, since $I_\epsilon(\theta)=\mathrm{KL}(Q_\theta\,\|\,P_\theta)$, the OKL is equal to the KL divergence between the optimistic (i.e. most model-aligned) distribution $Q_\theta\in\mathcal B_\epsilon(P_0)$ in the neighborhood of $P_0$ and the model $P_\theta$. Here, $\epsilon\ge0$ regulates the permitted degree of re-interpreting the data by controlling the neighborhood size.

The OKL function can be used to find a parameter from the set $\Theta_I$. Indeed, as illustrated in Figure 2 (right), the minimum value of zero for the OKL function is attained exactly on the set $\Theta_I$, since $Q_\theta=P_\theta$ if and only if $\theta\in\Theta_I$. However, the OKL can be non-convex as a function of $\theta$, so calculating the global minimizer of OKL may not be straightforward. Fortunately, the OKL lends itself to a feasible alternating optimization scheme (Figure 2; left) that is guaranteed to decrease the OKL objective or reach a saddle point under regularity conditions.

Figure 2: Population level description of OWL (left) and the OKL function (right). The point $P_\theta$ is labeled using $\theta\in\Theta$. The inference problem is to find a point in $\Theta_I\subseteq\Theta$ where the model family intersects the $\epsilon$-neighborhood $\mathcal B_\epsilon$ of the data distribution $P_0$. The set $\Theta_I$ is the set of minimizers of the OKL function $\theta\mapsto I_\epsilon(\theta)$ (right). Starting from an initial point $\theta_1\in\Theta$, the OWL procedure finds a saddle point of the OKL function by iterating $\theta_{t+1}=\arg\min_{\theta\in\Theta}\mathrm{KL}(Q_{\theta_t}\,\|\,P_\theta)$ for $t=1,2,\dots$ until convergence, where $Q_\theta=\arg\min_{Q\in\mathcal B_\epsilon}\mathrm{KL}(Q\,\|\,P_\theta)$ denotes the information projection (Amari, 2016) of the model $P_\theta$ onto the total variation (TV) neighborhood $\mathcal B_\epsilon$ of $P_0$. Iterations alternate between I-projection and weighted likelihood estimation steps, illustrated via solid and dashed lines.

We now summarize formal robustness guarantees for the global minimizers of the OKL function against perturbations of $P_0$ in small TV-neighborhoods around the model family; details can be found in Supplement S4. We call the mapping from $P_0$ to the global minimizers of the OKL function the OWL functional $\mathcal T_\epsilon:\mathcal P(\mathcal X)\to2^\Theta$, defined as $\mathcal T_\epsilon(P_0)\equiv\arg\min_{\theta\in\Theta}I_\epsilon(\theta)$, where $2^\Theta$ denotes the power-set of $\Theta$. Motivated by the robustness theory for minimum distance functionals (Donoho and Liu, 1988), when $P_0=P_{\theta^*}$ lies on the model family, we bound the growth of a bias-distortion (BD) curve $\delta\mapsto\mathrm{BD}_\epsilon(\delta\mid P_0)$ defined as $\mathrm{BD}_\epsilon(\delta\mid P_0)\equiv\sup\{\mathrm{Haus}(\mathcal T_\epsilon(P),\mathcal T_\epsilon(P_0)) : P\in\mathcal P(\mathcal X),\ d_{TV}(P,P_0)\le\delta\}$, where $\mathrm{Haus}$ denotes the Hausdorff distance on $2^\Theta$. Indeed, classically important robustness indicators like sensitivity and qualitative robustness are properties of the growth of a suitable BD curve when $\delta$ is close to zero, and the breakdown point is the point at which the BD curve has a vertical asymptote (Donoho and Liu, 1988, Figure 1). Given this setup, under suitable regularity conditions, our two main results in Supplement S4 show that the OWL functional is qualitatively robust with finite sensitivity (corollary of Lemma S8) and has a breakdown point of at least $\epsilon$ whenever $\epsilon$ is less than half of the best breakdown point for the model family among all Fisher consistent functionals (Lemma S8).

2.2. Optimistically Weighted Likelihood (OWL) estimation

Here we extend the population level methodology from Section 2.1 to handle the practical case when samples $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}P_0$ are available, providing a computable approximation to the alternating procedure in Figure 2. In particular, the I-projection step is now approximated by a convex optimization problem over weight vectors constrained to lie within the intersection of the $n$-dimensional probability simplex and the $\ell_1$ ball of radius $2\epsilon$ around the uniform probability vector; these optimal weights can be interpreted as an optimistic re-weighting of the original data points $x_1,\dots,x_n$ to match the current model estimate. Given these weights, a new parameter estimate is then found by maximizing the weighted likelihood. We call this the Optimistically Weighted Likelihood (OWL) method.

2.2.1. Approximating OKL by a finite dimensional optimization problem

Given observed data $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}P_0$, we start by approximating the OKL function $I_\epsilon(\theta)$ in terms of a finite dimensional optimization problem. Henceforth, we assume that the model family $\{P_\theta\}_{\theta\in\Theta}$ and the measure $P_0$ have densities $\{p_\theta\}_{\theta\in\Theta}$ and $p_0$ with respect to a common measure $\lambda$. We will focus on two cases of interest: when $\mathcal X$ is a discrete space and $\lambda$ is the counting measure, and when $\mathcal X=\mathbb R^d$ and $\lambda$ is the Lebesgue measure.

When $\mathcal X$ is discrete, we look to solve the optimization problem in eq. (3) over data re-weightings $Q=\sum_{i=1}^n w_i\delta_{x_i}$ as the weight vector $w=(w_1,\dots,w_n)$ varies over the $n$-dimensional probability simplex $\Delta_n$ and satisfies the TV constraint $\frac12\|w-o\|_1\le\epsilon$, where $o=(1/n,\dots,1/n)\in\Delta_n$. Formally, our finite space OKL approximation is given by

$$\hat I_{\epsilon,\mathrm{fin}}(\theta)=\inf_{w\in\Delta_n\,:\,\frac12\|w-o\|_1\le\epsilon}\ \sum_{i=1}^n w_i\log\frac{n\,w_i\,\hat p_{\mathrm{fin}}(x_i)}{p_\theta(x_i)}, \qquad (4)$$

where $\hat p_{\mathrm{fin}}(y)=|\{i\in[n] : x_i=y\}|/n$ is the histogram estimator for $p_0$ based on the observed data. An application of the log sum inequality (Cover and Thomas, 2006, Theorem 2.7.1) shows that the weights that solve eq. (4) have the appealing and natural property that $w_i=w_j$ whenever $x_i=x_j$. Moreover, when the support of $p_0$ is finite and contains the support of $p_\theta$, $\hat I_{\epsilon,\mathrm{fin}}(\theta)$ converges to $I_\epsilon(\theta)$ at rate $n^{-1/2}$, as demonstrated by the following result (Supplement S5). A modified proof can establish the weaker result of consistency of $\hat I_{\epsilon,\mathrm{fin}}(\theta)$ even when the support of $p_0$ is (countably) infinite, provided $\mathrm{supp}(p_\theta)\subseteq\mathrm{supp}(p_0)$. The latter condition is mild and holds whenever the model family has a fixed support (e.g. an exponential family) and $p_0$ is a Huber contamination of the model.

Theorem 1.

Suppose that $I_{\epsilon_0}(\theta)<\infty$ for some $\epsilon_0>0$, and pick $\delta>0$ and $\epsilon>\epsilon_0$. If $\mathrm{supp}(p_\theta)\subseteq\mathrm{supp}(p_0)$, $\mathrm{supp}(p_0)$ is finite, and $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}p_0$, then with probability at least $1-\delta$,

$$|I_\epsilon(\theta)-\hat I_{\epsilon,\mathrm{fin}}(\theta)|\le O\!\left(\frac{|\mathrm{supp}(p_0)|}{\epsilon-\epsilon_0}\sqrt{\frac1n\log\frac{|\mathrm{supp}(p_0)|}{\delta}}\right),$$

where $\mathrm{supp}(p)=\{x\in\mathcal X : p(x)>0\}$.
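To make eq. (4) concrete, the sketch below solves its inner optimization with the off-the-shelf convex solver cvxpy; this is an illustrative alternative to the consensus ADMM routine of Supplement S8, and the function name and interface are our own. The objective rewrites as $\sum_i \mathrm{rel\_entr}(w_i,q_i)$ with $q_i=p_\theta(x_i)/(n\,\hat p_{\mathrm{fin}}(x_i))$.

```python
import numpy as np
import cvxpy as cp

def okl_weights_finite(log_ptheta, phat_fin, eps):
    """Inner minimization of eq. (4): optimize w over the probability
    simplex intersected with the TV ball (1/2)||w - o||_1 <= eps.

    log_ptheta : log p_theta(x_i) for each observation
    phat_fin   : histogram estimate phat_fin(x_i) for each observation
    """
    n = len(log_ptheta)
    # sum_i w_i log(n w_i phat_i / ptheta_i) = sum_i rel_entr(w_i, q_i)
    q = np.exp(np.asarray(log_ptheta)) / (n * np.asarray(phat_fin))
    w = cp.Variable(n, nonneg=True)
    problem = cp.Problem(
        cp.Minimize(cp.sum(cp.rel_entr(w, q))),
        [cp.sum(w) == 1, cp.norm1(w - np.full(n, 1.0 / n)) <= 2 * eps],
    )
    problem.solve()
    return w.value, problem.value  # optimal weights and the value of (4)
```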

When $\mathcal X=\mathbb R^d$, the above approximation strategy needs to be modified, since $\mathrm{KL}(Q\,\|\,P_\theta)$ is unbounded whenever $P_\theta$ is a continuous distribution and $Q=\sum_{i=1}^n w_i\delta_{x_i}$ is a discrete distribution. However, using a suitable kernel $\kappa:\mathcal X\times\mathcal X\to[0,\infty)$ (e.g. $\kappa_h(x,y)=\frac{1}{(\sqrt{2\pi}\,h)^d}e^{-\|x-y\|^2/(2h^2)}$) one can establish the continuous space approximation (Supplement S6)

$$\hat I_\epsilon(\theta)=\inf_{w\in\hat\Delta_n\,:\,\frac12\|w-o\|_1\le\epsilon}\ \sum_{i=1}^n w_i\log\frac{n\,w_i\,\hat p(x_i)}{p_\theta(x_i)}, \qquad (5)$$

where $\hat p$ is a density estimator for $p_0$ based on $x_1,\dots,x_n$, $A$ is an $n\times n$ matrix with entries $A_{ij}=\frac{\kappa(x_i,x_j)}{n\,\hat p(x_i)}$, and $\hat\Delta_n=A\Delta_n$ is the image of the $n$-dimensional probability simplex under the linear operator $A$. We will typically take $\hat p(\cdot)=\frac1n\sum_{j=1}^n\kappa(\cdot,x_j)$ to be the kernel density estimate based on the same kernel $\kappa$, in which case $A$ is the stochastic matrix obtained by normalizing the rows of the kernel matrix $K=(\kappa(x_i,x_j))_{i,j\in[n]}$ to sum to one.

The continuous space approximation in eq. (5) yields the finite space approximation in eq. (4) as a special case when $\kappa(x,y)=\mathbb I\{x=y\}$ is taken to be the indicator kernel. The weight vectors in $\hat\Delta_n$ are always non-negative and approximately sum to one for large values of $n$, since $\sum_{i=1}^n A_{ij}=\frac1n\sum_{i=1}^n\frac{\kappa(x_i,x_j)}{\hat p(x_i)}\approx\int\frac{\kappa(x,x_j)}{\hat p(x)}p_0(x)\,dx\approx\int\kappa(x,x_j)\,dx=1$.
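As a small sketch of this construction (helper name ours), the function below builds $A$ for the Gaussian kernel. Because the kernel's normalizing constant cancels in $A_{ij}=\kappa(x_i,x_j)/(n\,\hat p(x_i))$ when $\hat p$ is the kernel density estimate with the same kernel, $A$ is simply the row-normalized kernel matrix.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_transport_matrix(x, h):
    """Row-normalized Gaussian kernel matrix A of Section 2.2.1:
    A_ij = kappa(x_i, x_j) / (n * phat(x_i)), where phat is the kernel
    density estimate built from the same kernel."""
    K = np.exp(-cdist(x, x, "sqeuclidean") / (2.0 * h**2))
    return K / K.sum(axis=1, keepdims=True)
```

The kernelized w-step of eq. (5) then optimizes over $w=Av$ with $v\in\Delta_n$.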

We now briefly describe the large sample convergence of $\hat I_\epsilon(\theta)$ to $I_\epsilon(\theta)$ shown in Supplement S6. Proving this exact statement is technically challenging, as it requires proving uniform convergence of the objective in eq. (5) when the weights are allowed to vary over the entire range $\hat\Delta_n$. To get around this, we restrict the optimization domain to $\hat\Delta_n^\beta=A\Delta_n^\beta$, where $\Delta_n^\beta=\{(v_1,\dots,v_n)\in\Delta_n : v_i\in[\frac{\beta}{n},\frac{1}{n\beta}]\}$ for a suitably small constant $\beta>0$. Thus, the estimator that we theoretically study is given by

$$\hat I_{\epsilon,\beta}(\theta)=\inf_{w\in\hat\Delta_n^\beta\,:\,\frac12\|w-o\|_1\le\epsilon}\ \sum_{i=1}^n w_i\log\frac{n\,w_i\,\hat p(x_i)}{p_\theta(x_i)}. \qquad (6)$$

Given this change, we can show the following result.

Theorem 2.

Suppose $\mathrm{supp}(p_0)=\mathrm{supp}(p_\theta)=\mathcal X$ is a compact subset of $\mathbb R^d$ and there exists a constant $\gamma>0$ such that $p_0(x),p_\theta(x)\in[\gamma,1/\gamma]$ for all $x\in\mathcal X$. Suppose that we use the probability kernel $\kappa(x,y)=\frac{1}{h^d}\phi(\|x-y\|/h)$ with bandwidth $h>0$, where $\kappa$ is positive semi-definite and $\phi$ is a bounded function with exponentially decaying tails. Assume that $p_0$ and $\log p_\theta$ are $\alpha$-Hölder smooth over $\mathcal X$, and suppose that we use the clipped density estimator $\hat p(x)=\min\{\max\{\frac1n\sum_{i=1}^n\kappa(x_i,x),\,\gamma\},\,1/\gamma\}$. Then for any constant $0<\beta\le\gamma^2/4$,

$$\left|\hat I_{\epsilon,\beta}(\theta)-I_\epsilon(\theta)\right|\le\tilde O\!\left(n^{-\frac12}h^{-d}+h^{\frac\alpha2}+\psi(h)\right),$$

with probability at least $1-1/n$, where $\psi(r)=\lambda(\mathcal X\setminus\mathcal X_{-r})/\lambda(\mathcal X)$, $\lambda(\cdot)$ is the $d$-dimensional Lebesgue measure, $\mathcal X_{-r}=\{x\in\mathcal X : B(x,r)\subseteq\mathcal X\}$, and $\tilde O(\cdot)$ hides constants and logarithmic factors.

Here $\psi(r)$ measures the fraction of the volume of $\mathcal X$ that is contained in the envelope of width $r$ closest to the boundary. For well-behaved sets, we expect $\psi(r)$ to decrease to $0$ as $r\to0$. For example, if $\mathcal X$ is a $d$-dimensional ball of radius $r_0$, then $\psi(r)=1-(1-\frac{r}{r_0})^d$.

We prove our theory with the truncated estimator $\hat I_{\epsilon,\beta}(\theta)$ instead of $\hat I_\epsilon(\theta)$ for technical reasons. By a suitable version of the sandwiching lemma (Supplement S3.2), under the assumptions of Theorem 2, we conjecture that the optimal weights in (5) lie in the set $\hat\Delta_n^\beta$ for a small enough constant $\beta>0$, in which case we will have $\hat I_{\epsilon,\beta}(\theta)=\hat I_\epsilon(\theta)$.

Given the large sample convergence of $\hat I_\epsilon(\theta)$ to $I_\epsilon(\theta)$ for every $\theta\in\Theta$, one expects that any estimator $\hat\theta\in\Theta$ that is a global minimizer of $\hat I_\epsilon$ will approach the set $\Theta_I$ (Assumption 2.1) as $n\to\infty$. A qualitative version of this result, proved in Supplement S7, is shown next.

Proposition 1.

Suppose that (i) $\mathcal X$ and $\Theta$ are compact metric spaces, and $(x,\theta)\mapsto\log p_\theta(x)$ is a continuous function from $\mathcal X\times\Theta\to\mathbb R$. Let $\epsilon\ge0$ be such that (ii) Assumption 2.1 is satisfied. Let $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}p_0$, and suppose that (iii) for any fixed $\theta\in\Theta$, we have $\hat I_\epsilon(\theta)\overset{P}{\to}I_\epsilon(\theta)$ as $n\to\infty$. If an estimator $\hat\theta\in\Theta$ satisfies $\hat I_\epsilon(\hat\theta)-\min_{\theta\in\Theta}\hat I_\epsilon(\theta)\overset{P}{\to}0$, then for every $\delta>0$,

$$\lim_{n\to\infty}P\left(d_{TV}(P_{\hat\theta},P_0)>\epsilon+\delta\right)=0.$$

Theorem 1 and Theorem 2 (with the conjectured $\beta=0$) provide sufficient conditions for condition (iii) in Proposition 1 to hold. Condition (i) is primarily used to show that the convergence in (iii) is uniform, i.e. $\sup_{\theta\in\Theta}|\hat I_\epsilon(\theta)-I_\epsilon(\theta)|\overset{P}{\to}0$. Finally, condition (ii) is satisfied when the data distribution $P_0=(1-\epsilon)P_{\theta^*}+\epsilon C$ is a contamination. Proposition 1 and the triangle inequality then bound the model estimation error: $d_{TV}(P_{\hat\theta},P_{\theta^*})\le2\epsilon+o_P(1)$.

2.2.2. OWL Methodology

Based on the computable approximation $\hat I_\epsilon$ (or $\hat I_{\epsilon,\mathrm{fin}}$) to the OKL function, Algorithm 1 uses an alternating optimization to minimize $\hat I_\epsilon(\theta)$ over $\theta$. The procedure simultaneously estimates a minimizer $\hat\theta$ and optimistic weights $w_1,\dots,w_n\ge0$ that sum to $n$.

In Supplement S8, we expand further on the computational details for the θ-step and w-step in Algorithm 1. In particular, the w-step is a convex optimization problem which we can solve by using the consensus ADMM algorithm (Boyd et al., 2011; Parikh and Boyd, 2014) based on proximal operators that can be efficiently computed. Further, the θ-step corresponds to maximizing a weighted likelihood, which can be performed for many models through simple modifications of procedures for the corresponding maximum likelihood estimation. As shown in Supplement S8, the latter computation is easy to perform exactly for exponential families, while for mixture models we can use a weighted variant of the ‘hard EM’ algorithm (Samdani et al., 2012).
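For illustration, here is a minimal sketch of one weighted 'hard EM' update for a spherical Gaussian mixture with equal variances and mixing weights; the full weighted hard-EM variant in Supplement S8 also updates variances and mixing proportions, which we omit here.

```python
import numpy as np

def weighted_hard_em_step(x, w, mus):
    """One theta-step update sketch: hard-assign each point to its nearest
    mean (the most likely component under equal spherical variances and
    mixing weights), then update each mean by the w-weighted average of
    its assigned points."""
    d2 = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    z = d2.argmin(axis=1)                                       # hard labels
    new_mus = mus.copy()
    for k in range(mus.shape[0]):
        mask = z == k
        if w[mask].sum() > 0:
            new_mus[k] = (w[mask, None] * x[mask]).sum(axis=0) / w[mask].sum()
    return new_mus, z
```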

In Supplement S10, we show that the iterates $\{\theta_t\}_{t\ge1}$ of Algorithm 1 decrease the objective value $\theta\mapsto\hat I_\epsilon(\theta)$ at each step, and that Algorithm 1 is an instance of the well-studied Majorization-Minimization (MM) (Hunter and Lange, 2004) and Difference of Convex (DC) (Le Thi and Pham Dinh, 2018) classes of algorithms. We use this to analyze the convergence of $\{\theta_t\}_{t\ge1}$ when $\{p_\theta\}_{\theta\in\Theta}$ is an exponential family (see Remark S10.1).

Algorithm 1.

OWL Methodology

Input: Model $\{p_\theta\}_{\theta\in\Theta}$, radius $\epsilon\ge0$, kernel $\kappa$, initial $\theta_1\in\Theta$, and iteration limit $T$.

for $t=1,\dots,T$ do

w-step: Find $w_t=(w_{t,1},\dots,w_{t,n})\in\mathbb R_{\ge0}^n$ that minimizes eq. (5) for $\theta=\theta_t$.

θ-step: Find $\theta_{t+1}$ that maximizes the weighted likelihood $\theta\mapsto\sum_{i=1}^n w_{t,i}\log p_\theta(x_i)$.

end for

Output: The robust parameter estimate $\theta_T$ and the data weights $n\,w_T$.

In practice, when the data lie in a continuous space, we often avoid using the kernel-based estimator eq. (5) to determine the weights in the w-step of Algorithm 1 because it greatly slows down the computation (see Supplement S8.1.3). Instead, we often perform the w-step by solving the unkernelized optimization problem using $\kappa(x,y)=\mathbb I\{x=y\}$. We demonstrate via simulations (Section 4) that the unkernelized version of the OWL procedure has equally good performance compared to the kernelized version with a suitably tuned bandwidth. The theoretical impact of this approximation can be explored by studying the limiting behavior of the OKL estimator for suitably fixed choices of $\kappa$ as $n\to\infty$.
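Putting the pieces together, the following is a minimal end-to-end sketch of the unkernelized Algorithm 1 for a spherical Gaussian location model, with the w-step delegated to the generic solver cvxpy rather than the ADMM routine of Supplement S8; the function name and defaults are our own.

```python
import numpy as np
import cvxpy as cp
from scipy.stats import multivariate_normal

def owl_gaussian_location(x, eps, iters=20):
    """Unkernelized OWL (Algorithm 1 with kappa(x, y) = 1{x = y}) for the
    model p_theta = N(theta, I_p). With distinct data points,
    phat_fin(x_i) = 1/n, so the w-step objective in eq. (4) reduces to
    sum_i w_i log(w_i / p_theta(x_i))."""
    n, p = x.shape
    o = np.full(n, 1.0 / n)
    theta, w_val = x.mean(axis=0), o.copy()   # initialize at the MLE
    for _ in range(iters):
        # w-step: optimistic re-weighting toward the current model fit
        logp = multivariate_normal(theta, np.eye(p)).logpdf(x)
        q = np.exp(np.clip(logp, -300, None))  # floor to avoid exact zeros
        w = cp.Variable(n, nonneg=True)
        cp.Problem(
            cp.Minimize(cp.sum(cp.rel_entr(w, q))),
            [cp.sum(w) == 1, cp.norm1(w - o) <= 2 * eps],
        ).solve()
        w_val = np.clip(w.value, 0.0, None)
        # theta-step: weighted MLE of the mean, i.e. a weighted average
        theta = (w_val[:, None] * x).sum(axis=0) / w_val.sum()
    return theta, n * w_val  # estimate and weights rescaled to sum to n
```

On data like that of Figure 1, the returned weights should be near zero on corrupted points and roughly uniform elsewhere, mirroring the re-weighting shown in the right panel there.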

2.2.3. Setting the corruption fraction ϵ

So far, we have assumed that the parameter $\epsilon\in(0,1)$, which can be interpreted as the fraction of corrupted samples in the population distribution, is fixed at a known value that satisfies Assumption 2.1. Now let us see how the population level analysis (Section 2.1) can inform our choice of $\epsilon$. Assumption 2.1 is satisfied as long as $\epsilon\ge\varepsilon_0$, where

$$\varepsilon_0=\min_{\theta\in\Theta}d_{TV}(P_0,P_\theta)=\min\left\{\epsilon\in[0,1] : \min_{\theta\in\Theta}I_\epsilon(\theta)=0\right\}.$$

Hence, in principle, we could set $\epsilon=\varepsilon_0$ to use OWL to perform minimum-TV estimation (Yatracos, 1985), which has the following advantages: (1) while directly minimizing TV distance is computationally intractable, the OWL methodology decomposes this problem into alternating convex optimization and weighted MLE steps, both of which are standard problems that often tend to be well-behaved, and (2) the OWL methodology provides us with weight vectors that can indicate outlying observations and relates minimum-TV estimation to likelihood based inference.

In order to choose $\epsilon\approx\varepsilon_0$ in practice, we define the function $\hat g(\epsilon)=\hat I_\epsilon(\hat\theta_\epsilon)$, where $\hat\theta_\epsilon$ is the parameter estimate computed by the OWL procedure for a given $\epsilon$. At the population level, the corresponding function $g(\epsilon)=\min_{\theta\in\Theta}I_\epsilon(\theta)$ is monotonically decreasing in $\epsilon$ until $\epsilon=\varepsilon_0$, at which point it remains at $0$. This introduces a kink, or elbow, at $\varepsilon_0$ that we hope to identify in the sample estimate $\hat g$. Thus, our $\epsilon$ search procedure is to compute $\hat g$ over a fixed grid of $\epsilon$-values, smooth the resulting curve, and then select amongst the points of largest curvature (computed numerically), where the curvature of a twice-differentiable function $f$ at a point $x$ is given by $f''(x)/(1+f'(x)^2)^{1.5}$ (Satopaa et al., 2011). Despite the various approximations involved, our simulation results (Section 4) show that the OWL procedure with such a tuned value of $\epsilon$ provides almost identical performance when compared with the OWL procedure with the true value of $\epsilon$.
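The elbow search itself is straightforward to sketch; the grid, smoother, and window size below are practitioner choices rather than prescriptions from the paper.

```python
import numpy as np

def select_epsilon(eps_grid, g_hat, window=5):
    """Elbow-finding sketch for Section 2.2.3: smooth the sampled curve
    ghat(eps) over the grid, then return the grid point of largest
    numerical curvature |f''| / (1 + f'^2)^1.5 (Satopaa et al., 2011)."""
    g = np.convolve(g_hat, np.ones(window) / window, mode="same")
    d1 = np.gradient(g, eps_grid)
    d2 = np.gradient(d1, eps_grid)
    curvature = np.abs(d2) / (1.0 + d1**2) ** 1.5
    return eps_grid[np.argmax(curvature)]
```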

2.3. OWL extension to non-identically-distributed data

While the population level analysis and theoretical results for the OKL estimator were derived under the assumption that data are generated i.i.d. from a distribution P0, the OWL procedure can be adapted to robustify likelihood based inference in the setting where the data are conditionally independent, but not necessarily identically distributed.

Suppose data $z_1,\dots,z_n\in\mathcal Z$ are conditionally independent, with the likelihood having the product form $p_\theta(z_{1:n})=\prod_{i=1}^n p_{\theta,i}(z_i)$ for known functions $\{p_{\theta,i}\}_{i=1}^n$. For example, if $z_i=(y_i,x_i)\in\mathbb R\times\mathcal X$ for $i=1,\dots,n$, this includes the case of regression models $\{q_\theta(y\mid x)\}_{\theta\in\Theta}$ under the setup $p_{\theta,i}(z_i)=q_\theta(y_i\mid x_i)$. Another example of this setup includes mixture models if we expand the parameter space to also include cluster assignments (see Supplement S8.2).

To robustify inference based on the product likelihood $p_\theta(z_{1:n})=\prod_{i=1}^n p_{\theta,i}(z_i)$, we can replace the w- and θ-steps in Algorithm 1 by analogous steps for the product likelihood. In particular, the modified w-step is given by

$$w_t=\underset{w\in\Delta_n\,:\,\frac12\|w-o\|_1\le\epsilon}{\arg\min}\left\{-\sum_{i=1}^n w_i\log p_{\theta_t,i}(z_i)+\sum_{i=1}^n w_i\log w_i\right\}$$

and the modified θ-step is given by

$$\theta_{t+1}=\underset{\theta\in\Theta}{\arg\max}\ \sum_{i=1}^n w_{t,i}\log p_{\theta,i}(z_i).$$

Despite our lack of theory in the non-identically-distributed case, we continue to see good empirical performance of OWL (see Section 4, Section 5, and Supplement S12).
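As a concrete instance of this extension, the sketch below implements OWL for the linear regression model used later in Sections 4 and 5, treating the residual scale as known for simplicity; the interface and defaults are our own.

```python
import numpy as np
import cvxpy as cp
from scipy.stats import norm

def owl_linear_regression(X, y, eps, sigma=1.0, iters=20):
    """OWL for conditionally independent data (Section 2.3) with
    p_{theta,i}(z_i) = N(y_i; x_i' beta, sigma^2). Alternates the
    modified w-step with a weighted least-squares theta-step."""
    n = X.shape[0]
    o = np.full(n, 1.0 / n)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # initialize at OLS
    w_val = o.copy()
    for _ in range(iters):
        # modified w-step: min_w -sum_i w_i log p_i + sum_i w_i log w_i
        logp = norm.logpdf(y, loc=X @ beta, scale=sigma)
        q = np.exp(np.clip(logp, -300, None))
        w = cp.Variable(n, nonneg=True)
        cp.Problem(
            cp.Minimize(cp.sum(cp.rel_entr(w, q))),
            [cp.sum(w) == 1, cp.norm1(w - o) <= 2 * eps],
        ).solve()
        w_val = np.clip(w.value, 0.0, None)
        # modified theta-step: weighted least squares via sqrt-w scaling
        s = np.sqrt(w_val)
        beta = np.linalg.lstsq(s[:, None] * X, s * y, rcond=None)[0]
    return beta, n * w_val
```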

3. Asymptotic connection to coarsened inference

The development of the OWL methodology in Section 2 followed from a presumed form of misspecification given by Assumption 2.1. An alternative way to frame and address such misspecification in a probabilistic framework was proposed by Miller and Dunson (2019), who introduced a Bayesian methodology centered around the concept of a coarsened likelihood defined as

$$L_\epsilon(\theta\mid x_{1:n})=P_\theta\left(d(\hat P_{Z_{1:n}},\hat P_{x_{1:n}})\le\epsilon\right), \qquad (7)$$

where $d$ is a suitably chosen discrepancy between empirical probability measures. Here, $\hat P_{x_{1:n}}=n^{-1}\sum_{i=1}^n\delta_{x_i}$ denotes the empirical distribution of the data $x_{1:n}$, and the probability is computed under $P_\theta$, the distribution underlying the artificial data $Z_1,\dots,Z_n\overset{\text{i.i.d.}}{\sim}P_\theta$ from which the random measure $\hat P_{Z_{1:n}}=n^{-1}\sum_{i=1}^n\delta_{Z_i}$ is constructed. The coarsened likelihood implicitly captures the likelihood of a probabilistic procedure in which idealized data are first generated by some model $P_\theta$ in the model class under consideration, but are then corrupted in such a way that the discrepancy between the empirical measures of the idealized data and the observed data is bounded by $\epsilon$.

When $d$ is an estimator for the KL divergence and an exponential prior is placed on $\epsilon$, Miller and Dunson (2019) showed that the Bayes posterior based on $L_\epsilon(\theta\mid x_{1:n})$ could be approximated by raising the likelihood to a power less than one in the formula for the standard posterior. However, to obtain a robustified alternative to maximum likelihood estimation, one may wish to maximize $\theta\mapsto L_\epsilon(\theta\mid x_{1:n})$ directly for a choice of $d$ that guarantees robustness (e.g. maximum mean discrepancy or the TV distance). Such an approach would in general be quite challenging, since evaluating eq. (7) corresponds to computing a high-dimensional integral.
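To see the difficulty concretely, a naive Monte Carlo evaluation of eq. (7) on a finite space might look as follows; this is an illustration of the cost, not a practical estimator, and the function name and interface are our own.

```python
import numpy as np

def coarsened_likelihood_mc(p_theta, obs_counts, eps, reps=100_000, seed=0):
    """Naive Monte Carlo estimate of the coarsened likelihood (7) on a
    finite space, with d the TV distance between empirical pmfs. The
    acceptance probability decays exponentially in n (cf. Theorem 3),
    so the number of replicates needed grows astronomically with n.

    p_theta    : model pmf over the finite support (sums to 1)
    obs_counts : observed count at each support point (sums to n)
    """
    rng = np.random.default_rng(seed)
    n = obs_counts.sum()
    p_obs = obs_counts / n
    z_counts = rng.multinomial(n, p_theta, size=reps)     # artificial data
    tv = 0.5 * np.abs(z_counts / n - p_obs).sum(axis=1)   # d(Phat_Z, Phat_x)
    return (tv <= eps).mean()
```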

In this section, we show that for large $n$, the coarsened likelihood can be approximately maximized via the OWL methodology when $d$ is an estimator for the TV distance. Specifically, if the observed data $x_1,\dots,x_n$ are generated i.i.d. from some distribution $P_0$ and $d$ satisfies some regularity conditions, then the quantity $-\frac1n\log L_\epsilon(\theta\mid x_{1:n})$ converges as $n\to\infty$ to a variant of $I_\epsilon(\theta)$ based on $d$. Hence, the OWL methodology asymptotically maximizes the coarsened likelihood $\theta\mapsto L_\epsilon(\theta\mid x_{1:n})$.

We state this asymptotic connection first for finite spaces and then for continuous spaces. These results rely on Sanov’s theorem from large deviation theory (Dembo and Zeitouni, 2010) and are proved in Supplement S9.

3.1. Asymptotic connection in finite spaces

Let $\mathcal X$ be a finite set and denote the space of probability distributions on $\mathcal X$ by the simplex $\Delta_{\mathcal X}\equiv\{q\in[0,1]^{\mathcal X} : \sum_{x\in\mathcal X}q(x)=1\}$. Let $\{p_\theta\}_{\theta\in\Theta}\subseteq\Delta_{\mathcal X}$ denote the collection of model distributions, and $p_0\in\Delta_{\mathcal X}$ denote the true data generating distribution. To connect the OKL with the coarsened likelihood in eq. (7), we take $d(p,q)=\frac12\|p-q\|_1$.

In this setting, we can show that $-\frac1n\log L_\epsilon(\theta\mid x_{1:n})$ converges in probability to the OKL function $I_\epsilon(\theta)$ at a rate of $n^{-1/2}$, as demonstrated by the following theorem.

Theorem 3.

Suppose that $I_{\epsilon_0}(\theta)<\infty$ for some $\epsilon_0>0$ and let $\delta>0$. If $\epsilon>\epsilon_0$ and $x_1,x_2,\dots,x_n\overset{\text{i.i.d.}}{\sim}p_0$, then with probability at least $1-\delta$,

$$\left|I_\epsilon(\theta)+\frac1n\log L_\epsilon(\theta\mid x_{1:n})\right|\le O\!\left(\frac{|\mathcal X|}{\epsilon-\epsilon_0}\sqrt{\frac1n\log\frac1\delta}+\frac{|\mathcal X|}{n}\cdot\frac{|\mathcal X|+\log(n+1)}{\epsilon-\epsilon_0}\right).$$

Theorem 1 and Theorem 3 together show that, in the large sample limit, the OWL methodology and coarsened likelihood philosophy are two sides of the same coin: they both provide approximations of the OKL and, in turn, must approximate each other.

3.2. Asymptotic connection in continuous spaces

Suppose $\mathcal X$ is a Polish space (e.g. $\mathcal X=\mathbb R^d$). Similar to the finite case, we can establish the following asymptotics for the coarsened likelihood for a suitable class of discrepancies $d$, which includes the Wasserstein distance, maximum mean discrepancy with a suitable choice of kernel (Simon-Gabriel and Schölkopf, 2018), and the smoothed TV distance (Definition S9.1 in Supplement S9.3).

Theorem 4.

Suppose $I_{\epsilon_0}(\theta)<\infty$ for some $\epsilon_0>0$ and $d:\mathcal P(\mathcal X)\times\mathcal P(\mathcal X)\to[0,\infty)$ is a pseudometric that is convex in its arguments and continuous with respect to the weak convergence topology on $\mathcal P(\mathcal X)$. If $\epsilon>\epsilon_0$ and $x_1,\dots,x_n\overset{\text{i.i.d.}}{\sim}P_0$, then

$$-\frac1n\log L_\epsilon(\theta\mid x_{1:n})\ \overset{P}{\longrightarrow}\ \inf_{Q\in\mathcal P(\mathcal X)\,:\,d(Q,P_0)\le\epsilon}\mathrm{KL}(Q\,\|\,P_\theta)\quad\text{as }n\to\infty.$$

Recall that the limiting expression in the above theorem has the same form as the OKL function given in eq. (3). However, in order to establish the connection between the OKL function and the coarsened likelihood, unlike in the finite case, we cannot merely take the discrepancy $d$ in the coarsened likelihood to be the TV distance, since the TV distance between the two empirical distributions in eq. (7) will almost surely be equal to one. Instead, we take $d$ to be a smoothed version of the TV distance, calculated by first convolving the empirical measures with a smooth kernel function $K_h:\mathcal X\times\mathcal X\to[0,\infty)$ indexed by a bandwidth parameter $h>0$. Further details can be found in Supplement S9.3.

4. Simulation Examples

We now demonstrate the OWL methodology in simulated examples with artificially injected corruptions. In each simulation, we considered two methods for choosing the points to corrupt: (i) max-likelihood corruption where we fit a maximum likelihood estimate to the uncorrupted data and select the points with the highest likelihood; and (ii) random corruption where we choose the points to corrupt uniformly at random. We ran each simulation with 50 initial seeds, plotting the mean performance and its 95% confidence band. For clarity and space, we present only the results for max-likelihood corruptions in this section, and we defer the results for randomly-selected corruptions to Supplement S11.

In all comparisons, OWL refers to our methodology with the data-based choice of the corruption fraction $\epsilon$ as described in Section 2.2.3, while OWL ($\epsilon$ known) refers to our methodology with $\epsilon$ equal to the true level of corruption in the data. MLE refers to the standard maximum likelihood estimate. The Pearson residuals baseline is the method from Markatou et al. (1998) based on the Hellinger residual adjustment function: a weighted-likelihood based method aimed at performing minimum Hellinger-distance estimation. For settings where the outcomes were continuous, the Pearson residual density estimate was computed via kernel density estimation, with the bandwidth parameter selected to minimize the empirical Hellinger distance between the final model and the density estimate. For the linear regression baselines, ridge regression was performed with the $L_2$ penalty chosen via cross validation, and Huber regression used the standard tuning constant of 1.345 (Huber, 1964). For the logistic regression baselines, $L_2$-regularized MLE selected the $L_2$ penalty via cross validation. For both regression settings, Random Sample Consensus (RANSAC) utilized the ground-truth corruption fraction.

Gaussian modeling.

Our first simulation fit a multivariate normal distribution to data generated from a $p$-dimensional spherical normal distribution. The ground truth mean was drawn uniformly from $[-10,10]^p$, and the corrupted data points were also drawn uniformly from $[-10,10]^p$. The total number of points was 200, and we measured the dimension-normalized mean-squared error (MSE) of ground-truth mean parameter recovery.

Figure 3a shows the results of the Gaussian simulation. In particular, we see that both the kernelized and the unkernelized versions of OWL had almost identical performance in this example, achieving substantial robustness to corruptions over the sample mean. In light of this similarity of performance, the rest of our simulation examples utilize the much more computationally efficient unkernelized version of OWL. We also observed that the kernelized version of OWL was not particularly sensitive to the choice of bandwidth parameter (Figure S3). We further observe that on this example, the Pearson residuals method outperforms OWL, although the difference in performance is small relative to the gains of all methods over MLE.

Figure 3: Simulation results for max-likelihood corruptions. (a) Mean parameter reconstruction for a multivariate normal model. (b) Test MSE for linear regression. (c) Test accuracy for logistic regression. (d) Mean parameter reconstruction for mixture models. In all figures, the dashed black line denotes median performance of MLE on the full uncorrupted training set, and the shaded regions denote bootstrapped 95% confidence intervals over 50 random seeds.

Linear regression.

We considered a homoscedastic linear model with normally distributed errors and two datasets. The first is a simulated dataset with 10-dimensional i.i.d. standard normal covariates. The ground-truth regression coefficients were drawn independently from $\mathcal N(0,4)$, the intercept was set to 0, and the residual standard deviation was $\sigma=1/4$. The training set consisted of 1,000 data points. For the test set, we drew 1,000 new data points and computed the MSE on the underlying mean response.

The second dataset was taken from a quantitative structure activity relationship (QSAR) dataset (Olier, 2020) compiled by Olier et al. (2018) from the ChEMBL database. It consists of 5012 chemical compounds whose activities were measured on the epidermal growth factor receptor protein erbB1. The activities were recorded as the negative log of the chemical concentration that inhibited 50% of the protein target, i.e. the pIC50 value. Each compound had 1024 associated binary covariates, corresponding to the 1024-dimensional FCFP4 fingerprint representation of the molecule (Rogers and Hahn, 2010). We used PCA to reduce the dimension to 50. For every random seed, we computed a random 80/20 train/test split. The test MSE on this dataset is the standard MSE over the test responses. In both datasets, for each data point selected to be corrupted, we corrupted the response by fitting a least squares solution and observing the residual: if the residual was positive, we set the response to $3v$, where $v$ is the largest absolute value observed in the training set responses, and otherwise we set it to $-3v$.

Figure 3b shows the results of the linear regression simulations for the max-likelihood corruptions. Across both datasets, we see that OWL is competitive with the best of the robust regression methods, whether that method is RANSAC or Huber regression.

Logistic regression.

For the logistic regression setting, we have parameters $b\in\mathbb R$, $w\in\mathbb R^p$ and observations $(x_i,y_i)\in\mathbb R^{p+1}$ assumed to follow the distribution

$$y_i\sim\mathrm{Bernoulli}\left(\frac{1}{1+\exp(-\langle x_i,w\rangle-b)}\right).$$

We considered three datasets. The first is a simulated dataset using the same parameters as the linear regression setting. The training labels are created according to the generative model. For test accuracy, we computed accuracy against the true sign-values, i.e. $\mathbb I\{\langle w,x\rangle\ge0\}$.

The second dataset is taken from the MNIST handwritten digit classification dataset (LeCun et al., 1998). We considered the problem of classifying the digit '1' vs. the digit '8', resulting in a dataset with 14702 data points and 784 covariates, representing pixel intensities. The third dataset is a collection of 5172 documents from the Enron spam classification dataset, preprocessed to contain 5116 covariates, representing word counts (Metsis et al., 2006). For both the MNIST and the Enron spam datasets, we reduced the dimensionality to 10 via PCA and used a random 80/20 train/test split.

Figure 3c shows the results of the logistic regression simulations. Across all datasets, OWL again outperforms the other approaches in the presence of corruption.

Gaussian mixture models.

Recall the standard Gaussian mixture modeling setup: there are a collection of means $\mu_1,\dots,\mu_K\in\mathbb R^p$, standard deviations $\sigma_1,\dots,\sigma_K>0$, and mixing weights $\pi\in\Delta_K$. Data points $x_i\in\mathbb R^p$ are drawn i.i.d. according to

$$X_i\sim\sum_{k=1}^K\pi_k\,\mathcal N(\mu_k,\sigma_k^2 I_p).$$

For our simulations, we generated a synthetic dataset of 1000 points in $\mathbb R^{10}$ by first drawing $K=3$ means $\mu_1,\dots,\mu_K$ whose coordinates are i.i.d. Gaussian with standard deviation 2. We set $\sigma_1=\sigma_2=\sigma_3=1/2$ and $\pi_1=\pi_2=\pi_3=1/3$. To corrupt a data point, we randomly selected half of its coordinates and set them randomly to either a large positive value or a large negative value (here, 5 and −5). As a metric, we measured the average mean squared Euclidean distance between the means of the fitted model and the ground truth model.

For all methods, we used random restarts, choosing the final model based on the method’s criterion: likelihood for MLE, the OKL estimate for OWL, and empirical Hellinger distance for Pearson residuals. We see that OWL remains robust against varying levels of corruptions, whereas both MLE and Pearson residuals perform significantly worse (left panel of Figure 3d).

Bernoulli product mixture models.

Consider the following model for $p$-dimensional binary data: there are a collection of probability vectors $\lambda_1,\dots,\lambda_K\in[0,1]^p$ and mixing weights $\pi\in\Delta_K$. Each data point $x_i$ is drawn i.i.d. according to the process

$$z_i\sim\mathrm{Categorical}(\pi),$$
$$x_{ij}\sim\mathrm{Bernoulli}(\lambda_{z_i,j})\quad\text{for }j=1,\dots,p.$$

For our simulations, we generated a synthetic dataset of 1000 points in $\{0,1\}^{100}$ by first drawing $K=3$ means $\lambda_1,\dots,\lambda_K$ whose coordinates are i.i.d. from a $\mathrm{Beta}(1/10,1/10)$ distribution. The mixing weights were chosen to be uniform over the components. To corrupt a data point, we flipped each zero coordinate with probability 1/2. As a metric, we measured the average $\ell_1$-distance between the $\lambda$ parameters of the fitted model and the ground truth model.

The right panel of Figure 3d shows the results of the Bernoulli mixture model simulations. We see that OWL remains robust against varying levels of corruptions, whereas both MLE and Pearson residuals perform significantly worse.

5. Application to micro-credit study

In this section we apply OWL to data from a micro-credit study for which maximum likelihood estimation (MLE) is shown to be brittle. In Angelucci et al. (2015), the authors worked with one of the largest micro-lenders in Mexico to randomize their credit rollout across 238 geographical regions in the Sonora state. Within 18–34 months after this rollout, the authors surveyed n=16,560 households for various outcome measures.

Following Broderick et al. (2023), here we focus on the Average Intention to Treat effect (AIT) of the rollout on household profits. For $i\in\{1,\dots,n\}$, let $Y_i$ denote the profit of the $i$th household during the last fortnight (in USD PPP units), and let $T_i\in\{0,1\}$ be a binary variable that is one if household $i$ falls in a geographical region where the rollout happened. The AIT on household profits is defined as the coefficient $\beta_1$ in the model:

$$Y_i=\beta_0+\beta_1 T_i+\varepsilon_i,\qquad\varepsilon_i\overset{\text{i.i.d.}}{\sim}\mathcal N(0,\sigma^2),\qquad i\in\{1,\dots,n\}. \qquad (8)$$

In Supplement S13.1, we reproduce the brittleness in estimating $\beta_1$ using the MLE, as demonstrated in Broderick et al. (2023), by removing a handful of observations.

Here we compare OWL to the above data deletion approach. We fit the model (8) to the full data set using 50 $\log_{10}$-spaced $\epsilon$-values between $10^{-4}$ and $10^{-1}$, and used the tuning procedure in Section 2.2.3 to select the value $\epsilon_0=0.005$, where the minimum-OKL versus epsilon plot (Supplement S13, Figure S9) has its most prominent kink. We also calculate the MLE, which corresponds to OWL with $\epsilon=0$. The AIT on household profit estimated by OWL as a function of $\epsilon$ can be seen in the left panel of Figure 4. For values of $\epsilon$ below $\epsilon_0$, the AIT estimates change rapidly as $\epsilon$ changes, while for values of $\epsilon$ above $\epsilon_0$, the AIT estimates are quite stable with changes in $\epsilon$. This is due to OWL automatically down-weighting the outlying observations, as seen in the right panel of Figure 4. We quantify uncertainty in our estimates by using a variant of the bootstrap (Supplement S13.2).

Figure 4: Estimating the Average Intention to Treat (AIT) effect on household profits in the micro-credit study of Angelucci et al. (2015) in the presence of outliers. Left: the AIT estimates using OWL for various values of $\epsilon$, along with 90% vertical bootstrap confidence bands. The vertical line is drawn at the value $\epsilon_0=0.005$ obtained by the tuning procedure in Section 2.2.3, and roughly coincides with the $\epsilon$ beyond which the AIT estimates stabilize and the size of the confidence bands shrinks (see Supplement S13). Right: the weights estimated by the OWL procedure at $\epsilon=\epsilon_0$ down-weight roughly 1% of the households, those having outlying profit values (for visual clarity, we omit a down-weighted household with profit less than −40K USD PPP); see also Figure S10 in Supplement S13.

In summary, the OWL procedure chose to down-weight roughly 1% of the households with extreme profit values and estimated an AIT of $\beta_1=0.6$ USD PPP per fortnight based on the selected value of $\epsilon_0=0.005$. The value $\epsilon=\epsilon_0$, tuned using the procedure in Section 2.2.3, roughly coincides with the point at which the AIT estimates become stable with respect to $\epsilon$, and also with the point at which the 90% confidence bands for the AIT become narrower, both suggesting that OWL with the choice $\epsilon=\epsilon_0$ has identified and down-weighted outliers that may be causing brittleness in estimating the AIT.

6. Discussion

In this paper, we introduced the optimistically weighted likelihood (OWL) methodology, motivated by brittleness issues arising from misspecification in statistical methodology based on standard likelihoods. On the theoretical side, we established the consistency and robustness of our approach and showed its asymptotic connection to the coarsened inference methodology. We also proposed a feasible alternating optimization scheme to implement the methodology and demonstrated its empirical utility on both simulated and real data.

The OWL methodology opens up several interesting future directions. One practical open problem is how to scale to larger datasets. As a weighted likelihood method, OWL requires solving for a weight vector whose dimension is the size of the dataset. While we can solve the resulting convex optimization problem for thousands of data points, the procedure becomes significantly more complicated when the size of the dataset exceeds computer memory. How do we maintain a feasible solution, i.e. one that lies in the intersection of the simplex and some probability ball, when the entire vector cannot fit in memory?

Another practical question is how to apply the OWL approach in more complex models; for example, involving dependent data. This may be relatively straightforward for models in which the likelihood can still be written in product form due to conditional independence given random effects. This would open up its application to nested, longitudinal, spatial and temporal data, as random effects models are routinely used in such settings.

Finally, it will be fruitful to implement the OWL method for a choice of d other than the TV distance. Indeed, as reflected in Assumption 2.1 and our robustness results, the choice of d controls the nature of robustness offered by OWL. An important choice for d may be the Wasserstein distance because it metrizes the weak convergence topology on bounded spaces (Huber and Ronchetti, 2009, Section 2.4) and, as argued in Chapter 1 of Huber and Ronchetti (2009), can capture robustness to both small changes in all observations (e.g. due to rounding or grouping) and large changes in a few observations (e.g. due to contamination or outliers). Indeed, while several of our theoretical results already hold for the Wasserstein metric, appropriate modifications to the OKL estimator and the OWL algorithm will be needed to make this approach feasible.

Supplementary Material


The accompanying supplementary materials contain a real data application to cluster single-cell RNAseq data, along with additional details (figures, lemmas, definitions, etc. with names starting with the letter ‘S’) referenced in the article.

Acknowledgement

The authors acknowledge funding from grants N00014–21-1–2510-P00001 from the Office of Naval Research (ONR) and R01 ES027498, R01 ES035625, U54 CA274492–01 and R37CA271186 from the National Institutes of Health, as well as helpful discussions with Sayan Mukherjee and Amarjit Budhiraja. The authors also wish to acknowledge the anonymous referees for pointing us to relevant literature.

Footnotes

The authors report that there are no competing interests to declare.

See https://github.com/cjtosh/owl for code to reproduce all analyses.

References

  1. Shun’ichi Amari. Information Geometry and its Applications. Springer, 2016. [Google Scholar]
  2. Angelucci Manuela, Karlan Dean, and Zinman Jonathan. Microcredit impacts: Evidence from a randomized microcredit program placement experiment by Compartamos Banco. American Economic Journal: Applied Economics, 7(1):151–182, 2015. [Google Scholar]
  3. Medina Marco Avella and Ronchetti Elvezio. Robust statistics: A selective overview and new directions. Wiley Interdisciplinary Reviews: Computational Statistics, 7(6):372–393, 2015. [Google Scholar]
4. Avella-Medina Marco and Ronchetti Elvezio. Robust and consistent variable selection in high-dimensional generalized linear models. Biometrika, 105(1):31–44, 2018.
5. Basu Ayanendranath and Lindsay Bruce G. Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics, 46:683–705, 1994.
6. Basu Ayanendranath, Harris Ian R, Hjort Nils L, and Jones MC. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
7. Beath Ken J. A mixture-based approach to robust analysis of generalised linear models. Journal of Applied Statistics, 45(12):2256–2268, 2018.
8. Bernton Espen, Jacob Pierre E, Gerber Mathieu, and Robert Christian P. On parameter estimation with the Wasserstein distance. Information and Inference: A Journal of the IMA, 8(4):657–676, 2019.
9. Bondell Howard D and Stefanski Leonard A. Efficient robust regression via two-stage generalized empirical likelihood. Journal of the American Statistical Association, 108(502):644–655, 2013.
10. Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, and Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
11. Bradic Jelena. Robustness in sparse high-dimensional linear models: Relative efficiency and robust approximate message passing. Electronic Journal of Statistics, 10(2):3894–3944, 2016.
12. Bradic Jelena, Fan Jianqing, and Wang Weiwei. Penalized composite quasi-likelihood for ultrahigh dimensional variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):325–349, 2011.
13. Briol Francois-Xavier, Barp Alessandro, Duncan Andrew B, and Girolami Mark. Statistical inference for generative models with maximum mean discrepancy. arXiv preprint arXiv:1906.05944, 2019.
14. Broderick Tamara, Giordano Ryan, and Meager Rachael. An automatic finite-sample robustness metric: When can dropping a little data make a big difference? arXiv preprint arXiv:2011.14999, 2023.
15. Cai Diana, Campbell Trevor, and Broderick Tamara. Finite mixture models do not reliably learn the number of components. In International Conference on Machine Learning, pages 1158–1169, 2021.
16. Chandra Noirrit Kiran, Canale Antonio, and Dunson David B. Escaping the curse of dimensionality in Bayesian model-based clustering. Journal of Machine Learning Research, 24(144):1–42, 2023.
17. Chen Mengjie, Gao Chao, and Ren Zhao. A general decision theory for Huber’s ϵ-contamination model. Electronic Journal of Statistics, 10:3752–3774, 2016.
18. Cherief-Abdellatif Badr-Eddine and Alquier Pierre. MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. In Proceedings of the 2nd Symposium on Advances in Approximate Bayesian Inference, pages 1–21, 2020.
19. Choi Edwin, Hall Peter, and Presnell Brett. Rendering parametric procedures more robust by empirically tilting the model. Biometrika, 87(2):453–465, 2000.
20. Claeskens Gerda and Hjort Nils Lid. Model Selection and Model Averaging. Cambridge University Press, 2008.
21. Cover Thomas M and Thomas Joy A. Elements of Information Theory. John Wiley & Sons, second edition, 2006.
22. Csiszár Imre. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975.
23. Dembo Amir and Zeitouni Ofer. Large Deviations Techniques and Applications. Springer Berlin Heidelberg, 2010.
24. Donoho David L and Liu Richard C. The “automatic” robustness of minimum distance functionals. The Annals of Statistics, 16(2):552–586, 1988.
25. Dupuis Debbie J and Morgenthaler Stephan. Robust weighted likelihood estimators with an application to bivariate extreme value problems. Canadian Journal of Statistics, 30(1):17–36, 2002.
26. Field C and Smith B. Robust estimation: A weighted maximum likelihood approach. International Statistical Review/Revue Internationale de Statistique, 62(3):405–424, 1994.
27. Ghosh Abhik and Basu Ayanendranath. A new family of divergences originating from model adequacy tests and application to robust statistical inference. IEEE Transactions on Information Theory, 64(8):5581–5591, 2018.
28. Greco Luca and Agostinelli Claudio. Weighted likelihood mixture modeling and model-based clustering. Statistics and Computing, 30(2):255–277, 2020.
29. Hampel Frank R, Ronchetti Elvezio M, Rousseeuw Peter J, and Stahel Werner A. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 2005.
30. Huber Peter J. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
31. Huber Peter J and Ronchetti Elvezio M. Robust Statistics. Wiley, 2009.
32. Huber-Carol Catherine, Balakrishnan Narayanaswamy, Nikulin Mikhail, and Mesbah Mounir. Goodness-of-Fit Tests and Model Validity. Springer Science & Business Media, 2012.
33. Hüllermeier Eyke. Learning from imprecise and fuzzy observations: Data disambiguation through generalized loss minimization. International Journal of Approximate Reasoning, 55(7):1519–1534, 2014.
34. Hüllermeier Eyke and Cheng Weiwei. Superset learning based on generalized loss minimization. In Proceedings of the 2015 European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Part II, pages 260–275, 2015.
35. Hunter David R and Lange Kenneth. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.
36. Le Thi Hoai An and Pham Dinh Tao. DC programming and DCA: thirty years of developments. Mathematical Programming, 169(1):5–68, 2018.
37. LeCun Yann, Bottou Léon, Bengio Yoshua, and Haffner Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
38. Lienen Julian and Hüllermeier Eyke. Instance weighting through data imprecisiation. International Journal of Approximate Reasoning, 134:1–14, 2021.
39. Lindsay Bruce G. Efficiency versus robustness: The case for minimum Hellinger distance and related methods. The Annals of Statistics, 22(2):1081–1114, 1994.
40. Liu Allen and Moitra Ankur. Robustly learning general mixtures of Gaussians. Journal of the ACM, 70(3):1–53, 2023.
41. Liu Jiawei and Lindsay Bruce G. Building and using semiparametric tolerance regions for parametric multinomial models. The Annals of Statistics, 37(6A):3644–3659, 2009.
42. Mancini Loriano, Ronchetti Elvezio, and Trojani Fabio. Optimal conditionally unbiased bounded-influence inference in dynamic location and scale models. Journal of the American Statistical Association, 100(470):628–641, 2005.
43. Markatou Marianthi. Mixture models, robustness, and the weighted likelihood methodology. Biometrics, 56(2):483–486, 2000.
44. Markatou Marianthi, Basu Ayanendranath, and Lindsay Bruce G. Weighted likelihood equations with bootstrap root search. Journal of the American Statistical Association, 93(442):740–750, 1998.
45. Maronna Ricardo A, Martin R Douglas, Yohai Victor J, and Salibián-Barrera Matías. Robust Statistics: Theory and Methods (with R). John Wiley & Sons, 2019.
46. Metsis Vangelis, Androutsopoulos Ion, and Paliouras Georgios. Spam filtering with naive Bayes – which naive Bayes? In CEAS 2006: The Third Conference on Email and Anti-Spam, Mountain View, California, 2006.
47. Miller Jeffrey W and Dunson David B. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.
48. Olier Ivan. QSAR datasets – Meta-QSAR, V1. Mendeley Data, 2020. doi:10.17632/spwgrcnjdg.1.
49. Olier Ivan, Sadawi Noureddin, Bickerton G Richard, Vanschoren Joaquin, Grosan Crina, Soldatova Larisa, and King Ross D. Meta-QSAR: A large-scale application of meta-learning to drug design and discovery. Machine Learning, 107(1):285–311, 2018.
50. Parikh Neal and Boyd Stephen. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
51. Rogers David and Hahn Mathew. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.
52. Rousseeuw Peter J and Leroy Annick M. Robust Regression and Outlier Detection. John Wiley & Sons, 2005.
53. Royall Richard. Statistical Evidence: A Likelihood Paradigm. Routledge, 2017.
54. Samdani Rajhans, Chang Ming-Wei, and Roth Dan. Unified expectation maximization. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 688–698, 2012.
55. Satopaa Ville, Albrecht Jeannie, Irwin David, and Raghavan Barath. Finding a “kneedle” in a haystack: Detecting knee points in system behavior. In 31st International Conference on Distributed Computing Systems Workshops, pages 166–171, 2011.
56. She Yiyuan and Owen Art B. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494):626–639, 2011.
57. Simon-Gabriel Carl-Johann and Schölkopf Bernhard. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. The Journal of Machine Learning Research, 19(1):1708–1736, 2018.
58. Taplin Ross H. Robust likelihood calculation for time series. Journal of the Royal Statistical Society: Series B (Methodological), 55(4):829–836, 1993.
59. Tsou Tsung-Shan and Royall Richard M. Robust likelihoods. Journal of the American Statistical Association, 90(429):316–320, 1995.
60. Parr William C. Minimum distance estimation: a bibliography. Communications in Statistics – Theory and Methods, 10(12):1205–1224, 1981.
61. Windham Michael P. Robustifying model fitting. Journal of the Royal Statistical Society: Series B (Methodological), 57(3):599–609, 1995.
62. Wolfowitz Jacob. The minimum distance method. The Annals of Mathematical Statistics, 28(1):75–88, 1957.
63. Yatracos Yannis G. Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. The Annals of Statistics, 13(2):768–774, 1985.
64. Zellner Arnold. Optimal information processing and Bayes’s theorem. The American Statistician, 42(4):278–280, 1988.

Supplementary Materials

Supp 1
