Proceedings of the National Academy of Sciences of the United States of America. 2020 Jul 6;117(29):16880–16890. doi: 10.1073/pnas.1922664117

Universal inference

Larry Wasserman a,b,1,2, Aaditya Ramdas a,1, Sivaraman Balakrishnan a,1
PMCID: PMC7382245  PMID: 32631986

Significance

Most statistical methods rely on certain mathematical conditions, known as regularity assumptions, to ensure their validity. Without these conditions, statistical quantities like P values and confidence intervals might not be valid. In this paper we give a surprisingly simple method for producing statistical significance statements without any regularity conditions. The resulting hypothesis tests can be used for any parametric model and for several nonparametric models.

Keywords: likelihood, testing, irregular models, confidence sequence

Abstract

We propose a general method for constructing confidence sets and hypothesis tests that have finite-sample guarantees without regularity conditions. We refer to such procedures as “universal.” The method is very simple and is based on a modified version of the usual likelihood-ratio statistic that we call “the split likelihood-ratio test” (split LRT) statistic. The (limiting) null distribution of the classical likelihood-ratio statistic is often intractable when used to test composite null hypotheses in irregular statistical models. Our method is especially appealing for statistical inference in these complex setups. The method we suggest works for any parametric model and also for some nonparametric models, as long as computing a maximum-likelihood estimator (MLE) is feasible under the null. Canonical examples arise in mixture modeling and shape-constrained inference, for which constructing tests and confidence sets has been notoriously difficult. We also develop various extensions of our basic methods. We show that in settings where computing the MLE is hard, for the purpose of constructing valid tests and intervals, it is sufficient to upper bound the maximum likelihood. We investigate some conditions under which our methods yield valid inferences under model misspecification. Further, the split LRT can be used with profile likelihoods to deal with nuisance parameters, and it can also be run sequentially to yield anytime-valid P values and confidence sequences. Finally, when combined with the method of sieves, it can be used to perform model selection with nested model classes.


The foundations of statistics are built on a variety of generally applicable principles for parametric estimation and inference. In parametric statistical models, the likelihood-ratio test and confidence intervals obtained from asymptotically Gaussian estimators are the workhorse tools for inference. Often, the validity of these methods relies on large-sample asymptotic theory and requires that the statistical model satisfy certain regularity conditions; see Section 2 for precise definitions. When these conditions do not hold, there is no general method for statistical inference, and such settings are typically handled in an ad hoc manner. Here, we introduce a universal method which yields tests and confidence sets for any statistical model and has finite-sample guarantees.

We begin with some terminology. A parametric statistical model is a collection of distributions $\{P_\theta : \theta \in \Theta\}$ for an arbitrary set $\Theta$. When the aforementioned regularity conditions hold, there are many methods for inference. For example, if $\Theta \subseteq \mathbb{R}^d$, the set

$$A_n = \left\{ \theta : 2 \log \frac{L(\hat\theta)}{L(\theta)} \le c_{\alpha,d} \right\}$$ [1]

is the likelihood-ratio confidence set, where $c_{\alpha,d}$ is the upper $\alpha$ quantile of a $\chi^2_d$ distribution, $L$ is the likelihood function, and $\hat\theta$ is the maximum-likelihood estimator (MLE). It satisfies the asymptotic coverage guarantee

$$P_{\theta^*}(\theta^* \in A_n) \to 1 - \alpha$$

as $n \to \infty$, where $P_{\theta^*}$ denotes the unknown true data-generating distribution.

Constructing tests and confidence intervals for irregular models—where the regularity conditions do not hold—is very difficult (1). An example is mixture models. In this case we observe $Y_1, \ldots, Y_n \sim P$ and we want to test

$$H_0 : P \in \mathcal{M}_{k_0} \quad \text{versus} \quad H_1 : P \in \mathcal{M}_{k_1},$$ [2]

where $\mathcal{M}_k$ denotes the set of mixtures of $k$ Gaussians, with an appropriately restricted parameter space $\Theta$ (see for instance ref. 2) and with $k_0 < k_1$. Finding a test that provably controls the type I error at a given level has been elusive. A natural candidate is to base the test on the likelihood-ratio statistic, but this turns out to have an intractable limiting distribution (3). As we discuss further in Section 3, developing practical, simple tests for this pair of hypotheses is an active area of research (refs. 4–6 and references therein). However, it is possible that we may be able to compute an MLE using variants of the expectation–maximization (EM) algorithm. In this paper, we show that there is a remarkably simple test based on the MLE with guaranteed finite-sample control of the type I error. Similarly, we construct a confidence set for the parameters of a mixture model with guaranteed finite-sample coverage. These tests and confidence sets can in fact be used for any model. In regular statistical models (those for which the usual LRT is well behaved), our methods may not be optimal, although we do not yet fully understand how close to optimal they are beyond special cases (uniform, Gaussian). Our test is most useful in irregular (or singular) models for which valid tests are not known or require many assumptions. Going beyond parametric models, we show that our methods can be used for several nonparametric models as well and have a natural sequential analog.

1. Universal Inference

Let $Y_1, \ldots, Y_{2n}$ be an independent and identically distributed (i.i.d.) sample from a distribution $P_{\theta^*}$ which belongs to a collection $(P_\theta : \theta \in \Theta)$. Note that $\theta^*$ denotes the true value of the parameter. Assume that each distribution $P_\theta$ has a density $p_\theta$ with respect to some underlying measure $\mu$ (for instance, the Lebesgue or counting measure).

A Universal Confidence Set.

We construct a confidence set for $\theta^*$ by first splitting the data into two groups $D_0$ and $D_1$. For simplicity, we take each group to be of the same size $n$, but this is not necessary. Let $\hat\theta_1$ be any estimator constructed from $D_1$; this can be the MLE, a Bayes estimator that utilizes prior knowledge, a robust estimator, etc. Let

$$L_0(\theta) = \prod_{i \in D_0} p_\theta(Y_i)$$

denote the likelihood function based on D0. We define the split likelihood-ratio statistic (split LRS) as

$$T_n(\theta) = \frac{L_0(\hat\theta_1)}{L_0(\theta)}.$$ [3]

Then, the universal confidence set is

$$C_n = \left\{ \theta \in \Theta : T_n(\theta) \le \frac{1}{\alpha} \right\}.$$ [4]

Similarly, define the cross-fit LRS as

$$S_n(\theta) = \big(T_n(\theta) + T_n^{\mathrm{swap}}(\theta)\big)/2,$$ [5]

where $T_n^{\mathrm{swap}}$ is formed by calculating $T_n$ after swapping the roles of $D_0$ and $D_1$. We can also define $C_n$ with $S_n$ in place of $T_n$.

Theorem 1.

$C_n$ is a finite-sample valid $(1-\alpha)$ confidence set for $\theta^*$, meaning that $P_{\theta^*}(\theta^* \in C_n) \ge 1 - \alpha$.

If we did not split the data and $\hat\theta_1$ were the MLE, then $T_n(\theta)$ would be the usual likelihood-ratio statistic and we would typically approximate its distribution using an asymptotic argument. For example, as mentioned earlier, in regular models, $-2$ times the log-likelihood-ratio statistic has, asymptotically, a $\chi^2_d$ distribution. But in irregular models this strategy can fail. Indeed, finding or approximating the distribution of the likelihood-ratio statistic is highly nontrivial in irregular models. The split LRS avoids these complications.

Now we explain why $C_n$ has coverage at least $1-\alpha$, as claimed by Theorem 1. We prove it for the version using $T_n$, but the proof for $S_n$ is identical. Consider any fixed $\psi \in \Theta$ and let $A$ denote the support of $P_{\theta^*}$. Then,

$$\mathbb{E}_{\theta^*}\!\left[\frac{L_0(\psi)}{L_0(\theta^*)}\right] = \mathbb{E}_{\theta^*}\!\left[\frac{\prod_{i \in D_0} p_\psi(Y_i)}{\prod_{i \in D_0} p_{\theta^*}(Y_i)}\right] = \int_{A} \frac{\prod_{i \in D_0} p_\psi(y_i)}{\prod_{i \in D_0} p_{\theta^*}(y_i)} \prod_{i \in D_0} p_{\theta^*}(y_i)\, dy_1 \cdots dy_n = \int_{A} \prod_{i \in D_0} p_\psi(y_i)\, dy_1 \cdots dy_n \le \prod_{i \in D_0} \int p_\psi(y_i)\, dy_i = 1.$$

Since $\hat\theta_1$ is fixed when we condition on $D_1$, we have

$$\mathbb{E}_{\theta^*}[T_n(\theta^*) \mid D_1] = \mathbb{E}_{\theta^*}\!\left[\frac{L_0(\hat\theta_1)}{L_0(\theta^*)} \,\middle|\, D_1\right] \le 1.$$ [6]

Now, using Markov’s inequality,

$$P_{\theta^*}(\theta^* \notin C_n) = P_{\theta^*}\!\left(T_n(\theta^*) > \frac{1}{\alpha}\right) \le \alpha\, \mathbb{E}_{\theta^*}[T_n(\theta^*)] = \alpha\, \mathbb{E}_{\theta^*}\!\left[\frac{L_0(\hat\theta_1)}{L_0(\theta^*)}\right] = \alpha\, \mathbb{E}_{\theta^*}\!\left[\mathbb{E}_{\theta^*}\!\left[\frac{L_0(\hat\theta_1)}{L_0(\theta^*)} \,\middle|\, D_1\right]\right] \le \alpha.$$ [7]
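To make the construction concrete, the following minimal sketch (not from the paper; the Gaussian model, estimator choice, and parameter grid are illustrative assumptions) computes the split LRS confidence set for a one-dimensional Gaussian mean with known variance.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.1, 100
y = rng.normal(1.3, 1.0, size=2 * n)      # true theta* = 1.3
d0, d1 = y[:n], y[n:]                     # split into D0 and D1

theta_hat_1 = d1.mean()                   # any estimator from D1; here the MLE

def log_lik(theta, data):
    # Gaussian log-likelihood with unit variance (up to an additive constant).
    return -0.5 * np.sum((data - theta) ** 2)

# C_n contains every theta with T_n(theta) <= 1/alpha, i.e.,
# log L0(theta_hat_1) - log L0(theta) <= log(1/alpha).
grid = np.linspace(-1.0, 4.0, 2001)
keep = np.array([log_lik(theta_hat_1, d0) - log_lik(t, d0) <= np.log(1 / alpha)
                 for t in grid])
print("C_n is approximately [%.3f, %.3f]" % (grid[keep].min(), grid[keep].max()))
```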

Remark 2:

The parametric setup adopted above generalizes easily to nonparametric settings as long as we can calculate a likelihood. For a collection of densities $\mathcal{P}$, and a true density $p^* \in \mathcal{P}$, suppose we use $D_1$ to identify $\hat p_1 \in \mathcal{P}$ and $D_0$ to calculate

$$T_n(p) = \prod_{i \in D_0} \frac{\hat p_1(Y_i)}{p(Y_i)}.$$

We then define $C_n \equiv \{p \in \mathcal{P} : T_n(p) \le 1/\alpha\}$, and our previous argument ensures that $P_{p^*}(p^* \in C_n) \ge 1 - \alpha$.

A Universal Hypothesis Test.

Now we turn to hypothesis testing. Let $\Theta_0 \subset \Theta$ be a possibly composite null set and consider testing

$$H_0 : \theta^* \in \Theta_0 \quad \text{versus} \quad \theta^* \notin \Theta_0.$$ [8]

The alternative above can be replaced by $\theta^* \in \Theta_1$ for any $\Theta_1 \subseteq \Theta$ or by $\theta^* \in \Theta_1 \setminus \Theta_0$. One way to test this hypothesis is based on the universal confidence set in Eq. 4. We simply reject the null hypothesis if $C_n \cap \Theta_0 = \emptyset$. It is straightforward to see that if this test makes a type I error, then the universal confidence set must fail to cover $\theta^*$, and so the type I error of this test is at most $\alpha$.

We present an alternative method that is often computationally (and possibly statistically) more attractive. Let $\hat\theta_1$ be any estimator constructed from $D_1$, and let

$$\hat\theta_0 \in \arg\max_{\theta \in \Theta_0} L_0(\theta)$$

be the MLE under H0 constructed from D0. Then the universal test, which we call the split likelihood-ratio test (split LRT), is defined as

$$\text{reject } H_0 \text{ if } U_n > 1/\alpha, \quad \text{where} \quad U_n = \frac{L_0(\hat\theta_1)}{L_0(\hat\theta_0)}.$$ [9]

Similarly, we can define the cross-fit LRT as

$$\text{reject } H_0 \text{ if } W_n > 1/\alpha, \quad \text{where} \quad W_n = \frac{U_n + U_n^{\mathrm{swap}}}{2},$$ [10]

where, as before, $U_n^{\mathrm{swap}}$ is calculated like $U_n$ after swapping the roles of $D_0$ and $D_1$.

Theorem 3.

The split and cross-fit LRTs control the type I error at $\alpha$; i.e., $\sup_{\theta^* \in \Theta_0} P_{\theta^*}(U_n > 1/\alpha) \le \alpha$.

The proof is straightforward. We prove it for the split LRT, but once again the cross-fit proof is identical. Suppose that $H_0$ is true and $\theta^* \in \Theta_0$ is the true parameter. By Markov’s inequality, the type I error is

$$P_{\theta^*}(U_n > 1/\alpha) = P_{\theta^*}\!\left(L_0(\hat\theta_1)/L_0(\hat\theta_0) > 1/\alpha\right) \le \alpha\, \mathbb{E}_{\theta^*}\!\left[\frac{L_0(\hat\theta_1)}{L_0(\hat\theta_0)}\right] \overset{(i)}{\le} \alpha\, \mathbb{E}_{\theta^*}\!\left[\frac{L_0(\hat\theta_1)}{L_0(\theta^*)}\right] \overset{(ii)}{\le} \alpha.$$

Above, inequality (i) uses the fact that $L_0(\hat\theta_0) \ge L_0(\theta^*)$, which is true when $\hat\theta_0$ is the MLE, and inequality (ii) follows by conditioning on $D_1$ as argued earlier in Eq. 7.
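As a concrete illustration, here is a small sketch of the split LRT (the unit-variance Gaussian model and the composite null $\theta \le 0$ are illustrative assumptions, chosen so that the restricted MLE is simply the truncated sample mean).

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 0.05, 100
y = rng.normal(0.0, 1.0, size=2 * n)     # data generated under the null theta <= 0
d0, d1 = y[:n], y[n:]

def log_lik(theta, data):
    return -0.5 * np.sum((data - theta) ** 2)

theta_hat_1 = d1.mean()                  # unrestricted estimator from D1
theta_hat_0 = min(d0.mean(), 0.0)        # MLE over the null set {theta <= 0} from D0

# Split LRT: reject when U_n > 1/alpha, i.e., when log U_n > log(1/alpha).
log_U = log_lik(theta_hat_1, d0) - log_lik(theta_hat_0, d0)
print("log U_n = %.2f, reject H0: %s" % (log_U, log_U > np.log(1 / alpha)))
```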

Remark 4.

We may drop the use of $\Theta, \Theta_0, \Theta_1$ above and extend the split LRT to a general nonparametric setup. Both tests can be used to test any null $H_0 : p^* \in \mathcal{P}_0$ against any alternative $H_1 : p^* \in \mathcal{P}_1$. Importantly, no parametric assumption is needed on $\mathcal{P}_0, \mathcal{P}_1$, and no relationship is imposed whatsoever between $\mathcal{P}_0$ and $\mathcal{P}_1$. As before, use $D_1$ to identify $\hat p_1 \in \mathcal{P}_1$, use $D_0$ to calculate the MLE $\hat p_0 \in \mathcal{P}_0$, and define $U_n = \prod_{i \in D_0} \frac{\hat p_1(Y_i)}{\hat p_0(Y_i)}$.

We call these procedures universal to mean that they are valid in finite samples with no regularity conditions. Constructions like this are reminiscent of ideas used in sequential settings where an estimator is computed from past data and the likelihood is evaluated on current data; we expand on this in Section 7.

We note in passing that another universal set is the following. Define $C = \left\{\theta : \int_\Theta L(\psi)\, d\Pi(\psi) \big/ L(\theta) \le 1/\alpha\right\}$, where $L$ is the full likelihood (from all of the data) and $\Pi$ is any prior. This also has the same coverage guarantee but requires specifying a prior and doing an integral. In irregular or nonparametric models, the integral will typically be intractable.

Perspective: Poor Man’s Chernoff Bound.

At first glance, the reader may worry that Markov’s inequality seems like a weak tool to use, resulting in an underpowered conservative test or confidence interval. However, this is not the right perspective. One should really view our proof as using a “poor man’s Chernoff bound.”

For a regular model, we would usually compare the log-likelihood ratio to the $(1-\alpha)$ quantile of a $\chi^2$ distribution (with degrees of freedom related to the difference in dimensionality of the null and alternate models). Instead, we compare the log-split-likelihood ratio to $\log(1/\alpha)$, which scales like the $(1-\alpha)$ quantile of a $\chi^2$ distribution with one degree of freedom.

In any case, instead of finding the asymptotic distribution of $\log U_n$ (usually having a moment-generating function, like a $\chi^2$), our proof should be interpreted as using the simpler but nontrivial fact that $\mathbb{E}_{\theta^*}[e^{\log U_n}] \le 1$. Hence we are really using the fact that $\log U_n$ has an exponential tail, just as an asymptotic argument would.

A true Chernoff-style bound for a $\chi^2$ random variable would have bounded $\mathbb{E}_{\theta^*}[e^{a \log U_n}]$ by an appropriate function of $a$ and then optimized over the choice of $a > 0$ to obtain a tight bound. Our methods correspond to choosing $a = 1$, leading us to call the technique a poor man’s Chernoff bound. The key point is that our methods should be viewed as using Markov’s inequality on the exponential of the random variable of interest.

Perspective: In-Sample versus Out-of-Sample Likelihood.

We may rewrite the universal set as

$$C_n = \left\{\theta \in \Theta : 2 \log \frac{L_0(\hat\theta_1)}{L_0(\theta)} \le 2 \log(1/\alpha)\right\}.$$

For a regular model, it is natural to compare the above expression to the usual LRT-based set $A_n$ from Eq. 1. At first, it may visually seem like the LRT-based set uses the threshold $c_{\alpha,d}$, while the universal set uses $2 \log(1/\alpha)$, which is much smaller in high dimensions. However, a key point to keep in mind is that, comparing the numerators of the test statistics in both cases, the classical likelihood-ratio set uses an in-sample likelihood and the split LRS confidence set uses an out-of-sample likelihood. Hence, simply comparing the thresholds does not suffice to draw a conclusion about the relative sizes of the confidence sets. We next check that for regular models, the size of the universal set indeed shrinks at the right rate.

2. Sanity Check: Regular Models

Although universal methods are not needed for well-behaved models, it is worth checking their behavior in these cases. We expect that Cn would not have optimal size but we would hope that it still shrinks at the optimal rate. We now confirm that this is true.

Throughout this example we treat the dimension as a fixed constant before subsequently turning our attention to an example where we more carefully track the dependence of the confidence set diameter on dimension. In this and subsequent sections we use standard stochastic order notation: $o_P$ for convergence in probability and $O_P$ for boundedness in probability (7). We make the following regularity assumptions (see for instance ref. 7 for a detailed discussion of these conditions):

  • 1)
    The statistical model is identifiable; i.e., for any $\theta \neq \theta^*$ it is the case that $P_\theta \neq P_{\theta^*}$. The statistical model is differentiable in quadratic mean (DQM) at $\theta^*$; i.e., there exists a function $s_{\theta^*}$ such that
    $$\int \left(\sqrt{p_\theta} - \sqrt{p_{\theta^*}} - \tfrac{1}{2}(\theta - \theta^*)^T s_{\theta^*} \sqrt{p_{\theta^*}}\right)^2 d\mu = o(\|\theta - \theta^*\|^2), \quad \text{as } \theta \to \theta^*.$$
  • 2)
    The parameter space $\Theta \subseteq \mathbb{R}^d$ is compact, and the log-likelihood is a smooth function of $\theta$; i.e., there is a measurable function $\ell$ with $\sup_\theta P_\theta \ell^2 < \infty$ such that for any $\theta_1, \theta_2 \in \Theta$,
    $$|\log p_{\theta_1}(x) - \log p_{\theta_2}(x)| \le \ell(x)\, \|\theta_1 - \theta_2\|.$$
  • 3)
    A consequence of the DQM condition is that the Fisher information matrix
    $$I(\theta^*) \equiv \mathbb{E}_{\theta^*}[s_{\theta^*} s_{\theta^*}^T]$$
    is well defined, and we assume it is nondegenerate.

Under these conditions the optimal confidence set has (expected) diameter $O(1/\sqrt{n})$. Our first result shows that the same is true of the universal set, provided that the initial estimate $\hat\theta_1$ is $\sqrt{n}$-consistent; i.e., $\|\hat\theta_1 - \theta^*\| = O_P(1/\sqrt{n})$. Under the conditions of our theorem, this consistency condition is satisfied when $\hat\theta_1$ is the MLE, but our result is more generally applicable.

Theorem 5.

Suppose that $\hat\theta_1$ is a $\sqrt{n}$-consistent estimator of $\theta^*$. Under the assumptions above, the split LRT confidence set has diameter $O_P\!\left(\sqrt{\log(1/\alpha)/n}\right)$.

A proof of this result is in SI Appendix. At a high level, to bound the diameter of the split LRT set it suffices to show that for any $\theta$ sufficiently far from $\theta^*$, it is the case that

$$\frac{L_0(\theta)}{L_0(\hat\theta_1)} \le \alpha.$$

To establish this, note that we can write this condition as

$$\log \frac{L_0(\theta)}{L_0(\theta^*)} + \log \frac{L_0(\theta^*)}{L_0(\hat\theta_1)} \le \log(\alpha).$$

Bounding the first term requires showing that if we consider any $\theta$ sufficiently far from $\theta^*$, its likelihood is small relative to the likelihood of $\theta^*$. We build on the work of Wong and Shen (8), who provide uniform upper bounds on the likelihood ratio under technical conditions which ensure that the statistical model is not too big. Conversely, to bound the second term we need to argue that if $\hat\theta_1$ is sufficiently close to $\theta^*$, then it must be the case that their likelihoods cannot be too different. This in turn follows by exploiting the DQM condition.

Analyzing the Nonparametric Split LRT.

While our previous result focused on the diameter of the split LRT set in parametric problems, similar techniques also yield results in the nonparametric case. In this case, since we have no underlying parameter space, it will be natural to measure the diameter of our confidence set in terms of some metric on probability distributions. We consider bounding the diameter of our confidence set in the Hellinger metric. Formally, for two distributions P and Q the (squared) Hellinger distance is defined as

$$H^2(P,Q) = \frac{1}{2} \int \left(\sqrt{dP} - \sqrt{dQ}\right)^2.$$

We will also require the use of the $\chi^2$ divergence given by

$$\chi^2(P,Q) = \int \left(\frac{dP}{dQ} - 1\right)^2 dQ,$$

assuming that $P$ is absolutely continuous with respect to $Q$. Roughly, and analogous to our development in the parametric case, to bound the diameter of the split LRT confidence set, we need to ensure that our statistical model $\mathcal{P}$ is not too large and further that our initial estimate $\hat p_1$ is sufficiently close to $p^*$.

To measure the size of $\mathcal{P}$ we use its Hellinger bracketing entropy. Denote by $\log N(u, \mathcal{F})$ the Hellinger bracketing entropy of the class of distributions $\mathcal{F}$, where the bracketing functions are separated by at most $u$ in the Hellinger distance (we refer to ref. 8 for a precise definition). We suppose that the bracketing entropy of $\mathcal{P}$ is not too large; i.e., for some $\epsilon_n > 0$ we have that for some constant $c > 0$,

$$\int_{\epsilon_n^2}^{\sqrt{2}\,\epsilon_n} \sqrt{\log N(u, \mathcal{P})}\, du \le c \sqrt{n}\, \epsilon_n^2.$$ [11]

Although we do not explore this in detail, we note in passing that the smallest value $\epsilon_n$ for which the above condition is satisfied provides an upper bound on the rate of convergence of the nonparametric MLE in the Hellinger distance (8). To characterize the quality of $\hat p_1$ we use the $\chi^2$ divergence. Concretely, we suppose that

$$\chi^2(p^*, \hat p_1) = O_P(\eta_n^2).$$ [12]

Theorem 6.

Under conditions Eqs. 11 and 12, the split LRT confidence set has Hellinger diameter upper bounded by $O_P\!\left(\eta_n + \epsilon_n + \sqrt{\log(1/\alpha)/n}\right)$.

Comparing LRT to Split LRT for the Multivariate Normal Case.

In the previous calculation we treated the dimension of the parameter space as fixed. To understand the behavior of the method as a function of dimension in the regular case, suppose that $Y_1, \ldots, Y_n \sim N_d(\theta, I)$, where $\theta \in \mathbb{R}^d$. Recalling that we use $c_{\alpha,d}$ and $z_\alpha$ to denote the upper $\alpha$ quantiles of the $\chi^2_d$ and standard Gaussian distributions, respectively, the usual confidence set for $\theta$ based on the LRT is

$$A_n = \left\{\theta : \|\theta - \bar Y\|^2 \le \frac{c_{\alpha,d}}{n}\right\} = \left\{\theta : \|\theta - \bar Y\|^2 \le \frac{d + \sqrt{2d}\, z_\alpha + o(\sqrt{d})}{n}\right\},$$

where the second form follows from the normal approximation of the $\chi^2_d$ distribution. For the universal set, we use the sample average from $D_1$ as our initial estimate $\hat\theta_1$. Denoting the sample means by $\bar Y_1$ and $\bar Y_0$, we see that

$$C_n = \left\{\theta : \log L_0(\bar Y_1) - \log L_0(\theta) \le \log(1/\alpha)\right\},$$

which is the set of θ such that

$$-\frac{n}{4}\|\bar Y_0 - \bar Y_1\|_2^2 + \frac{n}{4}\|\theta - \bar Y_0\|_2^2 \le \log\frac{1}{\alpha}.$$

In other words, we may rewrite

$$C_n = \left\{\theta : \|\theta - \bar Y_0\|^2 \le \frac{4}{n}\log\frac{1}{\alpha} + \|\bar Y_0 - \bar Y_1\|^2\right\}.$$

Next, note that $\|\bar Y_0 - \bar Y_1\|^2 = O_P(d/n)$, so both sets have radii $O_P(\sqrt{d/n})$. Precisely, the squared radius $R_n^2$ of $C_n$ is

$$R_n^2 \overset{d}{=} \frac{4\log(1/\alpha) + 4\chi^2_d}{n} \overset{d}{=} \frac{4\log(1/\alpha) + 4d + \sqrt{32d}\, Z + O_P(\sqrt{d})}{n},$$

where $Z$ is an independent standard Gaussian. So both their squared radii share the same scaling with $d$ and $n$, and for large $d$ and constant $\alpha$, the squared radius of $C_n$ is about 4 times larger than that of $A_n$.
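A quick Monte Carlo check of this comparison is sketched below (the dimension, sample size, and number of repetitions are arbitrary illustrative choices; the formulas are the ones displayed above).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d, n, alpha, reps = 500, 2000, 0.1, 5000

# Squared radius of the classical LRT set A_n based on all n points.
r2_lrt = stats.chi2.ppf(1 - alpha, df=d) / n

# Squared radius of C_n: 4 log(1/alpha)/n + ||Ybar0 - Ybar1||^2, where
# Ybar0 - Ybar1 ~ N(0, (4/n) I_d) when each half holds n/2 points.
diff = rng.standard_normal((reps, d)) * np.sqrt(4.0 / n)
r2_split = 4 * np.log(1 / alpha) / n + np.sum(diff ** 2, axis=1)

# The ratio is below 4 for moderate d and tends to 4 as d grows.
print("average squared-radius ratio:", np.mean(r2_split) / r2_lrt)
```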

3. Examples

Mixture Models.

As a proof of concept, we do a small simulation to check the type I error and power for mixture models. Specifically, let $Y_1, \ldots, Y_{2n} \sim P$, where $Y_i \in \mathbb{R}$. We want to distinguish the hypotheses in Eq. 2. For this brief example, we take $k_0 = 1$ and $k_1 = 2$.

Finding a test that provably controls the type I error at a given level has been elusive. A natural candidate is the likelihood-ratio statistic but, as mentioned earlier, this has an intractable limiting distribution. To the best of our knowledge, the only practical test for the above hypothesis with a tractable limiting distribution is the EM test due to ref. 4. This very clever test is similar to the likelihood-ratio test except that it includes some penalty terms and requires the maximization of some of the parameters to be restricted. However, the test requires choosing some tuning parameters and, more importantly, it is restricted to one-dimensional problems. There is no known confidence set for mixture problems with guaranteed coverage properties. Another approach is based on the bootstrap (5) but there is no proof of the validity of the bootstrap for mixtures.

Fig. 1 shows the power of the test when $n = 200$ and $\hat\theta_1$ is the MLE under the full model $\mathcal{M}_2$. The true model is taken to be $(1/2)\phi(y; -\mu, 1) + (1/2)\phi(y; \mu, 1)$, where $\phi(y; \mu, 1)$ is the normal density with mean $\mu$ and variance 1. The null corresponds to $\mu = 0$. We take $\alpha = 0.1$ and the MLE is obtained by the EM algorithm, which we assume converges on this simple problem. Understanding the local and global convergence (and nonconvergence) of the EM algorithm to the MLE is an active research area but is beyond the scope of this paper (refs. 9–11 and references therein). As expected, the test is conservative with type I error near 0 but has reasonable power when $\mu > 1$.

Fig. 1.

The plot shows the power of the universal (black) and bootstrap (red) tests for a simple Gaussian mixture, as the mean separation $\mu$ varies ($\mu = 0$ is the null). The sample size is $n = 200$ and the target level is $\alpha = 0.1$.

Fig. 1 also shows the power of the bootstrap test (5). Here, the P value is obtained by bootstrapping the LRS under the estimated null distribution. As expected, this has higher power than the universal test since it does not split the data. In this simulation, both tests control the type I error, but unfortunately the bootstrap test does not have any guarantee on the type I error, even asymptotically. The lower power of the universal test is the price paid for having a finite-sample guarantee. It is also worth noting that the bootstrap test requires running the EM algorithm for each bootstrap sample while the universal test requires only one EM run.
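For readers who want to try the split LRT on a mixture, the sketch below uses the EM implementation in scikit-learn (an assumption of this illustration; it fits unconstrained one- and two-component Gaussian mixtures and does not reproduce the restricted parameterization or the exact simulation behind Fig. 1).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
alpha, n, mu = 0.1, 200, 1.5
# Data from a two-component mixture with means -mu and +mu (the alternative).
z = rng.integers(0, 2, size=2 * n)
y = rng.normal(np.where(z == 1, mu, -mu), 1.0).reshape(-1, 1)
d0, d1 = y[:n], y[n:]

# EM fits: the alternative (k=2) on D1 and the null MLE (k=1) on D0.
p1_hat = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(d1)
p0_hat = GaussianMixture(n_components=1, random_state=0).fit(d0)

# Split LRT: reject H0 (one component) if log U_n > log(1/alpha).
log_U = p1_hat.score_samples(d0).sum() - p0_hat.score_samples(d0).sum()
print("log U_n = %.2f, reject H0: %s" % (log_U, log_U > np.log(1 / alpha)))
```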

Model Selection Using Sieves.

Sieves are a general approach to nonparametric inference. A sieve (12) is a sequence of nested models $\mathcal{P}_1 \subseteq \mathcal{P}_2 \subseteq \cdots$. If we assume that the true density $p^*$ is in $\mathcal{P}_j$ for some (unknown) $j$, then universal testing can be used to choose the model. One possibility is to test $H_j : p^* \in \mathcal{P}_j$ one by one for $j = 1, 2, \ldots$. We reject $H_j$ if

$$\prod_{i \in D_0} \frac{\hat p_{j+1}(Y_i)}{\hat p_j(Y_i)} > 1/\alpha,$$

where $\hat p_j$ is the MLE in model $\mathcal{P}_j$ computed from $D_0$ (and, as before, $\hat p_{j+1}$ is computed from $D_1$). Then we take $\hat j$ to be the first $j$ such that $H_j$ is not rejected and proclaim that $p^* \in \mathcal{P}_j$ only for $j \ge \hat j$; that is, $p^* \notin \mathcal{P}_{\hat j - 1}$. Even though we test multiple different hypotheses and stop at a random $\hat j$, this procedure still controls the type I error, meaning that

$$P_{p^*}\!\left(p^* \in \mathcal{P}_{\hat j - 1}\right) \le \alpha,$$

so that our proclamation is correct with high probability. The reason we do not need to correct for multiple testing is that a type I error can occur only once we have reached the first $j$ such that $p^* \in \mathcal{P}_j$.

A simple application is to choose the number of mixture components in a mixture model, as discussed in the previous example. Here are some other interesting examples in which the aforementioned ideas yield valid tests and model selection using sieves: 1) testing the number of hidden states in a hidden Markov model (the MLE is computable using the Baum–Welch algorithm), 2) testing the number of latent factors in a factor model, and 3) testing the sparsity level in a high-dimensional linear model $Y = X\beta + \epsilon$ (under $H_0 : \beta$ is $k$-sparse, the MLE corresponds to best-subset selection).

Whenever we can compute the MLE (specifically, the likelihood it achieves), then we can run our universal test, and we can do model selection using sieves. We will later see that an upper bound of the maximum likelihood suffices and is sometimes achievable by minimizing convex relaxations of the negative log-likelihood.
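A sketch of the sieve procedure for choosing the number of mixture components follows (again leaning on scikit-learn's EM as an assumption; the data-generating mixture and the cap on the number of components are illustrative).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
alpha, n = 0.1, 500
means = np.array([-4.0, 0.0, 4.0])                      # true model: 3 components
y = rng.normal(means[rng.integers(0, 3, size=2 * n)], 1.0).reshape(-1, 1)
d0, d1 = y[:n], y[n:]

def fit(k, data):
    return GaussianMixture(n_components=k, n_init=5, random_state=0).fit(data)

j = 1
while True:
    # Reject H_j if the (j+1)-component fit from D1 beats the j-component
    # null MLE from D0 by a factor of more than 1/alpha on D0.
    log_ratio = fit(j + 1, d1).score_samples(d0).sum() - fit(j, d0).score_samples(d0).sum()
    if log_ratio <= np.log(1 / alpha) or j >= 10:
        break
    j += 1
print("first non-rejected model: j =", j)
```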

Nonparametric Example: Shape-Constrained Inference.

A density $p$ is log-concave if $p = e^g$ for some concave function $g$. Consider testing $H_0 : p$ is log-concave versus $H_1 : p$ is not log-concave. Let $\mathcal{P}_0$ be the set of log-concave densities and let $\hat p_0$ denote the nonparametric maximum-likelihood estimator over $\mathcal{P}_0$ computed using $D_0$ (13), which can be computed in polynomial time (14). Let $\hat p_1$ be any nonparametric density estimator, such as the kernel density estimator (15), fitted on $D_1$. In this case, the universal test is to reject $H_0$ when

$$\prod_{i \in D_0} \frac{\hat p_1(Y_i)}{\hat p_0(Y_i)} > \frac{1}{\alpha}.$$

To the best of our knowledge this is the first test for this problem with a finite-sample guarantee. Under the assumption that $p \in \mathcal{P}_0$, the universal confidence set is

$$C_n = \left\{p \in \mathcal{P}_0 : \prod_{i \in D_0} p(Y_i) \ge \alpha \prod_{i \in D_0} \hat p_1(Y_i)\right\}.$$

While the aforementioned test can be performed efficiently, the set $C_n$ may be hard to represent explicitly, but we can efficiently check whether any given $p \in C_n$.

Positive Dependence (Multivariate Total Positivity of Order 2).

The split LRT solves a variety of open problems related to testing for a general notion of positive dependence called multivariate total positivity of order 2 (MTP2) (16). The convex optimization problem of maximum-likelihood estimation in Gaussian models under total positivity was recently solved (17), but in ref. 17, example 5.8 and the following discussion, they state that the testing problem is still open. Given data from a multivariate distribution p, consider testing H0:p is Gaussian MTP2 against H1:p is Gaussian (or an even more general alternative). Since proposition 2.2 in ref. 17 shows that the MLE under the null can be efficiently calculated, our universal test is applicable.

In fact, calculating the MLE in any MTP2 exponential family is a convex optimization problem (ref. 18, theorem 3.1), thus making a test immediately feasible. As a particularly interesting special case, ref. 18, section 5.1 provides an algorithm for computing the MLE for MTP2 Ising models. Testing $H_0 : p$ is Ising MTP2 against $H_1 : p$ is Ising is stated as an open problem in ref. 18, section 6, and is solved by our universal test. (We remark that even though the MTP2 MLE is efficiently computable, evaluating the maximum likelihood in the Ising case may still take $O(2^d)$ time for a $d$-dimensional problem.)

Finally, MTP2 can be combined with log-concavity, uniting shape constraints and dependence. General existence and uniqueness properties of the MLE for totally positive log-concave densities have been recently derived (19), along with efficient algorithms to compute the MLE. Our methods immediately yield a test for H0:p is MTP2 log-concave against H1:p is log-concave.

All of the above models were singular, and hence the LRS has been hard to study. In some cases, its asymptotic null distribution is known to be a weighted sum of χ2 distributions, where the weights are rather complicated properties of the distributions (usually unknown to the practitioner). In contrast, the split LRT is applicable without assumptions, and its validity is nonasymptotic.

Independence versus Conditional Independence.

Consider data that are trivariate vectors of the form (X1i,X2i,X3i) which are modeled as trivariate normal. The goal is to test H0:X1 and X2 are independent versus H1:X1 and X2 are independent given X3. The motivation for this test is that this problem arises in the construction of causal graphs. It is surprisingly difficult to test these nonnested hypotheses. Indeed, Guo and Richardson (20) study carefully the subtleties of the problem and they show that the limiting distribution of the LRS is complicated and cannot be used for testing. They propose a new test based on a concept called envelope distributions. Despite the fact that the hypotheses are nonnested, the universal test is applicable and can be used quite easily for this problem. Further, one can also flip H0 and H1 and test for conditional independence in the Gaussian setting as well. We leave it to future work to compare the power of the universal test and the envelope test.

Cross-Fitting Can Beat Splitting: Uniform Distribution.

In all previous examples, the split LRT is a reasonable choice. However, in this example, the cross-fit approach easily dominates the split approach. Note that this is a case where we would not recommend our universal tests since there are well-studied standard confidence intervals in this model. The example is just meant to bring out the difference between the split and cross-fit approaches.

Suppose that $p_\theta$ is the uniform density on $[0, \theta]$. Let us take $\hat\theta_1$ to be the MLE from $D_1$; thus, $\hat\theta_1$ is the maximum of the data points in $D_1$. Now $L_0(\theta) = \theta^{-n} I(\theta \ge \hat\theta_0)$, where $\hat\theta_0$ is the maximum of the data points in $D_0$. It follows that $C_n = [0, \infty)$ whenever $\hat\theta_1 < \hat\theta_0$, which happens with probability 1/2. The set $C_n$ has the required coverage but is too large to be useful. This happens because the densities have different supports. A similar phenomenon occurs when testing $H_0 : \theta = A$ versus $H_1 : \theta \in \mathbb{R}^+$ for some fixed $A > 0$, but not when testing against $H_1 : \theta > A$. One can partially avoid this behavior by choosing $\hat\theta_1$ to not be the MLE. However, the simplest way to avoid the degeneracy is to use the cross-fit approach, where we swap the roles of $D_0$ and $D_1$ and average the resulting test statistics. Exactly one of the two test statistics will be 0, and hence the average will be nonzero. Further, it is easy to show that this test and the resulting interval are rate optimal, losing a constant factor due to data splitting relative to the standard tests and interval constructions. In more detail, the classical (exact) pivotal $1-\alpha$ confidence interval for $\theta$ is $C_{2n} = [\hat\theta, \hat\theta (1/\alpha)^{1/(2n)}]$, where $\hat\theta$ is the maximum of all of the data points. On the other hand, for $\hat\theta_1, \hat\theta_0$ defined above, assuming without loss of generality that $\hat\theta_0 \le \hat\theta_1$, a direct calculation shows that the cross-fit interval takes the form $C_n = [\hat\theta_0, \hat\theta_1 (2/\alpha)^{1/n}]$. Ignoring constants, both these intervals have expected length $O(\theta \log(1/\alpha)/n)$.
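The closed-form intervals above are easy to evaluate; a small sketch (with an arbitrary $\theta$ and sample size, chosen only for illustration) comparing them is given below.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, n, theta = 0.1, 100, 2.0
y = rng.uniform(0, theta, size=2 * n)
d0, d1 = y[:n], y[n:]
t0, t1 = d0.max(), d1.max()            # per-split MLEs
t_all = y.max()                        # MLE on all 2n points

# Classical exact interval based on the pivot (theta_hat / theta)^(2n).
classical = (t_all, t_all * (1 / alpha) ** (1 / (2 * n)))

# Cross-fit interval [min, max * (2/alpha)^(1/n)] from the displayed formula.
crossfit = (min(t0, t1), max(t0, t1) * (2 / alpha) ** (1 / n))

print("classical: [%.4f, %.4f]" % classical)
print("cross-fit: [%.4f, %.4f]" % crossfit)
```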

4. Derandomization

The universal method involves randomly splitting the data and the final inferences will depend on the randomness of the split. This may lead to instability, where different random splits produce different results; in a related context, this has been called the “P-value lottery” (21).

We can remove or reduce the variability of our inferences, at the cost of more computation, by using many splits, while maintaining the validity of the method. The key property that we used in both the universal confidence set and the split LRT is that $\mathbb{E}_{\theta^*}[T_n] \le 1$, where $T_n = L_0(\hat\theta_1)/L_0(\theta^*)$ for the confidence set and $T_n = L_0(\hat\theta_1)/L_0(\hat\theta_0)$ (under the null) for the test. Imagine that we obtained $B$ such statistics $T_{n,1}, \ldots, T_{n,B}$ with the same property. Let

$$\bar T_n = B^{-1} \sum_{j=1}^{B} T_{n,j}.$$

Then we still have that $\mathbb{E}_{\theta^*}[\bar T_n] \le 1$, and so inference using our universal methods can proceed using the combined statistic $\bar T_n$. Note that this is true regardless of the dependence between the statistics.
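The averaging step is a one-liner in practice; the sketch below (testing a point null for a Gaussian mean, an illustrative choice) averages the split statistic over $B$ random splits.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, n, B = 0.05, 200, 50
y = rng.normal(0.0, 1.0, size=2 * n)          # the point null theta = 0 is true

def log_lik(theta, data):
    return -0.5 * np.sum((data - theta) ** 2)

stats = []
for _ in range(B):
    perm = rng.permutation(2 * n)
    d0, d1 = y[perm[:n]], y[perm[n:]]
    theta_hat_1 = d1.mean()
    # U_n for testing theta = 0 on this random split.
    stats.append(np.exp(log_lik(theta_hat_1, d0) - log_lik(0.0, d0)))

U_bar = np.mean(stats)        # averaging preserves E[U_bar] <= 1 under the null
print("reject H0:", U_bar > 1 / alpha)
```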

Using the aforementioned idea, we can immediately design natural variants of the universal method:

  • K-fold. We can split the data once into $2 \le K \le n$ folds. Then repeat the following $K$ times: Use $K-1$ folds to calculate $\hat\theta_1$ and evaluate the likelihood ratio on the remaining fold. Finally, average the $K$ statistics. Alternatively, we could use one fold to calculate $\hat\theta_1$ and evaluate the likelihood on the other $K-1$ folds.

  • Subsampling. We do not need to split the data just once into K folds. We can repeat the previous procedure for repeated random splits of the data into K folds. We expect this to reduce variance that arises from the algorithmic randomness.

  • All splits. We can remove all algorithmic randomness by considering all possible splits. While this is computationally infeasible, the potential statistical gains are worth studying.

We remark that all these variants allow a large amount of flexibility. For example, in cross-fitting, $\hat\theta_1$ need not be used the same way in both splits: It could be the MLE on one split, but a Bayesian estimator on another split. This flexibility could be useful if the user does not know in advance which variant would lead to higher power and would like to hedge across multiple natural choices. Similarly, in the $K$-fold version, if a user is unsure whether to evaluate the likelihood ratio on one fold or on $K-1$ folds, then the user can do both and average the statistics.

Of course, with such flexibility comes the risk of an analyst cherry picking the variant used after looking at which form of averaging results in the highest LR (this would correspond to taking the maximum instead of the average of multiple variants), but this is a broader issue. For this reason (and this reason alone), the cross-fitting LRT proposed initially may be a useful default in practice, since it is both conceptually and computationally simple. We have already seen that (twofold) cross-fit inference improves over split inference drastically in the case of the uniform distribution discussed in the previous section. We leave a more detailed theoretical and empirical analysis of the power of these variants to future work.

5. Extensions

Profile Likelihood and Nuisance Parameters.

Suppose that we are interested in some function ψ=g(θ). Let

$$B_n = \left\{\psi : C_n \cap g^{-1}(\psi) \neq \emptyset\right\},$$

where we define $g^{-1}(\psi) = \{\theta : g(\theta) = \psi\}$. By construction, $B_n$ is a $1-\alpha$ confidence set for $\psi$. Defining the profile-likelihood function

$$L_0(\psi) = \sup_{\theta : g(\theta) = \psi} L_0(\theta),$$ [13]

we can rewrite $B_n$ as

$$B_n = \left\{\psi : \frac{L_0(\hat\theta_1)}{L_0(\psi)} \le \frac{1}{\alpha}\right\}.$$ [14]

In other words, the same data-splitting idea works for the profile likelihood too. As a particularly useful example, suppose $\theta = (\theta_u, \theta_n)$, where $\theta_n$ is a nuisance component; then we can define $g(\theta) = \theta_u$ to obtain a universal confidence set for only the component $\theta_u$ we care about.
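As an illustration of the profile construction, the sketch below (a Gaussian with unknown mean and variance, with the variance as the nuisance; all choices are illustrative assumptions) computes $B_n$ for the mean on a grid.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, n = 0.1, 200
y = rng.normal(1.0, 2.0, size=2 * n)       # mean 1, sd 2; the variance is a nuisance
d0, d1 = y[:n], y[n:]

def gauss_loglik(data, mu, var):
    return -0.5 * len(data) * np.log(2 * np.pi * var) - np.sum((data - mu) ** 2) / (2 * var)

mu1, var1 = d1.mean(), d1.var()            # theta_hat_1 = (mean, variance) from D1
num = gauss_loglik(d0, mu1, var1)          # log L0(theta_hat_1)

def profile_loglik(psi):
    # sup over the nuisance variance of the D0 log-likelihood at mean psi.
    return gauss_loglik(d0, psi, np.mean((d0 - psi) ** 2))

grid = np.linspace(0.0, 2.0, 1001)
keep = np.array([num - profile_loglik(p) <= np.log(1 / alpha) for p in grid])
print("B_n is approximately [%.3f, %.3f]" % (grid[keep].min(), grid[keep].max()))
```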

Upper Bounding the Null Maximum Likelihood.

Computing the MLE and/or the maximum likelihood (under the null) is sometimes computationally hard. Suppose one could come up with a relaxation F0 of the null likelihood L0. This should be a proper relaxation in the sense that

$$\max_\theta F_0(\theta) \ge \max_\theta L_0(\theta).$$

For example, $L_0$ may be defined as $-\infty$ outside its domain, but $F_0$ could extend the domain. As another example, instead of minimizing the negative log-likelihood, which could be nonconvex and hence hard to minimize, we could minimize a convex relaxation. In such settings, define

$$\hat\theta_0^F \in \arg\max_\theta F_0(\theta).$$

If we define the test statistic

$$T_n' \equiv \frac{L_0(\hat\theta_1)}{F_0(\hat\theta_0^F)},$$

then the split LRT may proceed using $T_n'$ instead of $T_n$. This is because $F_0(\hat\theta_0^F) \ge L_0(\hat\theta_0)$, and hence $T_n' \le T_n$.

One particular case when this would be useful is the following. While discussing sieves, we mentioned that testing the sparsity level in a high-dimensional linear model involves solving the best-subset selection problem, which is NP-hard in the worst case. There exist well-known quadratic programming relaxations that are more computationally tractable. Another example is testing whether a random graph is a stochastic block model, for which semidefinite relaxations of the MLE are well studied (22); similar situations arise in communication theory (23) and angular synchronization (24).

The takeaway message is that it suffices to upper bound the maximum likelihood to perform inference.

Robustness via Powered Likelihoods.

It has been suggested by some authors (25–29) that inferences can be made robust by replacing the likelihood $L$ with the power likelihood $L^\eta$ for some $0 < \eta < 1$. Note that

$$\mathbb{E}_{\theta}\!\left[\left(\frac{L_0(\hat\theta_1)}{L_0(\theta)}\right)^{\!\eta} \,\middle|\, D_1\right] = \prod_{i \in D_0} \int p_{\hat\theta_1}^{\eta}(y_i)\, p_{\theta}^{1-\eta}(y_i)\, dy_i \le 1,$$

and hence all of the aforementioned methods can be used with the robustified likelihood as well. (The last inequality follows because the $\eta$-Rényi divergence is nonnegative.)

Smoothed Likelihoods.

Sometimes the MLE is not consistent, or it may not exist because the likelihood function is unbounded, and a (doubly) smoothed likelihood has been proposed as an alternative (30). For simplicity, consider a kernel $k(x,y)$ such that $\int k(x,y)\, dy = 1$ for any $x$, for example a Gaussian or Laplace kernel. For any density $p_\theta$, let its smoothed version be denoted

$$\tilde p_\theta(y) \equiv \int k(x,y)\, p_\theta(x)\, dx.$$

Note that $\tilde p_\theta$ is also a probability density. Denote the smoothed empirical density based on $D_0$ as

$$\tilde p_n \equiv \frac{1}{|D_0|} \sum_{i \in D_0} k(X_i, \cdot).$$

Define the smoothed maximum-likelihood estimator as the Kullback–Leibler (KL) projection of $\tilde p_n$ onto $\{\tilde p_\theta\}_{\theta \in \Theta_0}$,

$$\tilde\theta_0 \in \arg\min_{\theta \in \Theta_0} K(\tilde p_n, \tilde p_\theta),$$

where $K(P,Q)$ denotes the KL divergence between $P$ and $Q$. If we define the smoothed likelihood on $D_0$ as

$$\tilde L_0(\theta) \equiv \prod_{i \in D_0} \exp\left\{\int k(X_i, y) \log \tilde p_\theta(y)\, dy\right\},$$

then it can be checked that $\tilde\theta_0$ maximizes the smoothed likelihood; that is, $\tilde\theta_0 = \arg\max_{\theta \in \Theta_0} \tilde L_0(\theta)$. As before, let $\hat\theta_1 \in \Theta$ be any estimator based on $D_1$. The smoothed split LRT is defined analogously to Eq. 9 as

$$\text{reject } H_0 \text{ if } \tilde U_n > 1/\alpha, \quad \text{where} \quad \tilde U_n = \frac{\tilde L_0(\hat\theta_1)}{\tilde L_0(\tilde\theta_0)}.$$ [15]

We now verify that the smoothed split LRT controls the type I error. First, for any fixed $\psi \in \Theta$, we have

$$\mathbb{E}_{\theta^*}\!\left[\frac{\tilde L_0(\psi)}{\tilde L_0(\tilde\theta_0)}\right] \overset{(i)}{\le} \mathbb{E}_{\theta^*}\!\left[\frac{\tilde L_0(\psi)}{\tilde L_0(\theta^*)}\right] = \prod_{i \in D_0} \int \exp\left\{\int k(x,y) \log \frac{\tilde p_\psi(y)}{\tilde p_{\theta^*}(y)}\, dy\right\} p_{\theta^*}(x)\, dx \overset{(ii)}{\le} \prod_{i \in D_0} \int\!\!\int k(x,y) \frac{\tilde p_\psi(y)}{\tilde p_{\theta^*}(y)}\, dy\; p_{\theta^*}(x)\, dx = \prod_{i \in D_0} \int \frac{\int k(x,y) p_{\theta^*}(x)\, dx}{\tilde p_{\theta^*}(y)}\, \tilde p_\psi(y)\, dy = \prod_{i \in D_0} \int \tilde p_\psi(y)\, dy = 1.$$

Above, step (i) holds because $\tilde\theta_0$ maximizes the smoothed likelihood over $\Theta_0$, and step (ii) follows by Jensen’s inequality. An argument mimicking Eqs. 6 and 7 completes the proof. As a last remark, similar to the unsmoothed case, note that upper bounding the smoothed maximum likelihood under the null also suffices.

Conditional Likelihood for Non-i.i.d. Data.

Our presentation so far has assumed that the data are drawn i.i.d. from some distribution under the null. However, this is not really required (even under the null) and was assumed for expositional simplicity. All that is needed is that we can calculate the likelihood on $D_0$ conditional on $D_1$ (or vice versa). For example, this could be tractable in models involving sampling without replacement from an urn with $M \ge n$ balls. Here $\theta$ could represent the unknown number of balls of different colors. Such hypergeometric sampling schemes result in non-i.i.d. data, but conditional on one subset of the data (for example, how many red, green, and blue balls were sampled from the urn in that subset), one can evaluate the conditional likelihood of the second half of the data and maximize it, rendering it possible to apply our universal tests and confidence sets.

6. Misspecification and Convex Model Classes

There are some natural examples of convex model classes (31, 32), including 1) all mixtures (potentially infinite) of a set of base distributions, 2) distributions with the first moment specified/bounded and possibly other moments bounded (e.g., first moment equals zero, second moment bounded by one), 3) the set of (coordinate-wise) monotonic densities with the same support, 4) unimodal densities with the same mode, 5) densities that are symmetric about the same point, 6) distributions with the same median or multiple quantiles (e.g., median = 0, 0.9 quantile = 2), 7) the set of all $K$-tuples $(P_1, \ldots, P_K)$ of distributions satisfying a fixed partial stochastic ordering (e.g., all triplets $(P_1, P_2, P_3)$ such that $P_1 \preceq P_2$ and $P_1 \preceq P_3$, where $\preceq$ is the usual stochastic ordering), and 8) the set of convex densities with the same support. Some cases like 6) and 7) also result in weakly closed convex sets, as does case 2) for a specified mean. (Several of these examples also apply in discrete settings such as constrained multinomials.)

It is often possible to calculate the MLE over these convex model classes using convex optimization; for example see refs. 33 and 34 for case 7). This renders our universal tests and confidence sets immediately applicable. However, in this special case, it is also possible to construct additional tests, and the universal confidence set has some nontrivial guarantees if the model is misspecified.

Model Misspecification.

Suppose the data come from a distribution $Q$ with density $q \notin \mathcal{P}_\Theta \equiv \{p_\theta\}_{\theta \in \Theta}$, meaning that the model is misspecified and the true distribution does not belong to the considered model. In this case, what does the universal set $C_n$ defined in Eq. 4 contain? We will answer this question when the set of measures/densities $\mathcal{P}_\Theta$ is convex. Define the Kullback–Leibler divergence of $q$ from $\mathcal{P}_\Theta$ as

$$K(q, \mathcal{P}_\Theta) \equiv \inf_{\theta \in \Theta} K(q, p_\theta).$$

Following definition 4.2 in Li’s (31) PhD thesis, a function $p^* \equiv p^*_{q,\Theta}$ is called the reversed information projection (RIPR) of $q$ onto $\mathcal{P}_\Theta$ if for every sequence $p_n$ with $K(q, p_n) \to K(q, \mathcal{P}_\Theta)$, we have $\log p_n \to \log p^*$ in $L_1(Q)$. Theorem 4.3 in ref. 31 proves that $p^*$ exists and is unique, satisfies $K(q, p^*) = K(q, \mathcal{P}_\Theta)$, and

$$\forall\, \theta \in \Theta, \qquad \mathbb{E}_{Y \sim q}\!\left[\frac{p_\theta(Y)}{p^*(Y)}\right] \le 1.$$ [16]

The above statement can be loosely interpreted as “if the data come from $q \notin \mathcal{P}_\Theta$, its RIPR $p^*$ will have higher likelihood than any other model in expectation.” We discuss this condition further at the end of this subsection.

It might be reasonable to ask whether the universal set contains p*. For various technical reasons (detailed in ref. 31) it is not the case, in general, that p* belongs to the collection PΘ. Since the universal set considers densities in PΘ only by construction, it cannot possibly contain p* in general. However, when p* is a density in PΘ, then it is indeed covered by our universal set.

Proposition 7.

Suppose that the data come from $q \notin \mathcal{P}_\Theta$. If $\mathcal{P}_\Theta$ is convex and there exists a density $p^* \in \mathcal{P}_\Theta$ such that $K(q, p^*) = \inf_{\theta \in \Theta} K(q, p_\theta)$, then we have $P_q(p^* \in C_n) \ge 1 - \alpha$.

The proof is short. Examining the proof of Theorem 1, we must simply verify that for each $i \in D_0$, we have

$$\mathbb{E}_q\!\left[\frac{p_{\hat\theta_1}(Y_i)}{p^*(Y_i)}\right] \le 1,$$

which follows from Eq. 16. Here is a heuristic argument for why Eq. 16 holds when $p^* \in \mathcal{P}_\Theta$. For any $\theta \in \Theta$, note that $K(q, \mathcal{P}_\Theta) = K(q, p^*) = \min_{\lambda \in [0,1]} K(q, \lambda p^* + (1-\lambda) p_\theta)$ since $\mathcal{P}_\Theta$ is convex. The Karush–Kuhn–Tucker condition for this optimization problem is that the gradient with respect to $\lambda$ is nonpositive at $\lambda = 1$ (the minimizer). Exchanging the derivative and integral immediately yields Eq. 16. This argument is formalized in ref. 31, chap. 4.

An Alternate Split LRT (RIPR Split LRT).

We return to the well-specified case for the rest of this paper. First note that the fact in Eq. 16 can be rewritten as

$$\forall\, \theta \in \Theta, \qquad \mathbb{E}_{Y \sim p_\theta}\!\left[\frac{q(Y)}{p^*(Y)}\right] \le 1,$$ [17]

which is informally interpreted as “if the data come from $p_\theta$, then any alternative $q \notin \mathcal{P}_\Theta$ will have lower likelihood than its RIPR $p^*$ in expectation.” This motivates the development of an alternate RIPR split LRT to test composite null hypotheses, defined as follows. As before, we divide the data into two parts, $D_0$ and $D_1$, and let $\hat\theta_1 \in \Theta_1$ be any estimator found using only $D_1$. Now, define $p_0^*$ to be the RIPR of $p_{\hat\theta_1}$ onto the null set $\{p_\theta\}_{\theta \in \Theta_0}$. The RIPR split LRT rejects the null if

$$R_n \equiv \prod_{i \in D_0} \frac{p_{\hat\theta_1}(Y_i)}{p_0^*(Y_i)} > 1/\alpha.$$

The main difference from the original MLE split LRT is that earlier we ignored θ^1 and simply calculated the MLE θ^0 under the null based on D0.

Proposition 8.

If $\{p_\theta\}_{\theta \in \Theta_0}$ is a convex set of densities, then $\sup_{\theta_0 \in \Theta_0} P_{\theta_0}(R_n > 1/\alpha) \le \alpha$.

The fact that $p_0^*$ is potentially not an element of $\{p_\theta\}_{\theta \in \Theta_0}$ does not matter here. The validity of the test follows exactly the same logic as the MLE split LRT, observing that Eq. 17 implies that for any true $\theta^* \in \Theta_0$, we have

$$\mathbb{E}_{p_{\theta^*}}\!\left[\frac{p_{\hat\theta_1}(Y_i)}{p_0^*(Y_i)}\right] \le 1.$$

Without sample splitting and with a fixed alternative distribution, the RIPR LRT has been recently studied (35). When PΘ is convex and the RIPR split LRT is implementable, meaning that it is computationally feasible to find the RIPR or evaluate its likelihood, then this test can be more powerful than the MLE split LRT. Specifically, if the RIPR is actually a density in the null set, then

$$R_n = \prod_{i \in D_0} \frac{p_{\hat\theta_1}(Y_i)}{p_0^*(Y_i)} \ge \prod_{i \in D_0} \frac{p_{\hat\theta_1}(Y_i)}{p_{\hat\theta_0}(Y_i)} = U_n,$$

since θ^0 maximizes the denominator among null densities. Because of the restriction to convex sets, and since there exist many more subroutines to calculate the MLE over a set than to find the RIPR, the MLE split LRT is more broadly applicable than the RIPR split LRT.

7. Anytime P Values and Confidence Sequences

Just like the sequential likelihood-ratio test (36) extends the LRT, the split LRT has a simple sequential extension. Similarly, the confidence set can be extended to a “confidence sequence” (37).

Suppose the split LRT failed to reject the null. Then we are allowed to collect more data and update the test statistic (in a particular fashion) and check if the updated statistic crosses 1/α. If it does not, we can further collect more data and reupdate the statistic, and this process can be repeated indefinitely. Importantly we do not need any correction for repeated testing; this is primarily because the statistic is upper bounded by a nonnegative martingale. We describe the procedure next in the case when each additional dataset is of size one, but the same idea applies when we collect data in groups.

The Running MLE Sequential LRT.

Consider the following, more standard, sequential testing/estimation setup. We observe an i.i.d. sequence $Y_1, Y_2, \ldots$ from $P_{\theta^*}$. We want to test the hypothesis in Eq. 8. Let $\hat\theta_{1,t-1}$ be any nonanticipating estimator based on the first $t-1$ samples, for example the MLE, $\arg\max_{\theta \in \Theta_1} \prod_{i=1}^{t-1} p_\theta(Y_i)$, or a regularized version of it to avoid misbehavior at small sample sizes. Denote the null MLE as

$$\hat\theta_{0,t} = \arg\max_{\theta \in \Theta_0} \prod_{i=1}^{t} p_\theta(Y_i).$$

At any time $t$, reject the null and stop if

$$M_t \equiv \frac{\prod_{i=1}^{t} p_{\hat\theta_{1,i-1}}(Y_i)}{\prod_{i=1}^{t} p_{\hat\theta_{0,t}}(Y_i)} > 1/\alpha.$$

This test is computationally expensive: We must calculate $\hat\theta_{1,t-1}$ and $\hat\theta_{0,t}$ at each step. In some cases, these may be quick to calculate by warm starting from $\hat\theta_{1,t-2}$ and $\hat\theta_{0,t-1}$. For example, the updates can be done in constant time for exponential families, since the MLE is often a simple function of the sufficient statistics. However, even in these cases, the denominator takes time $O(t)$ to recompute at step $t$.
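For a point null the denominator is free, which makes the procedure easy to sketch; the snippet below (unit-variance Gaussian, point null $\theta = 0$, data drawn from an illustrative alternative) runs the sequential test until it stops.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, theta_true = 0.05, 0.4            # data actually come from the alternative
log_threshold = np.log(1 / alpha)

# Sequential test of the point null H0: theta = 0 for a unit-variance Gaussian.
# A composite null would instead re-maximize the denominator over Theta_0 at each t.
running_sum, t, log_M = 0.0, 0, 0.0
while True:
    t += 1
    theta_prev = running_sum / (t - 1) if t > 1 else 0.0   # nonanticipating estimate
    y = rng.normal(theta_true, 1.0)
    log_M += -0.5 * (y - theta_prev) ** 2 + 0.5 * y ** 2   # multiply in p_prev(y)/p_0(y)
    running_sum += y
    if log_M > log_threshold:
        print("rejected H0 at time t =", t)
        break
    if t >= 100000:
        print("no rejection by t =", t)
        break
```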

The following result shows that with probability at least 1α, this test will never stop under the null. Let τθ denote the stopping time when the data are drawn from Pθ, which is finite only if we stop and reject the null.

Theorem 9.

The running MLE LRT has type I error at most $\alpha$, meaning that $\sup_{\theta^* \in \Theta_0} P_{\theta^*}(\tau_{\theta^*} < \infty) \le \alpha$.

The proof involves the simple observation that under the null, $M_t$ is upper bounded by a nonnegative martingale $L_t$ with initial value one. Specifically, define the (oracle) process starting with $L_0 \equiv 1$ and

$$L_t \equiv \frac{\prod_{i=1}^{t} p_{\hat\theta_{1,i-1}}(Y_i)}{\prod_{i=1}^{t} p_{\theta^*}(Y_i)} = L_{t-1} \cdot \frac{p_{\hat\theta_{1,t-1}}(Y_t)}{p_{\theta^*}(Y_t)}.$$ [18]

Note that under the null, we have $M_t \le L_t$ because $\hat\theta_{0,t}$ and $\theta^*$ both belong to $\Theta_0$, but the former maximizes the null likelihood (the denominator). Further, it is easy to verify that $L_t$ is a nonnegative martingale with respect to the natural filtration $\mathcal{F}_t = \sigma(Y_1, \ldots, Y_t)$. Indeed,

$$\mathbb{E}_{\theta^*}[L_t \mid \mathcal{F}_{t-1}] = \mathbb{E}_{\theta^*}\!\left[\frac{\prod_{i=1}^{t} p_{\hat\theta_{1,i-1}}(Y_i)}{\prod_{i=1}^{t} p_{\theta^*}(Y_i)} \,\middle|\, \mathcal{F}_{t-1}\right] = L_{t-1}\, \mathbb{E}_{\theta^*}\!\left[\frac{p_{\hat\theta_{1,t-1}}(Y_t)}{p_{\theta^*}(Y_t)} \,\middle|\, \mathcal{F}_{t-1}\right] = L_{t-1},$$

where the last equality mimics Eq. 6. To complete the proof, we note that the type I error of the running MLE LRT is simply bounded as

$$P_{\theta^*}(\exists t \in \mathbb{N} : M_t > 1/\alpha) \le P_{\theta^*}(\exists t \in \mathbb{N} : L_t > 1/\alpha) \overset{(i)}{\le} \mathbb{E}_{\theta^*}[L_0]\, \alpha = \alpha,$$

where step (i) follows by Ville’s inequality (38, 39), a time-uniform version of Markov’s inequality for nonnegative supermartingales.

Naturally, this test does not have to start at $t=1$ when only one sample is available, meaning that we can set $M_0 = M_1 = \cdots = M_{t_0} = 1$ for the first $t_0$ steps and then begin the updates. Similarly, $t$ need not represent the time at which the $t$th sample was observed; it can just represent the $t$th recalculation of the estimators (there may be multiple samples observed between $t-1$ and $t$).

Anytime-Valid P Values.

We can also get a P value that is uniformly valid over time. Specifically, both $p_t = 1/M_t$ and $\bar p_t = \min_{s \le t} 1/M_s$ may serve as P values.

Theorem 10.

For any random time $T$, not necessarily a stopping time, $\sup_{\theta^* \in \Theta_0} P_{\theta^*}(\bar p_T \le x) \le x$ for $x \in [0,1]$.

The aforementioned property is equivalent to the statement that under the null $P(\exists t \in \mathbb{N} : \bar p_t \le \alpha) \le \alpha$, and its proof follows by substitution immediately from the previous argument. Naturally $\bar p_t \le p_t$, but from the perspective of designing a level $\alpha$ test they are equivalent, because the first time that $p_t$ falls below $\alpha$ is also the first time that $\bar p_t$ falls below $\alpha$. The term “anytime-valid” is used because, unlike typical P values, these are valid at (data-dependent) stopping times or even random times chosen post hoc. Hence, inference is robust to “peeking,” optional stopping, and optional continuation of experiments. Such anytime P values can be inverted to yield confidence sequences, as described below.

Confidence Sequences.

A confidence sequence for θ* is an infinite sequence of confidence intervals that are all simultaneously valid. Such confidence intervals are valid at arbitrary stopping times and also at other random data-dependent times that are chosen post hoc. In the same setup as above, but without requiring a null set Θ0, define the running MLE likelihood-ratio process

$$R_t(\theta) \equiv \frac{\prod_{i=1}^{t} p_{\hat\theta_{1,i-1}}(Y_i)}{\prod_{i=1}^{t} p_\theta(Y_i)}.$$

Then, a confidence sequence for $\theta^*$ is given by

$$C_t \equiv \{\theta : R_t(\theta) \le 1/\alpha\}.$$

In fact, the running intersection $\bar C_t = \bigcap_{s \le t} C_s$ is also a confidence sequence; note that $\bar C_t \subseteq C_t$.

Theorem 11.

$C_t$ and $\bar C_t$ are confidence sequences for $\theta^*$, meaning that $P_{\theta^*}(\exists t \in \mathbb{N} : \theta^* \notin \bar C_t) \le \alpha$. Equivalently, $P_{\theta^*}(\theta^* \in C_\tau) \ge 1 - \alpha$ for any stopping time $\tau$, and also $P_{\theta^*}(\theta^* \in C_T) \ge 1 - \alpha$ for any arbitrary random time $T$.

The proof is straightforward. First, note that $\theta^* \notin \bar C_t$ for some $t$ if and only if $\theta^* \notin C_t$ for some $t$. Hence,

$$P_{\theta^*}(\exists t \in \mathbb{N} : \theta^* \notin C_t) = P_{\theta^*}(\exists t \in \mathbb{N} : R_t(\theta^*) > 1/\alpha) \le \alpha,$$

where the last step uses, as before, Ville’s inequality for the martingale $R_t(\theta^*) = L_t$ from Eq. 18. The fact that the other two statements in Theorem 11 are equivalent to the first one follows from recent work (40).
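A grid-based sketch of the resulting confidence sequence for a Gaussian mean (the grid, horizon, and data-generating value are illustrative assumptions) is given below.

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, theta_true, T = 0.1, 0.7, 2000
grid = np.linspace(-2.0, 3.0, 1001)

log_num = 0.0                      # running sum of log p_{theta_hat_{1,i-1}}(Y_i)
log_den = np.zeros_like(grid)      # running sums of log p_theta(Y_i) over the grid
lo, hi = grid[0], grid[-1]         # running-intersection interval
running_sum = 0.0

for t in range(1, T + 1):
    theta_prev = running_sum / (t - 1) if t > 1 else 0.0
    y = rng.normal(theta_true, 1.0)
    log_num += -0.5 * (y - theta_prev) ** 2
    log_den += -0.5 * (y - grid) ** 2
    running_sum += y
    in_Ct = (log_num - log_den) <= np.log(1 / alpha)       # C_t on the grid
    if in_Ct.any():
        lo, hi = max(lo, grid[in_Ct].min()), min(hi, grid[in_Ct].max())
    if t in (10, 100, 1000, 2000):
        print("t = %4d, running intersection ~ [%.3f, %.3f]" % (t, lo, hi))
```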

Duality.

It is worth remarking that confidence sequences are dual to anytime P values, just like confidence intervals are dual to standard P values, in the sense that a $(1-\alpha)$ confidence sequence can be formed by inverting a family of level $\alpha$ sequential tests (each testing a different point in the space), and a level $\alpha$ sequential test for a composite null set $\Theta_0$ can be obtained by checking whether the $(1-\alpha)$ confidence sequence intersects the null set $\Theta_0$.

In fact, our constructions of $p_t$ and $C_t$ (without running minimum/intersection) obey the same property: $p_t < \alpha$ only if $C_t \cap \Theta_0 = \emptyset$, and the reverse implication follows if $\Theta_0$ is closed. To see the forward implication, assume that there exists some element $\theta \in C_t \cap \Theta_0$. Since $\theta \in C_t$, we have $R_t(\theta) \le 1/\alpha$. Since $\theta \in \Theta_0$, we have $\inf_{\theta^* \in \Theta_0} R_t(\theta^*) \le 1/\alpha$. This last condition can be restated as $M_t \le 1/\alpha$, which means that $p_t \ge \alpha$.

It is also possible to obtain an anytime P value from a family of confidence sequences at different $\alpha$, by defining $p_t$ as the smallest $\alpha$ for which $C_t \equiv C_t(\alpha)$ intersects $\Theta_0$.

Extensions.

All of the extensions from Section 5 extend immediately to the sequential setting. One can handle nuisance parameters using profile likelihoods; this for example leads to sequential t tests (for the Gaussian family, with the variance as a nuisance parameter), which also yield confidence sequences for the Gaussian mean with unknown variance. Non-i.i.d. data, such as in sampling without replacement, can be handled using conditional likelihoods, and robustness can be increased with powered likelihoods. In these situations, the corresponding underlying process Lt may not be a martingale, but a supermartingale. Also, as before, we may also use upper bounds on the maximum likelihood at each step (perhaps minimizing convex relaxations of the negative log-likelihood) or smooth the likelihood if needed.

Such confidence sequences have been developed under very general nonparametric, multivariate, matrix, and continuous-time settings using generalizations of the aforementioned supermartingale technique (39–41). The connections between anytime-valid P values, e values, safe tests, peeking, confidence sequences, and the properties of optional stopping and continuation have been explored recently (35, 40, 42, 43). The connection to the present work is that when run sequentially, our universal (MLE or RIPR) split LRT yields an anytime-valid P value, an e value, and a safe test, which can be inverted to form universal confidence sequences and are valid under optional stopping and continuation; this is simply because the underlying process of interest is bounded by a nonnegative (super)martingale. This line of research was begun over 50 y ago by Darling and Robbins (37), Robbins (44), Robbins and Siegmund (45), and Lai (46, 47). In fact, for testing point nulls, the running MLE (or nonanticipating) martingale was suggested in passing by Wald (ref. 48, equation 10:10) and analyzed in depth in refs. 45 and 49, where connections were shown to the mixture sequential probability-ratio test. These ideas have been utilized in changepoint detection for both point nulls (50) and composite nulls (51).

8. Conclusion

Inference based on the split likelihood-ratio statistic (and variants) leads to simple tests and confidence sets with finite-sample guarantees. Our methods are most useful in problems where standard asymptotic methods are difficult/impossible to apply, such as complex composite null testing problems or nonparametric confidence sets. Going forward, we intend to run simulations in a variety of models to study the power of the test and the size of the confidence sets and study their optimality in special cases. We do not expect the test to be rate optimal in all cases, but it might have analogous properties to the generalized LRT. It would also be interesting to extend these methods (like the profile-likelihood variant) to semiparametric problems where there are a finite-dimensional parameter of interest and an infinite-dimensional nuisance parameter.

9. Data Availability

Due to space constraints, we have relegated technical details of the proofs of Theorems 5 and 6 to SI Appendix. There are no additional data, protocols, or code associated with this paper.

Supplementary Material

Supplementary File
pnas.1922664117.sapp.pdf (244.1KB, pdf)

Acknowledgments

We thank Caroline Uhler and Arun K. Kuchibhotla for references to open problems in shape-constrained inference and Ryan Tibshirani for suggesting the relaxed-likelihood idea. We are grateful to Bin Yu, Hue Wang and Marco Molinaro for helpful feedback which motivated parts of Section 6. We thank the reviewers and Dennis Boos for helpful suggestions and Len Stefanski for pointing us to work on smoothed likelihoods.

Footnotes

Competing interest statement: L.W. and R.T. are coauthors on a manuscript written in 2015 and published in 2018.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1922664117/-/DCSupplemental.

References

  • 1.Drton M., Likelihood ratio tests and singularities. Ann. Stat. 37, 979–1012 (2009). [Google Scholar]
  • 2.Redner R., Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Ann. Stat. 9, 225–228 (1981). [Google Scholar]
  • 3.Dacunha-Castelle D., Gassiat E., Testing in locally conic models, and application to mixture models. ESAIM Probab. Stat. 1, 285–317 (1997). [Google Scholar]
  • 4.Chen J., Li P., Hypothesis test for normal mixture models: The EM approach. Ann. Stat. 37, 2523–2542 (2009). [Google Scholar]
  • 5.McLachlan G. J., On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Appl. Stat. 36, 318–324 (1987). [Google Scholar]
  • 6.Chakravarti P., Balakrishnan S., Wasserman L., Gaussian mixture clustering using relative tests of fit. arXiv:1910.02566 (7 October 2019).
  • 7.Van der Vaart A. W., Asymptotic Statistics (Cambridge University Press, 2000), vol. 3. [Google Scholar]
  • 8.Wong W. H., Shen X., Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann. Stat. 23, 339–362 (1995). [Google Scholar]
  • 9.Balakrishnan S., Wainwright M. J., Yu B., Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Stat. 45, 77–120 (2017). [Google Scholar]
  • 10.Xu J., Hsu D. J., Maleki A., “Global analysis of expectation maximization for mixtures of two Gaussians” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2016), vol. 29, pp. 2676–2684. [Google Scholar]
  • 11.Jin C., Zhang Y., Balakrishnan S., Wainwright M. J., Jordan M. I., “Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2016), vol. 29, pp. 4116–4124. [Google Scholar]
  • 12.Shen X., Wong W. H., Convergence rate of sieve estimates. Ann. Stat. 22, 580–615 (1994). [Google Scholar]
  • 13.Cule M., Samworth R., Stewart M., Maximum likelihood estimation of a multi-dimensional log-concave density. J. R. Stat. Soc. B 72, 545–607 (2010). [Google Scholar]
  • 14.Axelrod B., Diakonikolas I., Stewart A., Sidiropoulos A., Valiant G., “A polynomial time algorithm for log-concave maximum likelihood via locally exponential families” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2019), vol. 32, pp. 7723–7735. [Google Scholar]
  • 15.Silverman B. W., Density Estimation for Statistics and Data Analysis (Routledge, 2018). [Google Scholar]
  • 16.Karlin S., Rinott Y., Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. J. Multivariate Anal. 10, 467–498 (1980). [Google Scholar]
  • 17.Lauritzen S., Uhler C., Zwiernik P., Maximum likelihood estimation in Gaussian models under total positivity. Ann. Stat. 47, 1835–1863 (2019). [Google Scholar]
  • 18.Lauritzen S., Uhler C., Zwiernik P., Total positivity in structured binary distributions. arXiv:1905.00516 (1 May 2019).
  • 19.Robeva E., Sturmfels B., Tran N., Uhler C., Maximum likelihood estimation for totally positive log-concave densities. arXiv:1806.10120 (26 June 2018).
  • 20.Guo F., Richardson T. S., On testing marginal versus conditional independence. arXiv:1906.01850 (5 June 2019).
  • 21.Meinshausen N., Meier L., Bühlmann P., P-values for high-dimensional regression. J. Am. Stat. Assoc. 104, 1671–1681 (2009). [Google Scholar]
  • 22.Amini A. A., Levina E., On semidefinite relaxations for the block model. Ann. Stat. 46, 149–179 (2018). [Google Scholar]
  • 23.Dahl J., Fleury B. H., Vandenberghe L., “Approximate maximum-likelihood estimation using semidefinite programming“ in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE, 2003), vol. 6, pp. VI–721. [Google Scholar]
  • 24.Bandeira A. S., Boumal N., Singer A., Tightness of the maximum likelihood semidefinite relaxation for angular synchronization. Math. Program. 163, 145–167 (2017). [Google Scholar]
  • 25.Royall R., Tsou T. S., Interpreting statistical evidence by using imperfect models: Robust adjusted likelihood functions. J. R. Stat. Soc. B 65, 391–404 (2003). [Google Scholar]
  • 26.Grünwald P., “The safe Bayesian” in International Conference on Algorithmic Learning Theory (Springer, Berlin, Germany, 2012), pp. 169–183. [Google Scholar]
  • 27.Holmes C., Walker S., Assigning a value to a power likelihood in a general Bayesian model. Biometrika 104, 497–503 (2017). [Google Scholar]
  • 28.Grünwald P., Van Ommen T., Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 12, 1069–1103 (2017). [Google Scholar]
  • 29.Miller J. W., Dunson D. B., Robust Bayesian inference via coarsening. J. Am. Stat. Assoc. 114, 1113–1125 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Seo B., Lindsay B. G., A universally consistent modification of maximum likelihood. Stat. Sin. 23, 467–487 (2013). [Google Scholar]
  • 31.Li Q. J., “Estimation of mixture models,” PhD thesis, Yale University, New Haven, CT (1999). [Google Scholar]
  • 32.Hoff P. D., Nonparametric estimation of convex models via mixtures. Ann. Stat. 31, 174–200 (2003). [Google Scholar]
  • 33.Brunk H., Franck W., Hanson D., Hogg R., Maximum likelihood estimation of the distributions of two stochastically ordered random variables. J. Am. Stat. Assoc. 61, 1067–1080 (1966). [Google Scholar]
  • 34.Dykstra R. L., Feltz C. J., Nonparametric maximum likelihood estimation of survival functions with a general stochastic ordering and its dual. Biometrika 76, 331–341 (1989). [Google Scholar]
  • 35.Grunwald P., de Heide R., Koolen W., Safe testing. arXiv:1906.07801 (18 June 2019).
  • 36.Wald A., Sequential tests of statistical hypotheses. Ann. Math. Stat. 16, 117–186 (1945). [Google Scholar]
  • 37.Darling D., Robbins H., Confidence sequences for mean, variance, and median. Proc. Natl. Acad. Sci. U.S.A. 58, 66–68 (1967). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Ville J., Étude Critique de la Notion de Collectif (Gauthier-Villars, Paris, France, 1939). [Google Scholar]
  • 39.Howard S. R., Ramdas A., McAuliffe J., Sekhon J., Time-uniform Chernoff bounds via nonnegative supermartingales. Probab. Surv. 17, 257–317 (2020). [Google Scholar]
  • 40.Howard S. R., Ramdas A., McAuliffe J., Sekhon J., Uniform, nonparametric, non-asymptotic confidence sequences. arXiv:1810.08240 (18 October 2018).
  • 41.Howard S. R., Ramdas A., Sequential estimation of quantiles with applications to A/B-testing and best-arm identification. arXiv:1906.09712 (24 June 2019).
  • 42.Johari R., Koomen P., Pekelis L., Walsh D., Peeking at A/B Tests: Why It Matters, and What to Do about It (ACM Press, 2017), pp. 1517–1525. [Google Scholar]
  • 43.Shafer G., Shen A., Vereshchagin N., Vovk V., Test martingales, Bayes factors and p-values. Stat. Sci. 26, 84–101 (2011). [Google Scholar]
  • 44.Robbins H., Statistical methods related to the law of the iterated logarithm. Ann. Math. Stat. 41, 1397–1409 (1970). [Google Scholar]
  • 45.Robbins H., Siegmund D., “A class of stopping rules for testing parametric hypotheses” in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, CA, 1970–1972), vol. 4, pp. 37–41. (1972). [Google Scholar]
  • 46.Lai T. L., On confidence sequences. Ann. Stat. 4, 265–280 (1976). [Google Scholar]
  • 47.Lai T. L., Boundary crossing probabilities for sample sums and confidence sequences. Ann. Probab. 4, 299–312 (1976). [Google Scholar]
  • 48.Wald A., Sequential Analysis (Courier Corporation, 1947). [Google Scholar]
  • 49.Robbins H., Siegmund D., The expected sample size of some tests of power one. Ann. Stat. 2, 415–436 (1974). [Google Scholar]
  • 50.Lorden G., Pollak M., Nonanticipating estimation applied to sequential analysis and changepoint detection. Ann. Stat. 33, 1422–1454 (2005). [Google Scholar]
  • 51.Vexler A., Martingale type statistics applied to change points detection. Commun. Stat. Theor. Methods 37, 1207–1224 (2008). [Google Scholar]
