Significance
Most statistical methods rely on certain mathematical conditions, known as regularity assumptions, to ensure their validity. Without these conditions, statistical quantities like P values and confidence intervals might not be valid. In this paper we give a surprisingly simple method for producing statistical significance statements without any regularity conditions. The resulting hypothesis tests can be used for any parametric model and for several nonparametric models.
Keywords: likelihood, testing, irregular models, confidence sequence
Abstract
We propose a general method for constructing confidence sets and hypothesis tests that have finite-sample guarantees without regularity conditions. We refer to such procedures as “universal.” The method is very simple and is based on a modified version of the usual likelihood-ratio statistic that we call “the split likelihood-ratio test” (split LRT) statistic. The (limiting) null distribution of the classical likelihood-ratio statistic is often intractable when used to test composite null hypotheses in irregular statistical models. Our method is especially appealing for statistical inference in these complex setups. The method we suggest works for any parametric model and also for some nonparametric models, as long as computing a maximum-likelihood estimator (MLE) is feasible under the null. Canonical examples arise in mixture modeling and shape-constrained inference, for which constructing tests and confidence sets has been notoriously difficult. We also develop various extensions of our basic methods. We show that in settings when computing the MLE is hard, for the purpose of constructing valid tests and intervals, it is sufficient to upper bound the maximum likelihood. We investigate some conditions under which our methods yield valid inferences under model misspecification. Further, the split LRT can be used with profile likelihoods to deal with nuisance parameters, and it can also be run sequentially to yield anytime-valid P values and confidence sequences. Finally, when combined with the method of sieves, it can be used to perform model selection with nested model classes.
The foundations of statistics are built on a variety of generally applicable principles for parametric estimation and inference. In parametric statistical models, the likelihood-ratio test and confidence intervals obtained from asymptotically Gaussian estimators are the workhorse inferential tools for constructing hypothesis tests and confidence intervals. Often, the validity of these methods relies on large sample asymptotic theory and requires that the statistical model satisfy certain regularity conditions; see Section 2 for precise definitions. When these conditions do not hold, there is no general method for statistical inference, and these settings are typically considered in an ad hoc manner. Here, we introduce a universal method which yields tests and confidence sets for any statistical model and has finite-sample guarantees.
We begin with some terminology. A parametric statistical model is a collection of distributions $\{P_\theta : \theta \in \Theta\}$ for an arbitrary set $\Theta$. When the aforementioned regularity conditions hold, there are many methods for inference. For example, if $\hat\theta$ denotes the maximum-likelihood estimator, the set
$A_n = \left\{\theta \in \Theta : 2\log\frac{\mathcal{L}(\hat\theta)}{\mathcal{L}(\theta)} \le \chi^2_{d,\alpha}\right\}$ [1]
is the likelihood-ratio confidence set, where $\chi^2_{d,\alpha}$ is the upper $\alpha$ quantile of a $\chi^2_d$ distribution, $\mathcal{L}$ is the likelihood function, and $\hat\theta$ is the maximum-likelihood estimator (MLE). It satisfies the asymptotic coverage guarantee
$\mathbb{P}_{\theta^*}\!\left(\theta^* \in A_n\right) \to 1 - \alpha$
as $n \to \infty$, where $P_{\theta^*}$ denotes the unknown true data-generating distribution.
Constructing tests and confidence intervals for irregular models—where the regularity conditions do not hold—is very difficult (1). An example is mixture models. In this case we observe $Y_1, \ldots, Y_n \sim P$ and we want to test
$H_0 : P \in \mathcal{M}_k \quad \text{versus} \quad H_1 : P \in \mathcal{M}_{k+1},$ [2]
where $\mathcal{M}_k$ denotes the set of mixtures of $k$ Gaussians in $\mathbb{R}^d$, with an appropriately restricted parameter space (see for instance ref. 2). Finding a test that provably controls the type I error at a given level $\alpha$ has been elusive. A natural candidate is to base the test on the likelihood-ratio statistic but this turns out to have an intractable limiting distribution (3). As we discuss further in Section 3, developing practical, simple tests for this pair of hypotheses is an active area of research (refs. 4–6 and references therein). However, it is possible that we may be able to compute an MLE using variants of the expectation–maximization (EM) algorithm. In this paper, we show that there is a remarkably simple test based on the MLE with guaranteed finite-sample control of the type I error. Similarly, we construct a confidence set for the parameters of a mixture model with guaranteed finite-sample coverage. These tests and confidence sets can in fact be used for any model. In regular statistical models (those for which the usual LRT is well behaved), our methods may not be optimal, although we do not yet fully understand how close to optimal they are beyond special cases (uniform, Gaussian). Our test is most useful in irregular (or singular) models for which valid tests are not known or require many assumptions. Going beyond parametric models, we show that our methods can be used for several nonparametric models as well and have a natural sequential analog.
1. Universal Inference
Let $Y_1, \ldots, Y_n$ be an independent and identically distributed (i.i.d.) sample from a distribution $P_{\theta^*}$ which belongs to a collection $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. Note that $\theta^*$ denotes the true value of the parameter. Assume that each distribution $P_\theta$ has a density $p_\theta$ with respect to some underlying measure $\mu$ (for instance, the Lebesgue or counting measure).
A Universal Confidence Set.
We construct a confidence set for $\theta^*$ by first splitting the data into two groups $D_0$ and $D_1$. For simplicity, we take each group to be of the same size, but this is not necessary. Let $\hat\theta_1$ be any estimator constructed from $D_1$; this can be the MLE, a Bayes estimator that utilizes prior knowledge, a robust estimator, etc. Let
$\mathcal{L}_0(\theta) := \prod_{i \in D_0} p_\theta(Y_i)$
denote the likelihood function based on $D_0$. We define the split likelihood-ratio statistic (split LRS) as
$U_n(\theta) := \frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\theta)}.$ [3]
Then, the universal confidence set is
$C_n := \left\{\theta \in \Theta : U_n(\theta) \le \frac{1}{\alpha}\right\}.$ [4]
Similarly, define the cross-fit LRS as
$S_n(\theta) := \frac{U_n(\theta) + U_n^{\mathrm{swap}}(\theta)}{2},$ [5]
where $U_n^{\mathrm{swap}}(\theta)$ is formed by calculating $U_n(\theta)$ after swapping the roles of $D_0$ and $D_1$. We can also define a cross-fit confidence set with $S_n(\theta)$ in place of $U_n(\theta)$ in Eq. 4.
Theorem 1.
$C_n$ is a finite-sample valid $1-\alpha$ confidence set for $\theta^*$, meaning that $\mathbb{P}_{\theta^*}\!\left(\theta^* \in C_n\right) \ge 1 - \alpha$ for all $\theta^* \in \Theta$.
If we did not split the data and $\hat\theta_1$ were replaced by the MLE computed from all of the data, then $U_n(\theta)$ would be the usual likelihood-ratio statistic and we would typically approximate its distribution using an asymptotic argument. For example, as mentioned earlier, in regular models, 2 times the log-likelihood-ratio statistic has, asymptotically, a $\chi^2$ distribution. But, in irregular models this strategy can fail. Indeed, finding or approximating the distribution of the likelihood-ratio statistic is highly nontrivial in irregular models. The split LRS avoids these complications.
Now we explain why $C_n$ has coverage at least $1-\alpha$, as claimed by Theorem 1. We prove it for the version using $U_n$, but the proof for $S_n$ is identical. Consider any fixed $\bar\theta \in \Theta$ and let $\mathcal{Y}$ denote the support of $P_{\theta^*}$. Then,
$\mathbb{E}_{\theta^*}\!\left[\frac{\mathcal{L}_0(\bar\theta)}{\mathcal{L}_0(\theta^*)}\right] = \int_{\mathcal{Y}^{n/2}} \prod_{i \in D_0} \frac{p_{\bar\theta}(y_i)}{p_{\theta^*}(y_i)} \prod_{i \in D_0} p_{\theta^*}(y_i)\, d\mu(y_i) = \int_{\mathcal{Y}^{n/2}} \prod_{i \in D_0} p_{\bar\theta}(y_i)\, d\mu(y_i) \le 1.$
Since $\hat\theta_1$ is fixed when we condition on $D_1$, we have
$\mathbb{E}_{\theta^*}\!\left[U_n(\theta^*) \mid D_1\right] = \mathbb{E}_{\theta^*}\!\left[\frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\theta^*)} \,\Big|\, D_1\right] \le 1.$ [6]
Now, using Markov’s inequality,
$\mathbb{P}_{\theta^*}\!\left(\theta^* \notin C_n\right) = \mathbb{P}_{\theta^*}\!\left(U_n(\theta^*) > \frac{1}{\alpha}\right) \le \alpha\, \mathbb{E}_{\theta^*}\!\left[U_n(\theta^*)\right] = \alpha\, \mathbb{E}_{\theta^*}\!\left[\mathbb{E}_{\theta^*}\!\left[U_n(\theta^*) \mid D_1\right]\right] \le \alpha.$ [7]
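To make the construction concrete, here is a minimal sketch (our own illustration in Python, not code from the paper) that computes the universal confidence set of Eq. 4 for the mean of a $N(\theta, 1)$ model. The grid of candidate values, the sample size, and the choice of the sample mean of $D_1$ as $\hat\theta_1$ are all illustrative assumptions.

```python
import numpy as np

def universal_confidence_set(y, alpha=0.05, grid=None, rng=None):
    """Universal confidence set (Eq. 4) for the mean of a N(theta, 1) model.

    The data are split into D0 and D1; theta_hat_1 is the sample mean of D1,
    and the split LRS U_n(theta) = L0(theta_hat_1) / L0(theta) is evaluated
    on a grid of candidate theta values.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))
    d0, d1 = y[idx[: len(y) // 2]], y[idx[len(y) // 2 :]]

    theta_hat_1 = d1.mean()                      # any estimator built from D1
    if grid is None:
        grid = np.linspace(y.mean() - 3, y.mean() + 3, 2001)

    def log_lik_0(theta):                        # log L0(theta), Gaussian with unit variance
        return -0.5 * np.sum((d0 - theta) ** 2)

    log_u = log_lik_0(theta_hat_1) - np.array([log_lik_0(t) for t in grid])
    return grid[log_u <= np.log(1 / alpha)]      # {theta : U_n(theta) <= 1/alpha}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(loc=1.0, scale=1.0, size=200)
    cs = universal_confidence_set(y, alpha=0.05, rng=rng)
    print("universal set approx. [%.3f, %.3f]" % (cs.min(), cs.max()))
```

Because the statistic only requires evaluating likelihoods, the same template applies verbatim to any model for which a likelihood can be computed.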
Remark 2:
The parametric setup adopted above generalizes easily to nonparametric settings as long as we can calculate a likelihood. For a collection of densities $\mathcal{P}$, and a true density $p^* \in \mathcal{P}$, suppose we use $D_1$ to identify an estimate $\hat p_1$ and $D_0$ to calculate
$T_n(p) := \prod_{i \in D_0} \frac{\hat p_1(Y_i)}{p(Y_i)}.$
We then define $C_n := \{p \in \mathcal{P} : T_n(p) \le 1/\alpha\}$, and our previous argument ensures that $\mathbb{P}_{p^*}(p^* \in C_n) \ge 1 - \alpha$.
A Universal Hypothesis Test.
Now we turn to hypothesis testing. Let $\Theta_0 \subset \Theta$ be a possibly composite null set and consider testing
$H_0 : \theta^* \in \Theta_0 \quad \text{versus} \quad H_1 : \theta^* \notin \Theta_0.$ [8]
The alternative above can be replaced by $\theta^* \in \Theta_1$ for any $\Theta_1 \subseteq \Theta \setminus \Theta_0$ or by $\theta^* \in \Theta$. One way to test this hypothesis is based on the universal confidence set in Eq. 4. We simply reject the null hypothesis if $C_n \cap \Theta_0 = \emptyset$. It is straightforward to see that if this test makes a type I error, then the universal confidence set must fail to cover $\theta^*$, and so the type I error of this test is at most $\alpha$.
We present an alternative method that is often computationally (and possibly statistically) more attractive. Let $\hat\theta_1$ be any estimator constructed from $D_1$, and let
$\hat\theta_0 := \operatorname*{arg\,max}_{\theta \in \Theta_0} \mathcal{L}_0(\theta)$
be the MLE under $\Theta_0$ constructed from $D_0$. Then the universal test, which we call the split likelihood-ratio test (split LRT), is defined as follows: reject $H_0$ if
$U_n := \frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\hat\theta_0)} > \frac{1}{\alpha}.$ [9]
Similarly, we can define the cross-fit LRT as
the test that rejects $H_0$ when
$S_n := \frac{U_n + U_n^{\mathrm{swap}}}{2} > \frac{1}{\alpha},$ [10]
where, as before, $U_n^{\mathrm{swap}}$ is calculated like $U_n$ after swapping the roles of $D_0$ and $D_1$.
Theorem 3.
The split and cross-fit LRTs control the type I error at $\alpha$; i.e., $\sup_{\theta^* \in \Theta_0} \mathbb{P}_{\theta^*}(\text{reject } H_0) \le \alpha$.
The proof is straightforward. We prove it for the split LRT, but once again the cross-fit proof is identical. Suppose that $H_0$ is true and $\theta^* \in \Theta_0$ is the true parameter. By Markov's inequality, the type I error is
$\mathbb{P}_{\theta^*}\!\left(U_n > \frac{1}{\alpha}\right) \le \alpha\, \mathbb{E}_{\theta^*}[U_n] = \alpha\, \mathbb{E}_{\theta^*}\!\left[\frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\hat\theta_0)}\right] \overset{(i)}{\le} \alpha\, \mathbb{E}_{\theta^*}\!\left[\frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\theta^*)}\right] \overset{(ii)}{\le} \alpha.$
Above, inequality (i) uses the fact that $\mathcal{L}_0(\hat\theta_0) \ge \mathcal{L}_0(\theta^*)$, which is true when $\hat\theta_0$ is the MLE over $\Theta_0$ (since $\theta^* \in \Theta_0$), and inequality (ii) follows by conditioning on $D_1$ as argued earlier in Eq. 7.
Remark 4.
We may drop the use of $\Theta$ above and extend the split LRT to a general nonparametric setup. Both tests can be used to test any null $\mathcal{P}_0$ against any alternative $\mathcal{P}_1$. Importantly, no parametric assumption is needed on $\mathcal{P}_0$ or $\mathcal{P}_1$, and no relationship is imposed whatsoever between $\mathcal{P}_0$ and $\mathcal{P}_1$. As before, use $D_1$ to identify $\hat p_1 \in \mathcal{P}_1$, use $D_0$ to calculate the MLE $\hat p_0$ over $\mathcal{P}_0$, and define $T_n := \prod_{i \in D_0} \hat p_1(Y_i)/\hat p_0(Y_i)$, rejecting the null when $T_n > 1/\alpha$.
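The following generic template (again our own sketch, not the authors' code) implements the split LRT of Eq. 9: the user supplies a log-likelihood, any estimator fit on $D_1$, and a routine maximizing the log-likelihood over the null on $D_0$. The one-sided Gaussian example at the bottom is purely illustrative.

```python
import numpy as np

def split_lrt(y, loglik, fit_alt, fit_null, alpha=0.05, rng=None):
    """Generic split LRT (Eq. 9).

    loglik(theta, data) -> log-likelihood of `data` under parameter `theta`
    fit_alt(data)       -> any estimator theta_hat_1 computed from D1
    fit_null(data)      -> MLE theta_hat_0 over the null set, computed from D0
    Rejects H0 when L0(theta_hat_1) / L0(theta_hat_0) > 1 / alpha.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))
    d0, d1 = y[idx[: len(y) // 2]], y[idx[len(y) // 2 :]]

    theta_hat_1 = fit_alt(d1)
    theta_hat_0 = fit_null(d0)
    log_u = loglik(theta_hat_1, d0) - loglik(theta_hat_0, d0)
    return log_u > np.log(1 / alpha), log_u

# Example: test H0: theta <= 0 vs H1: theta > 0 for a N(theta, 1) model.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.normal(loc=0.5, size=400)
    loglik = lambda t, d: -0.5 * np.sum((d - t) ** 2)
    reject, log_u = split_lrt(
        y, loglik,
        fit_alt=lambda d: d.mean(),                  # unrestricted MLE on D1
        fit_null=lambda d: min(d.mean(), 0.0),       # MLE over {theta <= 0} on D0
        alpha=0.05, rng=rng)
    print("reject H0:", reject, " log split-LRS:", round(float(log_u), 2))
```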
We call these procedures universal to mean that they are valid in finite samples with no regularity conditions. Constructions like this are reminiscent of ideas used in sequential settings where an estimator is computed from past data and the likelihood is evaluated on current data; we expand on this in Section 7.
We note in passing that another universal set is the following. Define $B_n := \left\{\theta : \frac{\int_\Theta \mathcal{L}(u)\, d\pi(u)}{\mathcal{L}(\theta)} \le \frac{1}{\alpha}\right\}$, where $\mathcal{L}$ is the full likelihood (from all of the data) and $\pi$ is any prior. This also has the same coverage guarantee but requires specifying a prior and doing an integral. In irregular or nonparametric models, the integral will typically be intractable.
Perspective: Poor Man’s Chernoff Bound.
At first glance, the reader may worry that Markov’s inequality seems like a weak tool to use, resulting in an underpowered conservative test or confidence interval. However, this is not the right perspective. One should really view our proof as using a “poor man’s Chernoff bound.”
For a regular model, we would usually compare the log-likelihood ratio to the upper $\alpha$ quantile of a $\chi^2$ distribution (with degrees of freedom related to the difference in dimensionality of the null and alternate models). Instead, we compare the log-split-likelihood ratio to $\log(1/\alpha)$, which scales like the upper $\alpha$ quantile of a $\chi^2$ distribution with one degree of freedom.
In any case, instead of finding the asymptotic distribution of $\log U_n$ (usually having a moment-generating function, like a $\chi^2$), our proof should be interpreted as using the simpler but nontrivial fact that $\mathbb{E}_{\theta^*}[\exp(\log U_n)] = \mathbb{E}_{\theta^*}[U_n] \le 1$ under the null. Hence we are really using the fact that $\log U_n$ has an exponential tail, just as an asymptotic argument would.
A true Chernoff-style bound for a $\chi^2$ random variable $W$ would have bounded $\mathbb{E}[e^{\lambda W}]$ by an appropriate function of $\lambda$ and then optimized over the choice of $\lambda$ to obtain a tight bound. Our methods correspond to choosing a single fixed value of $\lambda$ (namely $\lambda = 1/2$ applied to $W = 2\log U_n$), leading us to call the technique a poor man's Chernoff bound. The key point is that our methods should be viewed as using Markov's inequality on the exponential of the random variable of interest.
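As a quick numerical check of this comparison (our own illustration), the split LRT threshold $2\log(1/\alpha)$ can be placed next to the usual $\chi^2$ quantiles:

```python
import numpy as np
from scipy.stats import chi2

alpha = 0.05
split_threshold = 2 * np.log(1 / alpha)          # threshold compared with 2 * log split-LRS
print("2 log(1/alpha)      :", round(split_threshold, 2))            # about 5.99
print("chi2(1) upper alpha :", round(chi2.ppf(1 - alpha, df=1), 2))  # about 3.84
print("chi2(5) upper alpha :", round(chi2.ppf(1 - alpha, df=5), 2))  # about 11.07
```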
Perspective: In-Sample versus Out-of-Sample Likelihood.
We may rewrite the universal set as
$C_n = \left\{\theta \in \Theta : 2\log\frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\theta)} \le 2\log(1/\alpha)\right\}.$
For a regular model, it is natural to compare the above expression to the usual LRT-based set $A_n$ from Eq. 1. At first, it may visually seem like the LRT-based set uses the threshold $\chi^2_{d,\alpha}$, while the universal set uses $2\log(1/\alpha)$, which is much smaller in high dimensions. However, a key point to keep in mind is that, comparing the numerators of the test statistics in both cases, the classical likelihood-ratio set uses an in-sample likelihood $\mathcal{L}(\hat\theta)$ and the split LRS confidence set uses an out-of-sample likelihood $\mathcal{L}_0(\hat\theta_1)$. Hence, simply comparing the thresholds does not suffice to draw a conclusion about the relative sizes of the confidence sets. We next check that for regular models, the size of the universal set indeed shrinks at the right rate.
2. Sanity Check: Regular Models
Although universal methods are not needed for well-behaved models, it is worth checking their behavior in these cases. We expect that $C_n$ would not have optimal size, but we would hope that it still shrinks at the optimal rate. We now confirm that this is true.
Throughout this example we treat the dimension $d$ as a fixed constant before subsequently turning our attention to an example where we more carefully track the dependence of the confidence set diameter on dimension. In this and subsequent sections we use the standard stochastic order notation $o_P(\cdot)$ and $O_P(\cdot)$ for convergence in probability and boundedness in probability (7). We make the following regularity assumptions (see for instance ref. 7 for a detailed discussion of these conditions):
1) The statistical model is identifiable; i.e., for any $\theta_1 \neq \theta_2$ it is the case that $P_{\theta_1} \neq P_{\theta_2}$. The statistical model is differentiable in quadratic mean (DQM) at $\theta^*$; i.e., there exists a score function $\dot\ell_{\theta^*}$ such that
$\int \left[\sqrt{p_\theta(y)} - \sqrt{p_{\theta^*}(y)} - \tfrac{1}{2}(\theta - \theta^*)^\top \dot\ell_{\theta^*}(y)\sqrt{p_{\theta^*}(y)}\right]^2 d\mu(y) = o\!\left(\|\theta - \theta^*\|^2\right) \quad \text{as } \theta \to \theta^*.$
2) The parameter space $\Theta$ is compact, and the log-likelihood is a smooth function of $\theta$; i.e., there is a measurable function $M$ with $\mathbb{E}_{\theta^*}[M^2(Y)] < \infty$ such that for any $\theta_1, \theta_2 \in \Theta$,
$\left|\log p_{\theta_1}(y) - \log p_{\theta_2}(y)\right| \le M(y)\,\|\theta_1 - \theta_2\|.$
3) A consequence of the DQM condition is that the Fisher information matrix
$I(\theta^*) = \mathbb{E}_{\theta^*}\!\left[\dot\ell_{\theta^*}(Y)\,\dot\ell_{\theta^*}(Y)^\top\right]$
is well defined, and we assume it is nondegenerate.
Under these conditions the optimal confidence set has (expected) diameter $O_P(1/\sqrt{n})$. Our first result shows that the same is true of the universal set, provided that the initial estimate $\hat\theta_1$ is $\sqrt{n}$-consistent; i.e., $\|\hat\theta_1 - \theta^*\| = O_P(1/\sqrt{n})$. Under the conditions of our theorem, this consistency condition is satisfied when $\hat\theta_1$ is the MLE, but our result is more generally applicable.
Theorem 5.
Suppose that $\hat\theta_1$ is a $\sqrt{n}$-consistent estimator of $\theta^*$. Under the assumptions above, the split LRT confidence set $C_n$ has diameter $O_P(1/\sqrt{n})$.
A proof of this result is in SI Appendix. At a high level, to bound the diameter of the split LRT set it suffices to show that for any $\theta$ sufficiently far from $\theta^*$, it is the case that
$\frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\theta)} > \frac{1}{\alpha}.$
To establish this, note that we can write this condition as
$\log\frac{\mathcal{L}_0(\theta^*)}{\mathcal{L}_0(\theta)} + \log\frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\theta^*)} > \log(1/\alpha).$
Bounding the first term requires showing that if we consider any $\theta$ sufficiently far from $\theta^*$, its likelihood is small relative to the likelihood of $\theta^*$. We build on the work of Wong and Shen (8), who provide uniform upper bounds on the likelihood ratio under technical conditions which ensure that the statistical model is not too big. Conversely, to bound the second term we need to argue that if $\hat\theta_1$ is sufficiently close to $\theta^*$, then it must be the case that their likelihoods cannot be too different. This in turn follows by exploiting the DQM condition.
Analyzing the Nonparametric Split LRT.
While our previous result focused on the diameter of the split LRT set in parametric problems, similar techniques also yield results in the nonparametric case. In this case, since we have no underlying parameter space, it will be natural to measure the diameter of our confidence set in terms of some metric on probability distributions. We consider bounding the diameter of our confidence set in the Hellinger metric. Formally, for two distributions $P$ and $Q$ with densities $p$ and $q$, the (squared) Hellinger distance is defined as
$H^2(P, Q) = \int \left(\sqrt{p(y)} - \sqrt{q(y)}\right)^2 d\mu(y).$
We will also require the use of the $\chi^2$ divergence given by
$\chi^2(P, Q) = \int \frac{\left(p(y) - q(y)\right)^2}{q(y)}\, d\mu(y),$
assuming that $P$ is absolutely continuous with respect to $Q$. Roughly, and analogous to our development in the parametric case, to bound the diameter of the split LRT confidence set, we need to ensure that our statistical model is not too large and further that our initial estimate $\hat p_1$ is sufficiently close to $p^*$.
To measure the size of $\mathcal{P}$ we use its Hellinger bracketing entropy. Denote by $H_B(u, \mathcal{P})$ the Hellinger bracketing entropy of the class of distributions $\mathcal{P}$, where the bracketing functions are separated by at most $u$ in the Hellinger distance (we refer to ref. 8 for a precise definition). We suppose that the bracketing entropy of $\mathcal{P}$ is not too large; i.e., for some $\varepsilon_n > 0$ we have that, for some constant $c > 0$,
$\int_{\varepsilon_n^2/c}^{\varepsilon_n} \sqrt{H_B(u, \mathcal{P})}\, du \le \sqrt{n}\,\varepsilon_n^2.$ [11]
Although we do not explore this in detail, we note in passing that the smallest value $\varepsilon_n$ for which the above condition is satisfied provides an upper bound on the rate of convergence of the nonparametric MLE in the Hellinger distance (8). To characterize the quality of $\hat p_1$ we use the $\chi^2$ divergence. Concretely, we suppose that
$\chi^2\!\left(P^*, \widehat{P}_1\right) = O_P\!\left(\varepsilon_n^2\right).$ [12]
Theorem 6.
Under conditions Eqs. 11 and 12, the split LRT confidence set has Hellinger diameter upper bounded by $O_P(\varepsilon_n)$.
Comparing LRT to Split LRT for the Multivariate Normal Case.
In the previous calculation we treated the dimension of the parameter space as fixed. To understand the behavior of the method as a function of dimension in the regular case, suppose that $Y_1, \ldots, Y_n \sim N(\theta^*, I_d)$, where $\theta^* \in \mathbb{R}^d$. Recalling that we use $\chi^2_{d,\alpha}$ and $z_\alpha$ to denote the upper $\alpha$ quantiles of the $\chi^2_d$ and standard Gaussian distributions, respectively, the usual confidence set for $\theta^*$ based on the LRT is
$A_n = \left\{\theta : \|\theta - \bar{Y}\|^2 \le \frac{\chi^2_{d,\alpha}}{n}\right\} \approx \left\{\theta : \|\theta - \bar{Y}\|^2 \le \frac{d + \sqrt{2d}\, z_\alpha}{n}\right\},$
where the second form follows from the normal approximation of the $\chi^2_d$ distribution. For the universal set, we use the sample average from $D_1$ as our initial estimate $\hat\theta_1$. Denoting the sample means of $D_0$ and $D_1$ by $\bar{Y}_0$ and $\bar{Y}_1$, we see that $C_n$ is the set of $\theta$ such that
$\frac{n}{4}\left(\|\theta - \bar{Y}_0\|^2 - \|\bar{Y}_1 - \bar{Y}_0\|^2\right) \le \log(1/\alpha).$
In other words, we may rewrite
$C_n = \left\{\theta : \|\theta - \bar{Y}_0\|^2 \le \|\bar{Y}_1 - \bar{Y}_0\|^2 + \frac{4\log(1/\alpha)}{n}\right\}.$
Next, note that $\mathbb{E}\,\|\bar{Y}_1 - \bar{Y}_0\|^2 = 4d/n$, so both sets have radii of order $\sqrt{d/n}$. Precisely, the squared radius of $C_n$ is
$\frac{4}{n}\left(\|Z\|^2 + \log(1/\alpha)\right),$
where $Z$ is a standard Gaussian vector in $\mathbb{R}^d$. So both their squared radii share the same scaling with $n$ and $d$, and for large $d$ and constant $\alpha$, the squared radius of $C_n$ is about 4 times larger than that of $A_n$.
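The following short simulation (our own sketch; the values of $n$, $d$, and $\alpha$ are arbitrary choices) computes both squared radii using the closed forms above, assuming a known identity covariance.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, d, alpha = 1000, 50, 0.05
theta_star = np.zeros(d)
y = rng.normal(theta_star, 1.0, size=(n, d))

# Classical LRT set: ball around the full-sample mean, squared radius chi2_{d,alpha} / n.
lrt_sq_radius = chi2.ppf(1 - alpha, df=d) / n

# Universal set: ball around the D0 mean, squared radius
# ||Ybar0 - Ybar1||^2 + (4/n) log(1/alpha), with |D0| = |D1| = n/2.
d0, d1 = y[: n // 2], y[n // 2 :]
universal_sq_radius = np.sum((d0.mean(0) - d1.mean(0)) ** 2) + (4 / n) * np.log(1 / alpha)

print("LRT squared radius       :", round(lrt_sq_radius, 4))
print("universal squared radius :", round(universal_sq_radius, 4))
print("ratio                    :", round(universal_sq_radius / lrt_sq_radius, 2))
```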
3. Examples
Mixture Models.
As a proof of concept, we do a small simulation to check the type I error and power for mixture models. Specifically, let $Y_1, \ldots, Y_n \sim P$, where $P$ is a distribution on $\mathbb{R}^d$. We want to distinguish the hypotheses in Eq. 2. For this brief example, we take $k = 1$ and $d = 1$.
Finding a test that provably controls the type I error at a given level has been elusive. A natural candidate is the likelihood-ratio statistic but, as mentioned earlier, this has an intractable limiting distribution. To the best of our knowledge, the only practical test for the above hypothesis with a tractable limiting distribution is the EM test due to ref. 4. This very clever test is similar to the likelihood-ratio test except that it includes some penalty terms and requires the maximization of some of the parameters to be restricted. However, the test requires choosing some tuning parameters and, more importantly, it is restricted to one-dimensional problems. There is no known confidence set for mixture problems with guaranteed coverage properties. Another approach is based on the bootstrap (5) but there is no proof of the validity of the bootstrap for mixtures.
Fig. 1 shows the power of the test when $\hat\theta_1$ is the MLE under the full model $\mathcal{M}_2$. The true model is taken to be $0.5\,\phi(y; -\delta) + 0.5\,\phi(y; \delta)$, where $\phi(\cdot\,; \mu)$ is a normal density with mean $\mu$ and variance 1. The null corresponds to $\delta = 0$. The MLE is obtained by the EM algorithm, which we assume converges on this simple problem. Understanding the local and global convergence (and nonconvergence) of the EM algorithm to the MLE is an active research area but is beyond the scope of this paper (refs. 9–11 and references therein). As expected, the test is conservative with type I error near 0 but has reasonable power once $\delta$ is large enough.
Fig. 1.
The plot shows the power of the universal/bootstrap (black/red) tests for a simple Gaussian mixture, as the mean separation $\delta$ varies ($\delta = 0$ is the null).
Fig. 1 also shows the power of the bootstrap test (5). Here, the P value is obtained by bootstrapping the LRS under the estimated null distribution. As expected, this has higher power than the universal test since it does not split the data. In this simulation, both tests control the type I error, but unfortunately the bootstrap test does not have any guarantee on the type I error, even asymptotically. The lower power of the universal test is the price paid for having a finite-sample guarantee. It is also worth noting that the bootstrap test requires running the EM algorithm for each bootstrap sample while the universal test requires only one EM run.
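A minimal version of this mixture experiment can be sketched as follows (our own code; the sample size, separation $\delta$, level, EM initialization, and iteration count are all illustrative assumptions, and the null model fixes the variance at 1 with an unknown mean as a simplification).

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, n_iter=200):
    """EM for a two-component Gaussian mixture with unit variances.
    Returns (pi, mu1, mu2); a simple data-driven initialization is used."""
    pi, mu1, mu2 = 0.5, np.quantile(y, 0.25), np.quantile(y, 0.75)
    for _ in range(n_iter):
        w1 = pi * norm.pdf(y, mu1, 1.0)
        w2 = (1 - pi) * norm.pdf(y, mu2, 1.0)
        r = w1 / (w1 + w2)                      # E-step: responsibilities
        pi = r.mean()                           # M-step
        mu1 = np.sum(r * y) / np.sum(r)
        mu2 = np.sum((1 - r) * y) / np.sum(1 - r)
    return pi, mu1, mu2

def mixture_loglik(y, pi, mu1, mu2):
    return np.sum(np.log(pi * norm.pdf(y, mu1, 1.0) + (1 - pi) * norm.pdf(y, mu2, 1.0)))

def split_lrt_mixture(y, alpha=0.1, rng=None):
    """Split LRT for H0: one Gaussian (unit variance) vs H1: two-component mixture."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))
    d0, d1 = y[idx[: len(y) // 2]], y[idx[len(y) // 2 :]]
    pi, mu1, mu2 = em_two_gaussians(d1)                 # theta_hat_1: EM fit on D1
    num = mixture_loglik(d0, pi, mu1, mu2)              # log L0(theta_hat_1)
    den = np.sum(norm.logpdf(d0, d0.mean(), 1.0))       # null MLE on D0 is the sample mean
    return num - den > np.log(1 / alpha)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    delta = 1.5
    z = rng.integers(0, 2, size=1000)
    y = rng.normal(np.where(z == 1, delta, -delta), 1.0)
    print("reject H0:", split_lrt_mixture(y, alpha=0.1, rng=rng))
```

Note that only one EM run (on $D_1$) is needed, in contrast to the bootstrap test, which refits the mixture for every bootstrap sample.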
Model Selection Using Sieves.
Sieves are a general approach to nonparametric inference. A sieve (12) is a sequence of nested models $\mathcal{P}_1 \subset \mathcal{P}_2 \subset \cdots$. If we assume that the true density $p^*$ is in $\mathcal{P}_k$ for some (unknown) $k$, then universal testing can be used to choose the model. One possibility is to test $H_{0,j} : p^* \in \mathcal{P}_j$ one by one for $j = 1, 2, \ldots$. We reject $H_{0,j}$ if
$\frac{\mathcal{L}_0(\hat p_1)}{\mathcal{L}_0(\hat p_{0,j})} > \frac{1}{\alpha},$
where $\hat p_{0,j}$ is the MLE in model $\mathcal{P}_j$ computed from $D_0$ and $\hat p_1$ is any estimator computed from $D_1$. Then we take $\hat k$ to be the first $j$ such that $H_{0,j}$ is not rejected and proclaim that $p^* \notin \mathcal{P}_j$ for all $j < \hat k$. Even though we test multiple different hypotheses and stop at a random $\hat k$, this procedure still controls the type I error, meaning that
$\mathbb{P}\!\left(\hat k \le k^*\right) \ge 1 - \alpha, \quad \text{where } k^* \text{ is the smallest index with } p^* \in \mathcal{P}_{k^*},$
meaning that our proclamation is correct with high probability. The reason we do not need to correct for multiple testing is because a type I error can occur only once we have reached the first $j$ such that $p^* \in \mathcal{P}_j$.
A simple application is to choose the number of mixture components in a mixture model, as discussed in the previous example. Here are some other interesting examples in which the aforementioned ideas yield valid tests and model selection using sieves: 1) testing the number of hidden states in a hidden Markov model (the MLE is computable using the Baum–Welch algorithm), 2) testing the number of latent factors in a factor model, and 3) testing the sparsity level in a high-dimensional linear model (under the null that the coefficient vector is $k$-sparse, the MLE corresponds to best-subset selection).
Whenever we can compute the MLE (specifically, the likelihood it achieves), we can run our universal test, and we can do model selection using sieves; a small sketch of case 3) above, in its simplest orthogonal-design form, is given below. We will later see that an upper bound on the maximum likelihood suffices and is sometimes achievable by minimizing convex relaxations of the negative log-likelihood.
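Here is that sketch (our own code): in the Gaussian sequence model $N(\theta, I_d)$, best-subset selection under a $k$-sparsity null reduces to keeping the $k$ largest coordinates of the sample mean, so the sieve of nulls $k = 0, 1, 2, \ldots$ can be tested directly; all simulation choices below are our own.

```python
import numpy as np

def k_sparse_mle(ybar, k):
    """MLE of a k-sparse mean in the Gaussian model N(theta, I_d):
    keep the k coordinates of the sample mean with largest magnitude."""
    theta = np.zeros_like(ybar)
    if k > 0:
        keep = np.argsort(-np.abs(ybar))[:k]
        theta[keep] = ybar[keep]
    return theta

def loglik(theta, data):              # Gaussian log-likelihood, unit covariance, up to constants
    return -0.5 * np.sum((data - theta) ** 2)

def sieve_select_sparsity(y, alpha=0.05, rng=None):
    """Test H0: 'theta is k-sparse' for k = 0, 1, 2, ... with the split LRT and
    return the first k that is not rejected."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))
    d0, d1 = y[idx[: len(y) // 2]], y[idx[len(y) // 2 :]]
    theta_hat_1 = d1.mean(axis=0)                      # unrestricted estimate from D1
    for k in range(y.shape[1] + 1):
        theta_hat_0 = k_sparse_mle(d0.mean(axis=0), k) # null MLE (best k-subset) from D0
        if loglik(theta_hat_1, d0) - loglik(theta_hat_0, d0) <= np.log(1 / alpha):
            return k                                   # first non-rejected model
    return y.shape[1]

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    d, n, s = 20, 400, 3
    theta_star = np.zeros(d); theta_star[:s] = 1.0
    y = rng.normal(theta_star, 1.0, size=(n, d))
    print("selected sparsity level:", sieve_select_sparsity(y, rng=rng))
```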
Nonparametric Example: Shape-Constrained Inference.
A density $p$ is log-concave if $p = e^{g}$ for some concave function $g$. Consider testing $H_0$: $p^*$ is log-concave versus $H_1$: $p^*$ is not log-concave. Let $\mathcal{P}_{\mathrm{LC}}$ be the set of log-concave densities and let $\hat p_0$ denote the nonparametric maximum-likelihood estimator over $\mathcal{P}_{\mathrm{LC}}$ computed using $D_0$ (13), which can be computed in polynomial time (14). Let $\hat p_1$ be any nonparametric density estimator, such as the kernel density estimator (15), fitted on $D_1$. In this case, the universal test is to reject when
$\prod_{i \in D_0} \frac{\hat p_1(Y_i)}{\hat p_0(Y_i)} > \frac{1}{\alpha}.$
To the best of our knowledge this is the first test for this problem with a finite-sample guarantee. Under the assumption that $p^* \in \mathcal{P}_{\mathrm{LC}}$, the universal confidence set is
$C_n = \left\{p \in \mathcal{P}_{\mathrm{LC}} : \prod_{i \in D_0} \frac{\hat p_1(Y_i)}{p(Y_i)} \le \frac{1}{\alpha}\right\}.$
While the aforementioned test can be efficiently performed, the set $C_n$ may be hard to represent explicitly, but we can check whether a given distribution belongs to $C_n$ efficiently.
Positive Dependence (Multivariate Total Positivity of Order 2).
The split LRT solves a variety of open problems related to testing for a general notion of positive dependence called multivariate total positivity of order 2 (MTP$_2$) (16). The convex optimization problem of maximum-likelihood estimation in Gaussian models under total positivity was recently solved (17), but in ref. 17, example 5.8 and the following discussion, they state that the testing problem is still open. Given data from a multivariate distribution $P$, consider testing $H_0$: $P$ is a Gaussian MTP$_2$ distribution against $H_1$: $P$ is Gaussian (or an even more general alternative). Since proposition 2.2 in ref. 17 shows that the MLE under the null can be efficiently calculated, our universal test is applicable.
In fact, calculating the MLE in any MTP$_2$ exponential family is a convex optimization problem (ref. 18, theorem 3.1), thus making a universal test immediately feasible. As a particularly interesting special case, ref. 18, section 5.1 provides an algorithm for computing the MLE for Ising models. Testing $H_0$: $P$ is an MTP$_2$ Ising model against $H_1$: $P$ is an Ising model is stated as an open problem in ref. 18, section 6, and is solved by our universal test. (We remark that even though the MLE is efficiently computable, evaluating the maximum likelihood in the Ising case may still take time exponential in the dimension for a $d$-dimensional problem.)
Finally, MTP$_2$ can be combined with log-concavity, uniting shape constraints and dependence. General existence and uniqueness properties of the MLE for totally positive log-concave densities have been recently derived (19), along with efficient algorithms to compute the MLE. Our methods immediately yield a test for $H_0$: $p^*$ is MTP$_2$ and log-concave against $H_1$: $p^*$ is log-concave.
All of the above models were singular, and hence the LRS has been hard to study. In some cases, its asymptotic null distribution is known to be a weighted sum of χ2 distributions, where the weights are rather complicated properties of the distributions (usually unknown to the practitioner). In contrast, the split LRT is applicable without assumptions, and its validity is nonasymptotic.
Independence versus Conditional Independence.
Consider data that are trivariate vectors of the form $(X, Y, Z)$ which are modeled as trivariate normal. The goal is to test $H_0$: $X$ and $Y$ are independent versus $H_1$: $X$ and $Y$ are independent given $Z$. The motivation for this test is that this problem arises in the construction of causal graphs. It is surprisingly difficult to test these nonnested hypotheses. Indeed, Guo and Richardson (20) study carefully the subtleties of the problem and they show that the limiting distribution of the LRS is complicated and cannot be used for testing. They propose a new test based on a concept called envelope distributions. Despite the fact that the hypotheses are nonnested, the universal test is applicable and can be used quite easily for this problem. Further, one can also flip $H_0$ and $H_1$ and test for conditional independence in the Gaussian setting as well. We leave it to future work to compare the power of the universal test and the envelope test.
Cross-Fitting Can Beat Splitting: Uniform Distribution.
In all previous examples, the split LRT is a reasonable choice. However, in this example, the cross-fit approach easily dominates the split approach. Note that this is a case where we would not recommend our universal tests since there are well-studied standard confidence intervals in this model. The example is just meant to bring out the difference between the split and cross-fit approaches.
Suppose that $p_\theta$ is the uniform density on $[0, \theta]$. Let us take $\hat\theta_1$ to be the MLE from $D_1$. Thus, $\hat\theta_1$ is the maximum of the data points in $D_1$. Now $\mathcal{L}_0(\hat\theta_1) = 0$ whenever $\hat\theta_1 < M_0$, where $M_0$ is the maximum of the data points in $D_0$. It follows that $U_n(\theta) = 0$ for every $\theta \ge M_0$ whenever $\hat\theta_1 < M_0$, which happens with probability 1/2. The set $C_n$ then has the required coverage but is too large to be useful. This happens because the densities have different support. A similar phenomenon occurs when testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ for some fixed $\theta_0$, but not when testing $H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$. One can partially avoid this behavior by choosing $\hat\theta_1$ to not be the MLE. However, the simplest way to avoid the degeneracy is to use the cross-fit approach, where we swap the roles of $D_0$ and $D_1$, and average the resulting test statistics. Exactly one of the two test statistics will be 0, and hence the average will be nonzero. Further, it is easy to show that this test and resulting interval are rate optimal, losing a constant factor due to data splitting over the standard tests and interval constructions. In more detail, the classical (exact) pivotal confidence interval for $\theta$ is $[M, M\,\alpha^{-1/n}]$, where $M$ is the maximum of all of the data points. On the other hand, for $S_n(\theta)$ defined above, assuming without loss of generality that $M$ lies in $D_1$, a direct calculation shows that the cross-fit interval takes the form $[M, M\,(2/\alpha)^{2/n}]$. Ignoring constants, both these intervals have expected length $O(1/n)$.
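The degeneracy and its cross-fit repair are easy to see numerically. The sketch below (our own code, using a fixed rather than random split and a grid of candidate $\theta$ values purely for illustration) computes both the split and the cross-fit confidence sets for Uniform$(0, \theta)$ data.

```python
import numpy as np

def unif_loglik(theta, data):
    """Log-likelihood of Uniform(0, theta); -inf if any observation exceeds theta."""
    if np.max(data) > theta:
        return -np.inf
    return -len(data) * np.log(theta)

def split_and_crossfit_sets(y, alpha=0.05, grid=None):
    d0, d1 = y[: len(y) // 2], y[len(y) // 2 :]       # a fixed split, for illustration
    m0, m1 = d0.max(), d1.max()
    if grid is None:
        grid = np.linspace(0.5 * y.max(), 2.0 * y.max(), 20001)

    def lrs(est, data, theta):                        # U(theta) = L_data(est) / L_data(theta)
        num, den = unif_loglik(est, data), unif_loglik(theta, data)
        if den == -np.inf:
            return np.inf                             # theta cannot cover its own data: excluded
        return np.exp(num - den)

    u = np.array([lrs(m1, d0, t) for t in grid])      # split LRS (theta_hat_1 = max of D1)
    s = np.array([(lrs(m1, d0, t) + lrs(m0, d1, t)) / 2 for t in grid])  # cross-fit LRS
    return grid[u <= 1 / alpha], grid[s <= 1 / alpha]

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    y = rng.uniform(0, 1, size=200)
    split_set, cross_set = split_and_crossfit_sets(y, alpha=0.05)
    print("split set     : [%.3f, %.3f]" % (split_set.min(), split_set.max()))
    print("cross-fit set : [%.3f, %.3f]" % (cross_set.min(), cross_set.max()))
```

When the overall maximum falls in $D_0$, the split set stretches to the upper end of the grid (it is effectively unbounded), while the cross-fit set remains a short interval just above the sample maximum.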
4. Derandomization
The universal method involves randomly splitting the data and the final inferences will depend on the randomness of the split. This may lead to instability, where different random splits produce different results; in a related context, this has been called the “P-value lottery” (21).
We can get rid of or reduce the variability of our inferences, at the cost of more computation by using many splits, while maintaining validity of the method. The key property that we used in both the universal confidence set and the split LRT is that $\mathbb{E}_{\theta^*}[T] \le 1$ under the null, where $T$ denotes the corresponding statistic (for example, $T = U_n(\theta^*)$). Imagine that we obtained $K$ such statistics $T_1, \ldots, T_K$ with the same property. Let
$\bar T = \frac{1}{K}\sum_{k=1}^{K} T_k.$
Then we still have that $\mathbb{E}_{\theta^*}[\bar T] \le 1$, and so inference using our universal methods can proceed using the combined statistic $\bar T$. Note that this is true regardless of the dependence between the statistics $T_1, \ldots, T_K$.
Using the aforementioned idea, we can immediately design natural variants of the universal method:
- K-fold. We can split the data once into $K$ folds. Then repeat the following $K$ times: Use $K-1$ folds to calculate $\hat\theta_1$ and evaluate the likelihood ratio on the last fold. Finally, average the $K$ statistics. Alternatively, we could use one fold to calculate $\hat\theta_1$ and evaluate the likelihood on the other $K-1$ folds.
- Subsampling. We do not need to split the data just once into $K$ folds. We can repeat the previous procedure for repeated random splits of the data into $K$ folds. We expect this to reduce variance that arises from the algorithmic randomness.
- All splits. We can remove all algorithmic randomness by considering all possible splits. While this is computationally infeasible, the potential statistical gains are worth studying.
We remark that all these variants allow a large amount of flexibility. For example, in cross-fitting, $\hat\theta_1$ need not be constructed the same way in both splits: It could be the MLE on one split, but a Bayesian estimator on another split. This flexibility could be useful if the user does not know which variant would lead to higher power in advance and would like to hedge across multiple natural choices. Similarly, in the $K$-fold version, if a user is unsure whether to evaluate the likelihood ratio on one fold or on $K-1$ folds, then the user can do both and average the statistics.
Of course, with such flexibility comes the risk of an analyst cherry picking the variant used after looking at which form of averaging results in the highest LR (this would correspond to taking the maximum instead of the average of multiple variants), but this is a broader issue. For this reason (and this reason alone), the cross-fitting LRT proposed initially may be a useful default in practice, since it is both conceptually and computationally simple. We have already seen that (twofold) cross-fit inference improves over split inference drastically in the case of the uniform distribution discussed in the previous section. We leave a more detailed theoretical and empirical analysis of the power of these variants to future work.
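For concreteness, a minimal sketch of the $K$-fold variant is given below (our own code; the Gaussian one-sided null and all numerical choices are illustrative). Each fold contributes a statistic with null expectation at most one, so their average can be compared with $1/\alpha$ exactly as before.

```python
import numpy as np

def kfold_split_lrt(y, loglik, fit_alt, fit_null, alpha=0.05, n_folds=5, rng=None):
    """K-fold variant: for each fold, fit theta_hat on the other folds and evaluate the
    likelihood ratio on the held-out fold; reject when the *average* statistic exceeds 1/alpha."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    stats = []
    for k in range(n_folds):
        held_out = y[folds[k]]
        rest = y[np.concatenate([folds[j] for j in range(n_folds) if j != k])]
        theta_hat_1 = fit_alt(rest)        # estimator from the other K-1 folds
        theta_hat_0 = fit_null(held_out)   # null MLE on the held-out fold
        # Each statistic has null expectation <= 1 (same argument as the split LRT).
        stats.append(np.exp(loglik(theta_hat_1, held_out) - loglik(theta_hat_0, held_out)))
    return np.mean(stats) > 1 / alpha

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    y = rng.normal(0.3, 1.0, size=500)
    loglik = lambda t, d: -0.5 * np.sum((d - t) ** 2)
    reject = kfold_split_lrt(
        y, loglik,
        fit_alt=lambda d: d.mean(),
        fit_null=lambda d: min(d.mean(), 0.0),   # MLE over the null {theta <= 0}
        alpha=0.05, n_folds=5, rng=rng)
    print("reject H0: theta <= 0?", reject)
```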
5. Extensions
Profile Likelihood and Nuisance Parameters.
Suppose that we are interested in some function $\psi^* = g(\theta^*)$ of the parameter. Let
$D_n = g(C_n),$
where we define $g(C_n) = \{g(\theta) : \theta \in C_n\}$. By construction, $D_n$ is a $1-\alpha$ confidence set for $\psi^*$. Defining the profile-likelihood function
$\mathcal{L}_0^{\dagger}(\psi) = \sup_{\theta \in \Theta:\, g(\theta) = \psi} \mathcal{L}_0(\theta),$ [13]
we can rewrite $D_n$ as
$D_n = \left\{\psi : \frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0^{\dagger}(\psi)} \le \frac{1}{\alpha}\right\}.$ [14]
In other words, the same data-splitting idea works for the profile likelihood too. As a particularly useful example, suppose $\theta = (\psi, \eta)$, where $\eta$ is a nuisance component; then we can define $g(\theta) = \psi$ to obtain a universal confidence set for only the component we care about.
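A minimal sketch of the profile-likelihood set of Eq. 14 (our own code) is given below for the Gaussian mean with the variance as a nuisance parameter; here the profile maximization over the nuisance is available in closed form, and the grid and simulation settings are illustrative.

```python
import numpy as np
from scipy.stats import norm

def gaussian_loglik(data, mu, sigma2):
    return np.sum(norm.logpdf(data, loc=mu, scale=np.sqrt(sigma2)))

def universal_profile_set_for_mean(y, alpha=0.05, grid=None, rng=None):
    """Universal confidence set (cf. Eq. 14) for the mean of a N(mu, sigma^2) model,
    treating the variance as a nuisance parameter via the profile likelihood on D0."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))
    d0, d1 = y[idx[: len(y) // 2]], y[idx[len(y) // 2 :]]

    mu1, s2_1 = d1.mean(), d1.var()                   # theta_hat_1 = full MLE on D1
    numerator = gaussian_loglik(d0, mu1, s2_1)        # log L0(theta_hat_1)
    if grid is None:
        grid = np.linspace(y.mean() - 3 * y.std(), y.mean() + 3 * y.std(), 2001)

    keep = []
    for mu in grid:
        s2_prof = np.mean((d0 - mu) ** 2)             # variance maximizing L0 at this mu
        profile = gaussian_loglik(d0, mu, s2_prof)    # log of the profile likelihood
        if numerator - profile <= np.log(1 / alpha):
            keep.append(mu)
    return np.array(keep)

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    y = rng.normal(2.0, 3.0, size=400)
    cs = universal_profile_set_for_mean(y, alpha=0.05, rng=rng)
    print("profile universal set for the mean: [%.3f, %.3f]" % (cs.min(), cs.max()))
```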
Upper Bounding the Null Maximum Likelihood.
Computing the MLE and/or the maximum likelihood (under the null) is sometimes computationally hard. Suppose one could come up with a relaxation $\mathcal{L}_0^{\mathrm{R}}$ of the null likelihood $\mathcal{L}_0$. This should be a proper relaxation in the sense that
$\sup_{\theta} \mathcal{L}_0^{\mathrm{R}}(\theta) \ \ge\ \sup_{\theta \in \Theta_0} \mathcal{L}_0(\theta).$
For example, $\mathcal{L}_0$ may be defined as zero outside its domain, but $\mathcal{L}_0^{\mathrm{R}}$ could extend the domain. As another example, instead of minimizing the negative log-likelihood, which could be nonconvex and hence hard to minimize, we could minimize a convex relaxation. In such settings, define
$\widehat{L}_0^{\mathrm{R}} := \sup_{\theta} \mathcal{L}_0^{\mathrm{R}}(\theta).$
If we define the test statistic
$U_n^{\mathrm{R}} := \frac{\mathcal{L}_0(\hat\theta_1)}{\widehat{L}_0^{\mathrm{R}}},$
then the split LRT may proceed using $U_n^{\mathrm{R}}$ instead of $U_n$. This is because $U_n^{\mathrm{R}} \le U_n$, and hence $\mathbb{E}_{\theta^*}[U_n^{\mathrm{R}}] \le 1$ under the null.
One particular case when this would be useful is the following. While discussing sieves, we mentioned that testing the sparsity level in a high-dimensional linear model involves solving the best-subset selection problem, which is NP-hard in the worst case. There exist well-known quadratic programming relaxations that are more computationally tractable. Another example is testing whether a random graph is a stochastic block model, for which semidefinite relaxations of the MLE are well studied (22); similar situations arise in communication theory (23) and angular synchronization (24).
The takeaway message is that it suffices to upper bound the maximum likelihood to perform inference.
Robustness via Powered Likelihoods.
It has been suggested by some authors (25–29) that inferences can be made robust by replacing the likelihood $\prod_{i} p_\theta(Y_i)$ with the power likelihood $\prod_{i} p_\theta^{\eta}(Y_i)$ for some $0 < \eta < 1$. Note that
$\mathbb{E}_{\theta^*}\!\left[\prod_{i \in D_0} \frac{p_{\hat\theta_1}^{\eta}(Y_i)}{p_{\theta^*}^{\eta}(Y_i)} \,\Bigg|\, D_1\right] = \prod_{i \in D_0} \int p_{\hat\theta_1}^{\eta}(y)\, p_{\theta^*}^{1-\eta}(y)\, d\mu(y) \le 1,$
and hence all of the aforementioned methods can be used with the robustified likelihood as well. (The last inequality follows because the $\eta$-Rényi divergence is nonnegative.)
Smoothed Likelihoods.
Sometimes the MLE is not consistent or it may not exist since the likelihood function is unbounded, and a (doubly) smoothed likelihood has been proposed as an alternative (30). For simplicity, consider a kernel $K_h$ with bandwidth $h$ such that $\int K_h(y - x)\, dy = 1$ for any $x$, for example a Gaussian or Laplace kernel. For any density $p$, let its smoothed version be denoted
$p^{h}(x) = \int K_h(x - y)\, p(y)\, dy.$
Note that $p^{h}$ is also a probability density. Denote the smoothed empirical density based on $D_0$ as
$\hat p_0^{h}(x) = \frac{1}{|D_0|} \sum_{i \in D_0} K_h(x - Y_i).$
Define the smoothed maximum-likelihood estimator $\hat\theta_0^{h}$ as the Kullback–Leibler (KL) projection of $\hat p_0^{h}$ onto $\{p_\theta^{h} : \theta \in \Theta_0\}$,
$\hat\theta_0^{h} = \operatorname*{arg\,min}_{\theta \in \Theta_0} \mathrm{KL}\!\left(\hat p_0^{h},\, p_\theta^{h}\right),$
where $\mathrm{KL}(p, q)$ denotes the KL divergence between $p$ and $q$. If we define the smoothed likelihood on $D_0$ as
$\mathcal{L}_0^{h}(\theta) = \exp\!\left(\sum_{i \in D_0} \int K_h(y - Y_i)\, \log p_\theta^{h}(y)\, dy\right),$
then it can be checked that $\hat\theta_0^{h}$ maximizes the smoothed likelihood; that is, $\hat\theta_0^{h} = \arg\max_{\theta \in \Theta_0} \mathcal{L}_0^{h}(\theta)$. As before, let $\hat\theta_1$ be any estimator based on $D_1$. The smoothed split LRT is defined analogously to Eq. 9 as: reject $H_0$ if
$\frac{\mathcal{L}_0^{h}(\hat\theta_1)}{\mathcal{L}_0^{h}(\hat\theta_0^{h})} > \frac{1}{\alpha}.$ [15]
We now verify that the smoothed split LRT controls the type I error. First, for any fixed $\bar\theta$, we have
$\mathbb{E}_{\theta^*}\!\left[\frac{\mathcal{L}_0^{h}(\bar\theta)}{\mathcal{L}_0^{h}(\hat\theta_0^{h})}\right] \ \overset{(i)}{\le}\ \mathbb{E}_{\theta^*}\!\left[\frac{\mathcal{L}_0^{h}(\bar\theta)}{\mathcal{L}_0^{h}(\theta^*)}\right] \ \overset{(ii)}{\le}\ \prod_{i \in D_0} \int \frac{p_{\bar\theta}^{h}(y)}{p_{\theta^*}^{h}(y)}\, \mathbb{E}_{\theta^*}\!\left[K_h(y - Y_i)\right] dy \ =\ \prod_{i \in D_0} \int p_{\bar\theta}^{h}(y)\, dy \ =\ 1.$
Above, step (i) is because $\hat\theta_0^{h}$ maximizes the smoothed likelihood over $\Theta_0 \ni \theta^*$, and step (ii) follows by Jensen's inequality. An argument mimicking Eqs. 6 and 7 completes the proof. As a last remark, similar to the unsmoothed case, note that upper bounding the smoothed maximum likelihood under the null also suffices.
Conditional Likelihood for Non-i.i.d. Data.
Our presentation so far has assumed that the data are drawn i.i.d. from some distribution under the null. However, this is not really required (even under the null) and was assumed for expositional simplicity. All that is needed is that we can calculate the likelihood on $D_0$ conditional on $D_1$ (or vice versa). For example, this could be tractable in models involving sampling without replacement from an urn with $N$ balls. Here $\theta$ could represent the unknown numbers of balls of different colors. Such hypergeometric sampling schemes result in non-i.i.d. data, but conditional on one subset of data (for example, how many red, green, and blue balls were sampled from the urn in that subset), one can evaluate the conditional likelihood of the second half of the data and maximize it, rendering it possible to apply our universal tests and confidence sets.
6. Misspecification and Convex Model Classes
There are some natural examples of convex model classes (31, 32), including 1) all mixtures (potentially infinite) of a set of base distributions, 2) distributions with the first moment specified/bounded and possibly other moments bounded (e.g., first moment equals zero, second moment bounded by one), 3) the set of (coordinate-wise) monotonic densities with the same support, 4) unimodal densities with the same mode, 5) densities that are symmetric about the same point, 6) distributions with the same median or multiple quantiles (e.g., median = 0, 0.9 quantile = 2), 7) the set of all $k$-tuples of distributions satisfying a fixed partial stochastic ordering (e.g., all triplets $(P_1, P_2, P_3)$ such that $P_1 \succeq P_2$ and $P_2 \succeq P_3$, where $\succeq$ is the usual stochastic ordering), and 8) the set of convex densities with the same support. Some cases like 6) and 7) also result in weakly closed convex sets, as does case 2) for a specified mean. (Several of these examples also apply in discrete settings such as constrained multinomials.)
It is often possible to calculate the MLE over these convex model classes using convex optimization; for example see refs. 33 and 34 for case 7). This renders our universal tests and confidence sets immediately applicable. However, in this special case, it is also possible to construct additional tests, and the universal confidence set has some nontrivial guarantees if the model is misspecified.
Model Misspecification.
Suppose the data come from a distribution $Q^*$ with density $q^*$ that lies outside $\mathcal{P}$, meaning that the model is misspecified and the true distribution does not belong to the considered model. In this case, what does the universal set defined in Eq. 4 contain? We will answer this question when the set of measures/densities $\mathcal{P}$ is convex. Define the Kullback–Leibler divergence of $Q^*$ from $P$ as
$\mathrm{KL}(Q^*, P) = \int q^*(y)\, \log\frac{q^*(y)}{p(y)}\, d\mu(y).$
Following definition 4.2 in Li's (31) PhD thesis, a function $\overleftarrow{p}$ is called the reversed information projection (RIPR) of $q^*$ onto $\mathcal{P}$ if for every sequence $p_m \in \mathcal{P}$ with $\mathrm{KL}(Q^*, P_m) \to \inf_{P \in \mathcal{P}} \mathrm{KL}(Q^*, P)$, we have $p_m \to \overleftarrow{p}$ in an appropriate sense (ref. 31, definition 4.2). Theorem 4.3 in ref. 31 proves that $\overleftarrow{p}$ exists and is unique, satisfies $\mathrm{KL}(Q^*, \overleftarrow{P}) = \inf_{P \in \mathcal{P}} \mathrm{KL}(Q^*, P)$, and
$\mathbb{E}_{Q^*}\!\left[\frac{p(Y)}{\overleftarrow{p}(Y)}\right] \le 1 \quad \text{for every } p \in \mathcal{P}.$ [16]
The above statement can be loosely interpreted as "if the data come from $q^*$, its RIPR will have higher likelihood than any other model in expectation." We discuss this condition further at the end of this subsection.
It might be reasonable to ask whether the universal set contains $\overleftarrow{p}$. For various technical reasons (detailed in ref. 31) it is not the case, in general, that $\overleftarrow{p}$ belongs to the collection $\mathcal{P}$. Since the universal set considers densities in $\mathcal{P}$ only by construction, it cannot possibly contain $\overleftarrow{p}$ in general. However, when $\overleftarrow{p}$ is a density in $\mathcal{P}$, then it is indeed covered by our universal set.
Proposition 7.
Suppose that the data come from $Q^*$. If $\mathcal{P}$ is convex and there exists a density $\overleftarrow{p} \in \mathcal{P}$ such that $\mathrm{KL}(Q^*, \overleftarrow{P}) = \inf_{P \in \mathcal{P}} \mathrm{KL}(Q^*, P)$, then we have $\mathbb{P}_{Q^*}\!\left(\overleftarrow{p} \in C_n\right) \ge 1 - \alpha$.
The proof is short. Examining the proof of Theorem 1, we must simply verify that for each fixed $p \in \mathcal{P}$, we have
$\mathbb{E}_{Q^*}\!\left[\prod_{i \in D_0} \frac{p(Y_i)}{\overleftarrow{p}(Y_i)}\right] \le 1,$
which follows from Eq. 16. Here is a heuristic argument for why Eq. 16 holds when $\overleftarrow{p} \in \mathcal{P}$. For any $p \in \mathcal{P}$ and $t \in [0, 1]$, note that $(1-t)p + t\overleftarrow{p} \in \mathcal{P}$ since $\mathcal{P}$ is convex. The Karush–Kuhn–Tucker condition for this optimization problem is that the derivative of $\mathrm{KL}\!\left(Q^*, (1-t)P + t\overleftarrow{P}\right)$ with respect to $t$ is nonpositive at $t = 1$ (the minimizer). Exchanging derivative and integral immediately yields Eq. 16. This argument is formalized in ref. 31, chap. 4.
An Alternate Split LRT (RIPR Split LRT).
We return back to the well-specified case for the rest of this paper. First note that the fact in Eq. 16 can be rewritten as
$\sup_{P \in \mathcal{P}} \mathbb{E}_{P}\!\left[\frac{q(Y)}{\overleftarrow{q}(Y)}\right] \le 1 \quad \text{for any density } q, \text{ where } \overleftarrow{q} \text{ is the RIPR of } q \text{ onto the convex class } \mathcal{P},$ [17]
which is informally interpreted as "if the data come from $\mathcal{P}$, then any alternative $q$ will have lower likelihood than its RIPR in expectation." This motivates the development of an alternate RIPR split LRT to test composite null hypotheses that is defined as follows. As before, we divide the data into two parts, $D_0$ and $D_1$, and let $\hat\theta_1$ be any estimator found using only $D_1$. Now, define $\overleftarrow{p}_0$ to be the RIPR of $p_{\hat\theta_1}$ onto the null set $\mathcal{P}_0 = \{p_\theta : \theta \in \Theta_0\}$. The RIPR split LRT rejects the null if
$\prod_{i \in D_0} \frac{p_{\hat\theta_1}(Y_i)}{\overleftarrow{p}_0(Y_i)} > \frac{1}{\alpha}.$
The main difference from the original MLE split LRT is that earlier we ignored $\hat\theta_1$ and simply calculated the MLE under the null based on $D_0$.
Proposition 8.
If $\mathcal{P}_0$ is a convex set of densities, then $\sup_{\theta^* \in \Theta_0} \mathbb{P}_{\theta^*}(\text{RIPR split LRT rejects } H_0) \le \alpha$.
The fact that $\overleftarrow{p}_0$ is potentially not an element of $\mathcal{P}_0$ does not matter here. The validity of the test follows exactly the same logic as the MLE split LRT, observing that Eq. 17 implies that for any true $\theta^* \in \Theta_0$, we have
$\mathbb{E}_{\theta^*}\!\left[\prod_{i \in D_0} \frac{p_{\hat\theta_1}(Y_i)}{\overleftarrow{p}_0(Y_i)} \,\Bigg|\, D_1\right] \le 1.$
Without sample splitting and with a fixed alternative distribution, the RIPR LRT has been recently studied (35). When $\mathcal{P}_0$ is convex and the RIPR split LRT is implementable, meaning that it is computationally feasible to find the RIPR or evaluate its likelihood, then this test can be more powerful than the MLE split LRT. Specifically, if the RIPR is actually a density in the null set, then
$\prod_{i \in D_0} \frac{p_{\hat\theta_1}(Y_i)}{\overleftarrow{p}_0(Y_i)} \ \ge\ \prod_{i \in D_0} \frac{p_{\hat\theta_1}(Y_i)}{p_{\hat\theta_0}(Y_i)},$
since $\hat\theta_0$ maximizes the denominator among null densities. Because of the restriction to convex sets, and since there exist many more subroutines to calculate the MLE over a set than to find the RIPR, the MLE split LRT is more broadly applicable than the RIPR split LRT.
7. Anytime Values and Confidence Sequences
Just like the sequential likelihood-ratio test (36) extends the LRT, the split LRT has a simple sequential extension. Similarly, the confidence set can be extended to a “confidence sequence” (37).
Suppose the split LRT failed to reject the null. Then we are allowed to collect more data and update the test statistic (in a particular fashion) and check if the updated statistic crosses $1/\alpha$. If it does not, we can further collect more data and reupdate the statistic, and this process can be repeated indefinitely. Importantly we do not need any correction for repeated testing; this is primarily because the statistic is upper bounded by a nonnegative martingale. We describe the procedure next in the case when each additional dataset is of size one, but the same idea applies when we collect data in groups.
The Running MLE Sequential LRT.
Consider the following, more standard, sequential testing/estimation setup. We observe an i.i.d. sequence $Y_1, Y_2, \ldots$ from $P_{\theta^*}$. We want to test the hypothesis in Eq. 8. Let $\hat\theta_t$ be any nonanticipating estimator based on the first $t$ samples, for example the MLE, $\arg\max_{\theta \in \Theta} \prod_{i \le t} p_\theta(Y_i)$, or a regularized version of it to avoid misbehavior at small sample sizes. Denote the null MLE as
$\hat\theta_{0,t} := \operatorname*{arg\,max}_{\theta \in \Theta_0} \prod_{i \le t} p_\theta(Y_i).$
At any time $t$, reject the null and stop if
$R_t := \frac{\prod_{i=1}^{t} p_{\hat\theta_{i-1}}(Y_i)}{\prod_{i=1}^{t} p_{\hat\theta_{0,t}}(Y_i)} > \frac{1}{\alpha}.$
This test is computationally expensive: We must calculate $\hat\theta_t$ and $\hat\theta_{0,t}$ at each step. In some cases, these may be quick to calculate by warm starting from $\hat\theta_{t-1}$ and $\hat\theta_{0,t-1}$. For example, the updates can be done in constant time for exponential families, since the MLE is often a simple function of the sufficient statistics. However, even in these cases, the denominator takes $O(t)$ time to recompute at step $t$.
The following result shows that with probability at least $1 - \alpha$, this test will never stop under the null. Let $\tau$ denote the stopping time when the data are drawn from $P_{\theta^*}$ for some $\theta^* \in \Theta_0$, which is finite only if we stop and reject the null.
Theorem 9.
The running MLE LRT has type I error at most $\alpha$, meaning that $\sup_{\theta^* \in \Theta_0} \mathbb{P}_{\theta^*}(\tau < \infty) \le \alpha$.
The proof involves the simple observation that under the null, $R_t$ is upper bounded by a nonnegative martingale with initial value one. Specifically, define the (oracle) process starting with $M_0 = 1$ and
$M_t := \prod_{i=1}^{t} \frac{p_{\hat\theta_{i-1}}(Y_i)}{p_{\theta^*}(Y_i)}.$ [18]
Note that under the null, we have $R_t \le M_t$ because $\hat\theta_{0,t}$ and $\theta^*$ both belong to $\Theta_0$, but the former maximizes the null likelihood (denominator). Further, it is easy to verify that $(M_t)_{t \ge 0}$ is a nonnegative martingale with respect to the natural filtration $\mathcal{F}_t = \sigma(Y_1, \ldots, Y_t)$. Indeed,
$\mathbb{E}_{\theta^*}\!\left[M_t \mid \mathcal{F}_{t-1}\right] = M_{t-1}\, \mathbb{E}_{\theta^*}\!\left[\frac{p_{\hat\theta_{t-1}}(Y_t)}{p_{\theta^*}(Y_t)} \,\Big|\, \mathcal{F}_{t-1}\right] = M_{t-1} \int p_{\hat\theta_{t-1}}(y)\, d\mu(y) = M_{t-1},$
where the last equality mimics Eq. 6. To complete the proof, we note that the type I error of the running MLE LRT is simply bounded as
$\mathbb{P}_{\theta^*}(\tau < \infty) = \mathbb{P}_{\theta^*}\!\left(\exists t \ge 1 : R_t > \frac{1}{\alpha}\right) \le \mathbb{P}_{\theta^*}\!\left(\exists t \ge 1 : M_t > \frac{1}{\alpha}\right) \overset{(i)}{\le} \alpha,$
where step (i) follows by Ville's inequality (38, 39), a time-uniform version of Markov's inequality for nonnegative supermartingales.
Naturally, this test does not have to start at $t = 1$ when only one sample is available, meaning that we can set $R_t = 1$ for the first few steps and then begin the updates. Similarly, $t$ need not represent the time at which the $t$th sample was observed; it can just represent the $t$th recalculation of the estimators (there may be multiple samples observed between recalculations $t$ and $t+1$).
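A minimal sketch of the running MLE sequential LRT (our own code) is given below for the composite null $\theta \le 0$ in a $N(\theta, 1)$ model; the lightly regularized running mean used as the nonanticipating estimator and all numerical settings are illustrative assumptions.

```python
import numpy as np

def running_mle_lrt(stream, alpha=0.05):
    """Sequential (running-MLE) LRT for H0: theta <= 0 in a N(theta, 1) model.
    The numerator evaluates each new point under the nonanticipating estimate
    theta_hat_{i-1}; the denominator is recomputed as the null MLE on the first
    i points. Stops and rejects the first time the ratio exceeds 1/alpha."""
    log_threshold = np.log(1 / alpha)
    log_num, total, count = 0.0, 0.0, 0
    for i, y in enumerate(stream, start=1):
        theta_prev = total / (count + 1)                  # regularized running mean (shrunk toward 0)
        log_num += -0.5 * (y - theta_prev) ** 2           # numerator term for Y_i under theta_hat_{i-1}
        total, count = total + y, count + 1
        theta_null = min(total / count, 0.0)              # null MLE over {theta <= 0} on first i points
        log_den = -0.5 * np.sum((stream[:i] - theta_null) ** 2)
        if log_num - log_den > log_threshold:
            return i                                      # reject H0 at time i
    return None                                           # never rejected

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    stream = rng.normal(0.4, 1.0, size=2000)              # data from the alternative
    print("rejected at time:", running_mle_lrt(stream, alpha=0.05))
```

As noted above, the denominator recomputation makes each step cost $O(t)$ here; warm-started or sufficient-statistic updates would remove most of that cost.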
Anytime-Valid P Values.
We can also get a P value that is uniformly valid over time. Specifically, both $p_t := 1/R_t$ and the running minimum $\tilde p_t := \min_{s \le t} 1/R_s$ may serve as P values.
Theorem 10.
For any random time $T$, not necessarily a stopping time, $\mathbb{P}_{\theta^*}(\tilde p_T \le \alpha) \le \alpha$ (and hence also $\mathbb{P}_{\theta^*}(p_T \le \alpha) \le \alpha$) for every $\theta^* \in \Theta_0$.
The aforementioned property is equivalent to the statement that, under the null, $\mathbb{P}_{\theta^*}(\exists t \ge 1 : \tilde p_t \le \alpha) \le \alpha$, and its proof follows by substitution immediately from the previous argument. Naturally $\tilde p_t \le p_t$, but from the perspective of designing a level-$\alpha$ test they are equivalent, because the first time that $p_t$ falls below $\alpha$ is also the first time that $\tilde p_t$ falls below $\alpha$. The term "anytime-valid" is used because, unlike typical P values, these are valid at (data-dependent) stopping times or even random times chosen post hoc. Hence, inference is robust to "peeking," optional stopping, and optional continuation of experiments. Such anytime-valid P values can be inverted to yield confidence sequences, as described below.
Confidence Sequences.
A confidence sequence for $\theta^*$ is an infinite sequence of confidence intervals $(C_t)_{t \ge 1}$ that are all simultaneously valid. Such confidence intervals are valid at arbitrary stopping times and also at other random data-dependent times that are chosen post hoc. In the same setup as above, but without requiring a null set $\Theta_0$, define the running MLE likelihood-ratio process
$R_t(\theta) := \prod_{i=1}^{t} \frac{p_{\hat\theta_{i-1}}(Y_i)}{p_{\theta}(Y_i)}.$
Then, a $1-\alpha$ confidence sequence for $\theta^*$ is given by
$C_t := \left\{\theta \in \Theta : R_t(\theta) < \frac{1}{\alpha}\right\}.$
In fact, the running intersection $\tilde C_t := \bigcap_{s \le t} C_s$ is also a confidence sequence; note that $\tilde C_t \subseteq C_t$.
Theorem 11.
$(C_t)_{t \ge 1}$ and $(\tilde C_t)_{t \ge 1}$ are $1-\alpha$ confidence sequences for $\theta^*$, meaning that $\mathbb{P}_{\theta^*}(\exists t \ge 1 : \theta^* \notin C_t) \le \alpha$. Equivalently, $\mathbb{P}_{\theta^*}(\theta^* \notin C_\tau) \le \alpha$ for any stopping time $\tau$, and also $\mathbb{P}_{\theta^*}(\theta^* \notin C_T) \le \alpha$ for any arbitrary random time $T$.
The proof is straightforward. First, note that $\theta^* \notin C_t$ for some $t$ if and only if $R_t(\theta^*) \ge 1/\alpha$ for some $t$. Hence,
$\mathbb{P}_{\theta^*}\!\left(\exists t \ge 1 : \theta^* \notin C_t\right) = \mathbb{P}_{\theta^*}\!\left(\exists t \ge 1 : R_t(\theta^*) \ge \frac{1}{\alpha}\right) \le \alpha,$
where the last step uses, as before, Ville's inequality for the martingale $R_t(\theta^*)$, which is exactly the process $M_t$ from Eq. 18. The fact that the other two statements in Theorem 11 are equivalent to the first one follows from recent work (40).
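A minimal sketch of such a confidence sequence (our own code) is given below for the mean of a $N(\theta, 1)$ model, evaluating the running MLE likelihood-ratio process on a grid of candidate values and maintaining the running intersection; all settings are illustrative.

```python
import numpy as np

def gaussian_mean_conf_seq(stream, alpha=0.05, grid=None):
    """Confidence sequence for the mean of a N(theta, 1) model via the running-MLE
    likelihood-ratio process R_t(theta), evaluated on a grid of candidate theta values.
    Returns the running-intersection interval after each observation."""
    if grid is None:
        grid = np.linspace(-5, 5, 1001)
    log_r = np.zeros_like(grid)          # log R_t(theta) for each grid point, starting at t = 0
    total, count = 0.0, 0
    lo, hi = grid[0], grid[-1]           # running intersection, kept as an interval
    intervals = []
    for y in stream:
        theta_prev = total / (count + 1)                 # nonanticipating (regularized) estimate
        log_r += -0.5 * (y - theta_prev) ** 2 + 0.5 * (y - grid) ** 2
        total, count = total + y, count + 1
        inside = grid[log_r < np.log(1 / alpha)]         # C_t = {theta : R_t(theta) < 1/alpha}
        if inside.size:                                  # intersect with all previous sets
            lo, hi = max(lo, inside.min()), min(hi, inside.max())
        intervals.append((lo, hi))
    return intervals

if __name__ == "__main__":
    rng = np.random.default_rng(9)
    stream = rng.normal(1.0, 1.0, size=1000)
    cs = gaussian_mean_conf_seq(stream, alpha=0.05)
    for t in (10, 100, 1000):
        print("t = %4d  interval = [%.3f, %.3f]" % (t, cs[t - 1][0], cs[t - 1][1]))
```

The per-time set is an interval here because $\log R_t(\theta)$ is convex in $\theta$ for the Gaussian model, so the running intersection can be tracked with two endpoints.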
Duality.
It is worth remarking that confidence sequences are dual to anytime-valid P values, just like confidence intervals are dual to standard P values, in the sense that a confidence sequence can be formed by inverting a family of level-$\alpha$ sequential tests (each testing a different point $\theta$ in the space $\Theta$), and a level-$\alpha$ sequential test for a composite null set $\Theta_0$ can be obtained by checking whether the confidence sequence intersects $\Theta_0$.
In fact, our constructions of $R_t$ and $C_t$ (without running minimum/intersection) obey the same property: $R_t > 1/\alpha$ only if $C_t \cap \Theta_0 = \emptyset$, and the reverse implication follows if $\Theta_0$ is closed. To see the forward implication, assume that there exists some element $\theta \in C_t \cap \Theta_0$. Since $\theta \in \Theta_0$, we have $\prod_{i \le t} p_{\hat\theta_{0,t}}(Y_i) \ge \prod_{i \le t} p_{\theta}(Y_i)$. Since $\theta \in C_t$, we have $R_t(\theta) < 1/\alpha$. This last condition can be restated as $\prod_{i \le t} p_{\hat\theta_{i-1}}(Y_i) < \frac{1}{\alpha}\prod_{i \le t} p_{\theta}(Y_i) \le \frac{1}{\alpha}\prod_{i \le t} p_{\hat\theta_{0,t}}(Y_i)$, which means that $R_t < 1/\alpha$.
It is also possible to obtain an anytime-valid P value from a family of confidence sequences at different levels $\alpha$, by defining the P value at time $t$ as the smallest $\alpha$ for which the level-$\alpha$ set $C_t$ no longer intersects $\Theta_0$.
Extensions.
All of the extensions from Section 5 extend immediately to the sequential setting. One can handle nuisance parameters using profile likelihoods; this for example leads to sequential $t$ tests (for the Gaussian family, with the variance as a nuisance parameter), which also yield confidence sequences for the Gaussian mean with unknown variance. Non-i.i.d. data, such as in sampling without replacement, can be handled using conditional likelihoods, and robustness can be increased with powered likelihoods. In these situations, the corresponding underlying process may not be a martingale, but a supermartingale. Also, as before, we may use upper bounds on the maximum likelihood at each step (perhaps minimizing convex relaxations of the negative log-likelihood) or smooth the likelihood if needed.
Such confidence sequences have been developed under very general nonparametric, multivariate, matrix, and continuous-time settings using generalizations of the aforementioned supermartingale technique (39–41). The connections between anytime-valid P values, e values, safe tests, peeking, confidence sequences, and the properties of optional stopping and continuation have been explored recently (35, 40, 42, 43). The connection to the present work is that when run sequentially, our universal (MLE or RIPR) split LRT yields an anytime-valid P value, an e value, and a safe test, which can be inverted to form universal confidence sequences and are valid under optional stopping and continuation; these properties hold simply because the underlying process of interest is bounded by a nonnegative (super)martingale. This line of research began over 50 y ago by Darling and Robbins (37), Robbins (44), Robbins and Siegmund (45), and Lai (46, 47). In fact, for testing point nulls, the running MLE (or nonanticipating) martingale was suggested in passing by Wald (ref. 48, equation 10:10) and analyzed in depth by refs. 45 and 49, where connections were shown to the mixture sequential probability-ratio test. These ideas have been utilized in changepoint detection for both point nulls (50) and composite nulls (51).
8. Conclusion
Inference based on the split likelihood-ratio statistic (and variants) leads to simple tests and confidence sets with finite-sample guarantees. Our methods are most useful in problems where standard asymptotic methods are difficult/impossible to apply, such as complex composite null testing problems or nonparametric confidence sets. Going forward, we intend to run simulations in a variety of models to study the power of the test and the size of the confidence sets and study their optimality in special cases. We do not expect the test to be rate optimal in all cases, but it might have analogous properties to the generalized LRT. It would also be interesting to extend these methods (like the profile-likelihood variant) to semiparametric problems where there are a finite-dimensional parameter of interest and an infinite-dimensional nuisance parameter.
9. Data Availability
Due to space constraints, we have relegated technical details of the proofs of Theorems 5 and 6 to SI Appendix. There are no additional data, protocols, or code associated with this paper.
Supplementary Material
Acknowledgments
We thank Caroline Uhler and Arun K. Kuchibhotla for references to open problems in shape-constrained inference and Ryan Tibshirani for suggesting the relaxed-likelihood idea. We are grateful to Bin Yu, Hue Wang and Marco Molinaro for helpful feedback which motivated parts of Section 6. We thank the reviewers and Dennis Boos for helpful suggestions and Len Stefanski for pointing us to work on smoothed likelihoods.
Footnotes
Competing interest statement: L.W. and R.T. are coauthors on a manuscript written in 2015 and published in 2018.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1922664117/-/DCSupplemental.
References
1. Drton M., Likelihood ratio tests and singularities. Ann. Stat. 37, 979–1012 (2009).
2. Redner R., Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Ann. Stat. 9, 225–228 (1981).
3. Dacunha-Castelle D., Gassiat E., Testing in locally conic models, and application to mixture models. ESAIM Probab. Stat. 1, 285–317 (1997).
4. Chen J., Li P., Hypothesis test for normal mixture models: The EM approach. Ann. Stat. 37, 2523–2542 (2009).
5. McLachlan G. J., On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Appl. Stat. 36, 318–324 (1987).
6. Chakravarti P., Balakrishnan S., Wasserman L., Gaussian mixture clustering using relative tests of fit. arXiv:1910.02566 (7 October 2019).
7. Van der Vaart A. W., Asymptotic Statistics (Cambridge University Press, 2000), vol. 3.
8. Wong W. H., Shen X., Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann. Stat. 23, 339–362 (1995).
9. Balakrishnan S., Wainwright M. J., Yu B., Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Stat. 45, 77–120 (2017).
10. Xu J., Hsu D. J., Maleki A., "Global analysis of expectation maximization for mixtures of two Gaussians" in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2016), vol. 29, pp. 2676–2684.
11. Jin C., Zhang Y., Balakrishnan S., Wainwright M. J., Jordan M. I., "Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences" in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2016), vol. 29, pp. 4116–4124.
12. Shen X., Wong W. H., Convergence rate of sieve estimates. Ann. Stat. 22, 580–615 (1994).
13. Cule M., Samworth R., Stewart M., Maximum likelihood estimation of a multi-dimensional log-concave density. J. R. Stat. Soc. B 72, 545–607 (2010).
14. Axelrod B., Diakonikolas I., Stewart A., Sidiropoulos A., Valiant G., "A polynomial time algorithm for log-concave maximum likelihood via locally exponential families" in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2019), vol. 32, pp. 7723–7735.
15. Silverman B. W., Density Estimation for Statistics and Data Analysis (Routledge, 2018).
16. Karlin S., Rinott Y., Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. J. Multivariate Anal. 10, 467–498 (1980).
17. Lauritzen S., Uhler C., Zwiernik P., Maximum likelihood estimation in Gaussian models under total positivity. Ann. Stat. 47, 1835–1863 (2019).
18. Lauritzen S., Uhler C., Zwiernik P., Total positivity in structured binary distributions. arXiv:1905.00516 (1 May 2019).
19. Robeva E., Sturmfels B., Tran N., Uhler C., Maximum likelihood estimation for totally positive log-concave densities. arXiv:1806.10120 (26 June 2018).
20. Guo F., Richardson T. S., On testing marginal versus conditional independence. arXiv:1906.01850 (5 June 2019).
21. Meinshausen N., Meier L., Bühlmann P., P-values for high-dimensional regression. J. Am. Stat. Assoc. 104, 1671–1681 (2009).
22. Amini A. A., Levina E., On semidefinite relaxations for the block model. Ann. Stat. 46, 149–179 (2018).
23. Dahl J., Fleury B. H., Vandenberghe L., "Approximate maximum-likelihood estimation using semidefinite programming" in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE, 2003), vol. 6, pp. VI-721.
24. Bandeira A. S., Boumal N., Singer A., Tightness of the maximum likelihood semidefinite relaxation for angular synchronization. Math. Program. 163, 145–167 (2017).
25. Royall R., Tsou T. S., Interpreting statistical evidence by using imperfect models: Robust adjusted likelihood functions. J. R. Stat. Soc. B 65, 391–404 (2003).
26. Grünwald P., "The safe Bayesian" in International Conference on Algorithmic Learning Theory (Springer, Berlin, Germany, 2012), pp. 169–183.
27. Holmes C., Walker S., Assigning a value to a power likelihood in a general Bayesian model. Biometrika 104, 497–503 (2017).
28. Grünwald P., Van Ommen T., Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 12, 1069–1103 (2017).
29. Miller J. W., Dunson D. B., Robust Bayesian inference via coarsening. J. Am. Stat. Assoc. 114, 1113–1125 (2019).
30. Seo B., Lindsay B. G., A universally consistent modification of maximum likelihood. Stat. Sin. 23, 467–487 (2013).
31. Li Q. J., "Estimation of mixture models," PhD thesis, Yale University, New Haven, CT (1999).
32. Hoff P. D., Nonparametric estimation of convex models via mixtures. Ann. Stat. 31, 174–200 (2003).
33. Brunk H., Franck W., Hanson D., Hogg R., Maximum likelihood estimation of the distributions of two stochastically ordered random variables. J. Am. Stat. Assoc. 61, 1067–1080 (1966).
34. Dykstra R. L., Feltz C. J., Nonparametric maximum likelihood estimation of survival functions with a general stochastic ordering and its dual. Biometrika 76, 331–341 (1989).
35. Grunwald P., de Heide R., Koolen W., Safe testing. arXiv:1906.07801 (18 June 2019).
36. Wald A., Sequential tests of statistical hypotheses. Ann. Math. Stat. 16, 117–186 (1945).
37. Darling D., Robbins H., Confidence sequences for mean, variance, and median. Proc. Natl. Acad. Sci. U.S.A. 58, 66–68 (1967).
38. Ville J., Étude Critique de la Notion de Collectif (Gauthier-Villars, Paris, France, 1939).
39. Howard S. R., Ramdas A., McAuliffe J., Sekhon J., Time-uniform Chernoff bounds via nonnegative supermartingales. Probab. Surv. 17, 257–317 (2020).
40. Howard S. R., Ramdas A., McAuliffe J., Sekhon J., Uniform, nonparametric, non-asymptotic confidence sequences. arXiv:1810.08240 (18 October 2018).
41. Howard S. R., Ramdas A., Sequential estimation of quantiles with applications to A/B-testing and best-arm identification. arXiv:1906.09712 (24 June 2019).
42. Johari R., Koomen P., Pekelis L., Walsh D., Peeking at A/B Tests: Why It Matters, and What to Do about It (ACM Press, 2017), pp. 1517–1525.
43. Shafer G., Shen A., Vereshchagin N., Vovk V., Test martingales, Bayes factors and p-values. Stat. Sci. 26, 84–101 (2011).
44. Robbins H., Statistical methods related to the law of the iterated logarithm. Ann. Math. Stat. 41, 1397–1409 (1970).
45. Robbins H., Siegmund D., "A class of stopping rules for testing parametric hypotheses" in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, CA, 1970–1972), vol. 4, pp. 37–41 (1972).
46. Lai T. L., On confidence sequences. Ann. Stat. 4, 265–280 (1976).
47. Lai T. L., Boundary crossing probabilities for sample sums and confidence sequences. Ann. Probab. 4, 299–312 (1976).
48. Wald A., Sequential Analysis (Courier Corporation, 1947).
49. Robbins H., Siegmund D., The expected sample size of some tests of power one. Ann. Stat. 2, 415–436 (1974).
50. Lorden G., Pollak M., Nonanticipating estimation applied to sequential analysis and changepoint detection. Ann. Stat. 33, 1422–1454 (2005).
51. Vexler A., Martingale type statistics applied to change points detection. Commun. Stat. Theor. Methods 37, 1207–1224 (2008).