Biometrika. 2012 Feb 7;99(1):57–69. doi: 10.1093/biomet/asr079

Conservative hypothesis tests and confidence intervals using importance sampling

Matthew T Harrison 1

Summary

Importance sampling is a common technique for Monte Carlo approximation, including that of p-values. Here it is shown that a simple correction of the usual importance sampling p-values provides valid p-values, meaning that a hypothesis test created by rejecting the null hypothesis when the p-value is at most α will also have a Type I error rate of at most α. This correction uses the importance weight of the original observation, which gives valuable diagnostic information under the null hypothesis. Using the corrected p-values can be crucial for multiple testing and also in problems where evaluating the accuracy of importance sampling approximations is difficult. Inverting the corrected p-values provides a useful way to create Monte Carlo confidence intervals that maintain the nominal significance level and use only a single Monte Carlo sample.

Some key words: Exact inference, Monte Carlo simulation, Multiple testing, p-value, Rasch model

1. Introduction

Importance sampling is a common technique for Monte Carlo approximation. Besides its use in situations where efficient direct-sampling algorithms are unavailable, it can be used to accelerate the approximation of small p-values as needed for multiple hypothesis testing. Importance sampling can also be used for Monte Carlo approximation of confidence intervals using a single Monte Carlo sample by inverting a family of hypothesis tests. The practicality of these two uses of importance sampling, accelerated multiple testing and Monte Carlo approximation of confidence intervals, would seem to be limited by the fact that each places strong requirements on the importance sampling procedure. Multiple testing controls can be sensitive to small absolute errors but large relative errors in small p-values, which would seem to demand either excessively large Monte Carlo sample sizes or unrealistically accurate importance sampling proposal distributions in order to reduce absolute errors to tolerable levels. The confidence interval procedure uses a single proposal distribution to approximate probabilities under a large family of target distributions, which again would seem to demand large sample sizes in order to overcome the high variability that is frequently encountered when the proposal distribution is not tailored to a specific target distribution. Nevertheless, as shown here, simple corrections of the usual p-value approximations can be used to overcome these difficulties, making importance sampling a practical choice for accelerated multiple testing and for constructing Monte Carlo confidence intervals.

The p-value corrections that we introduce require negligible additional computation and still converge to the target p-value, but are valid p-values, meaning the probability of rejecting a null hypothesis does not exceed α for any specified level α. Hypothesis tests and confidence intervals constructed from the corrected p-value approximations are guaranteed to be conservative, regardless of the Monte Carlo sample size, while still behaving like the target tests and intervals for sufficiently large Monte Carlo sample sizes. The combination of being both conservative and consistent turns out to be crucial in many applications where the importance sampling variability cannot be adequately controlled with practical amounts of computing, including multiple testing, confidence intervals and any hypothesis testing situation where the true variability of the importance sampling algorithm is either large or unknown.

Let X denote the observed data. Assume that the null hypothesis specifies a known probability distribution P for X. For a specified test statistic t(X), the goal is to compute the p-value p(X) defined by

p(x) = \mathrm{pr}\{t(X) \ge t(x)\}, \qquad (1)

where the probability is computed under the null hypothesis. Importance sampling can be used to approximate p(X) if the p-value cannot be determined analytically. Let Y = (Y1, …, Yn) be a random sample from the proposal distribution Q whose support includes the support of the target distribution P. Then p(X) can be approximated with either of

p_1(X, Y) = \frac{1}{n} \sum_{i=1}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i) \ge t(X)\}, \qquad
p_2(X, Y) = \frac{\sum_{i=1}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i) \ge t(X)\}}{\sum_{j=1}^{n} w(Y_j)},

where 𝟙 is the indicator function, and the importance weights are defined to be the ratio of probability mass functions in the discrete case, i.e., w(x) = P(x)/Q(x), and the ratio of densities in the continuous case. Each pc is a consistent approximation of p(X) as the Monte Carlo sample size n increases. The approximation p1 is unbiased and is especially useful for accelerating the approximation of extremely small p-values. The approximation p2 can be evaluated even if the importance weights are known only up to a constant of proportionality. The reader is referred to Liu (2001) for details and references concerning importance sampling and to Lehmann & Romano (2005) for hypothesis testing.
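For concreteness, the following minimal Python sketch computes p1 and p2 from a Monte Carlo sample; the arrays are assumed to hold the simulated test statistics t(Y1), …, t(Yn) and the importance weights w(Y1), …, w(Yn), however they were produced, and the code is only an illustration of the two formulas, not the software used in this paper.

import numpy as np

def uncorrected_pvalues(t_obs, t_sim, w_sim):
    # t_obs: observed statistic t(X); t_sim, w_sim: t(Y_i) and w(Y_i) for i = 1, ..., n
    t_sim = np.asarray(t_sim, dtype=float)
    w_sim = np.asarray(w_sim, dtype=float)
    hits = w_sim * (t_sim >= t_obs)                  # w(Y_i) 1{t(Y_i) >= t(X)}
    p1 = hits.sum() / len(t_sim)                     # unbiased approximation p_1
    wsum = w_sim.sum()
    p2 = hits.sum() / wsum if wsum > 0 else 0.0      # self-normalized approximation p_2
    return p1, p2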

Here we propose the following simple corrections of pc that use the importance weight of the original observation, namely

\tilde{p}_1(X, Y) = \frac{w(X) + \sum_{i=1}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i) \ge t(X)\}}{1 + n}, \qquad
\tilde{p}_2(X, Y) = \frac{w(X) + \sum_{i=1}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i) \ge t(X)\}}{w(X) + \sum_{j=1}^{n} w(Y_j)}. \qquad (2)

We show that the corrected p-value approximations, while clearly still consistent approximations of the target p-value p(X), are also themselves valid, meaning

\mathrm{pr}\{\tilde{p}_c(X, Y) \le \alpha\} \le \alpha \qquad (0 \le \alpha \le 1;\ n = 0, 1, \ldots;\ c = 1, 2),

under the null hypothesis, where the probability is with respect to the joint distribution of data and Monte Carlo sample.
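In code, the correction simply treats the observed data point as one extra draw whose weight is w(X); the sketch below, a companion to the function above, is illustrative only and not the paper's software.

import numpy as np

def corrected_pvalues(t_obs, w_obs, t_sim, w_sim):
    # w_obs: importance weight w(X) of the original observation
    t_sim = np.asarray(t_sim, dtype=float)
    w_sim = np.asarray(w_sim, dtype=float)
    num = w_obs + (w_sim * (t_sim >= t_obs)).sum()   # observed point always counts, since t(X) >= t(X)
    p1_tilde = num / (1 + len(t_sim))                # corrected p~_1 in (2)
    p2_tilde = num / (w_obs + w_sim.sum())           # corrected p~_2 in (2)
    return p1_tilde, p2_tilde

# With Q = P and w identically 1, p1_tilde reduces to (1 + #{t(Y_i) >= t(X)}) / (1 + n),
# the familiar direct-sampling Monte Carlo p-value.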

For the special case of direct sampling, i.e., Q = P and w ≡ 1, the validity of p̃c(X, Y) = (1 + n)^{-1}[1 + ∑_{i=1}^{n} 𝟙{t(Yi) ⩾ t(X)}] is well known and is often recommended over pc for use as a p-value (Davison & Hinkley, 1997, Ch. 4). This paper was motivated by trying to extend the special case of direct sampling to the more general case of importance sampling. For Markov chain Monte Carlo approaches, Besag & Clifford (1989) demonstrate how to generate valid p-value approximations using techniques that are unrelated to the ones here. It would be interesting to know whether the two approaches could be combined so that the importance samples could also be generated by a Markov chain.

2. Main results

The main result is that each p̃c is a valid p-value. We generalize the introductory discussion in two directions. First, we allow arbitrary distributions, so the importance weights become the Radon–Nikodym derivative dP/dQ. In the discrete case, this simplifies to the ratio of probability mass functions and, in the continuous case, to the ratio of probability density functions. Secondly, we allow the choice of test statistic to depend on (X, Y1, …, Yn) as long as the choice is invariant to permutations of (X, Y1, …, Yn). We express this mathematically by writing the test statistic, t, as a function of two arguments: the first is the same as before, but the second argument takes the entire sequence (X, Y1, …, Yn), although we require that t is invariant to permutations in the second argument. For example, we may want to transform the sequence (X, Y1, …, Yn) in some way, either before or after applying a test statistic to the individual entries. As long as the transformation procedure is permutation invariant, such as centring and scaling, or converting to ranks, the main result still holds. Transformations are often desirable in multiple testing contexts for improving balance (Westfall & Young, 1993).

We begin with the precise notation and assumptions for the main result. Let P and Q denote probability distributions defined on the same measurable space (S, 𝒮) with P ≪ Q, meaning that sets with positive P probability also have positive Q probability, and let w denote a fixed, nonnegative version of the Radon–Nikodym derivative dP/dQ. The set 𝒨 denotes all (n + 1)! permutations π = (π0, …, πn) of (0, …, n). For π ∊ 𝒨 and z = (z0, …, zn), we define z(π) = (zπ0, …, zπn). Assume that t : S × Sn+1 ↦ [−∞, ∞] is a measurable function satisfying t(a, z) = t(a, z(π)) for all a ∊ S, z ∊ Sn+1 and π ∊ 𝒨. For z ∊ Sn+1, define

\tilde{p}_1(z) = \frac{\sum_{i=0}^{n} w(z_i)\, \mathbb{1}\{t(z_i, z) \ge t(z_0, z)\}}{n + 1}, \qquad
\tilde{p}_2(z) = \frac{\sum_{i=0}^{n} w(z_i)\, \mathbb{1}\{t(z_i, z) \ge t(z_0, z)\}}{\sum_{j=0}^{n} w(z_j)}, \qquad (3)

where we use the convention 0/0 = 0. Let X have distribution P and Y0, Y1, …, Yn be independent and identically distributed with common distribution Q, independent of X. For notational convenience, define Z = (Z0, Z1, …, Zn) by Z0 = X, Z1 = Y1, …, Zn = Yn, so that the corrected p-values defined in (2) can be written as p̃c(Z) from (3). Then, we obtain the following theorem.

Theorem 1. For each c = 1, 2, we have pr{p̃c(Z) ⩽ α} ⩽ α for all α ∊ [0, 1].

Proof. See the Appendix.

The theorem does not require any special relationships among P, Q, t or n. For example, in parametric settings, Q does not need to be in the same model class as the null and/or alternative hypotheses. Validity of the corrected p-values is ensured even for unusual cases such as n = 0 or importance weights with infinite variance. We discuss some practical considerations for choosing Q in the next section; presumably, P, t and n are dictated by the application.

The theorem continues to hold if each Yk is chosen from a different Q, say Qk, as long as the sequence Q0, Q1, …, Qn is itself independent and identically distributed from some distribution over proposal distributions. The kth importance weight is now a fixed version of the Radon–Nikodym derivative dP/dQk. This generalization can be useful in practice when each Yk is generated hierarchically by first choosing an auxiliary variable Vk and then, given Vk, choosing Yk from some distribution QVk that depends on Vk. If the marginal distribution Q of Yk is not available, one can use importance weights based on the conditional distribution QVk of Yk given Vk. This is equivalent to using random proposal distributions as described above. The drawback of combining this approach with the p-value corrections is that the importance weight of the original observation is evaluated using dP/dQ0, where Q0 is a randomly chosen proposal distribution. This further increases the Monte Carlo variation inherent in the p-value approximations. Generalizing the theorem to allow for random proposals requires no additional work. If ν is the joint distribution of each (Qk, Yk) and νQ is the marginal distribution of each Qk, simply apply the theorem with νQ × P in place of P, ν in place of Q, (Q0, X) in place of X and (Qk, Yk) in place of Yk. If one uses a regular conditional distribution to define the importance weight {d(νQ × P)/dν}(Qk, x), then it will simplify to a version of (dP/dQk)(x).

The theorem concerns p-values for the simple hypothesis test that X ∼ P, but valid p-values for simple tests can be used to generate valid p-values for composite tests by maximizing over all distributions in the null hypothesis. Consider a collection of distributions, {Pθ : θ ∊ Θ}, corresponding test statistics, {tθ : θ ∊ Θ}, and corresponding importance weights, wθ = dPθ/dQ, where the proposal distribution, Q, satisfies Pθ ≪ Q for all θ ∊ Θ. For each θ, define the corrected p-value approximations p̃c(θ, z) as in (3) using tθ and wθ. The theorem shows that each p̃c(θ0, Z) is a valid p-value for the simple null hypothesis that X ∼ Pθ for θ = θ0. An immediate corollary is that supθ∊Θ0 p̃c(θ, Z) is a valid p-value for the composite null hypothesis that X ∼ Pθ for some θ ∊ Θ0, where Θ0 ⊂ Θ. Only a single Monte Carlo sample is required.

Finally, we recall the well-known fact that any valid family of p-values can be inverted to give valid confidence intervals. For α ∊ [0, 1] and c = 1, 2, define the random set

\tilde{C}_c^{\alpha}(Z) = \{\theta \in \Theta : \tilde{p}_c(\theta, Z) > \alpha\}. \qquad (4)

Then C̃cα is a 1 − α confidence set for θ because prθ{C̃cα(Z) ∋ θ} = 1 − prθ{p̃c(θ, Z) ⩽ α} ⩾ 1 − α, where prθ is calculated under the null hypothesis of Pθ. Only a single Monte Carlo sample is required. Section 4.2 illustrates this application. The idea of using importance sampling to construct confidence intervals from a single Monte Carlo sample was pointed out in Green (1992). See Bolviken & Skovlund (1996) and Garthwaite & Buckland (1992) for examples of other ways to create Monte Carlo confidence intervals.
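As a sketch of this inversion, suppose a user-supplied function p_tilde(theta) returns p̃c(θ, Z) computed from the single Monte Carlo sample Z; then the confidence set (4) and the composite-null p-value of § 2 can both be read off a grid of θ values. The callback and the grid are hypothetical placeholders.

import numpy as np

def confidence_set(theta_grid, p_tilde, alpha=0.05):
    # keep the grid points whose corrected p-value exceeds alpha, as in (4)
    pvals = np.array([p_tilde(th) for th in theta_grid])
    return np.asarray(theta_grid)[pvals > alpha]

def composite_pvalue(theta_null_grid, p_tilde):
    # valid p-value for the composite null {P_theta : theta in Theta_0},
    # approximated by maximizing over a grid covering Theta_0
    return max(p_tilde(th) for th in theta_null_grid)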

3. Practical considerations

Validity of the corrected p-values holds for any n, but validity alone is not a useful property. The p-value approximations must also be close to the target p-value in order to have the appropriate power characteristics. Convergence to the target p-value typically requires large n, especially for importance sampling algorithms with high variance. Furthermore, for the special case of direct sampling, i.e., Q = P, excessively small n can adversely affect power (Davison & Hinkley, 1997, § 4.2.5). Presumably, a similar effect holds for importance sampling, although the effect is more nuanced, because the power characteristics of the p-value approximations can be strongly influenced by the choice of Q for small n, and the resulting hypothesis tests need not be uniformly less powerful than the target test.

It is clear that each of the corrected p-values defined in (2) has the same asymptotic behaviour as its uncorrected counterpart, so for sufficiently large n they are essentially equivalent. But, in practice, n will often be too small to ensure this equivalence. The concern for the investigator is that the corrected p-values may behave much worse than the uncorrected ones for practical choices of n. In particular, power suffers when the null hypothesis is false, the target p-value p is small, the uncorrected approximation pc is also small, but the corrected approximation p̃c is large. Careful choice of Q can lower the chances of this.

Each of the corrected p-values can be expressed as an interpolation between their respective uncorrected versions and another valid p-value:

\tilde{p}_1(z) = \frac{1}{n+1}\, w(x) + \Bigl(1 - \frac{1}{n+1}\Bigr) p_1(z), \qquad (5)
\tilde{p}_2(z) = \frac{w(x)}{w(x) + \sum_{j=1}^{n} w(y_j)} + \Bigl\{1 - \frac{w(x)}{w(x) + \sum_{j=1}^{n} w(y_j)}\Bigr\} p_2(z), \qquad (6)

where we write z = (x, y1, …, yn). In both cases, power suffers when X comes from a distribution in the alternative hypothesis, and Q is such that w(X) is typically large relative to w(Y1), …, w(Yn). Since w(Yk) always has mean 1, problems might arise for alternative distributions that tend to make w(X) much larger than 1. This problem can be avoided by choosing Q so that it gives more weight than does P to regions of the sample space that are more typical under alternative distributions than they are under P. The problem can also be avoided by choosing Q to be similar to P so that the weights are close to 1 throughout the sample space. Most proposal distributions are designed with one of these two goals in mind, so the corrected p-values should behave well for well-designed proposal distributions. In practice, however, proposal distributions can be quite bad, and it can be helpful to look more closely at how the proposal affects the power of the corrected p-values when they have not yet converged to the target p-value.

From (5) we see that p̃1 is an interpolation between p1 and w(x), the latter being a valid p-value. Validity of w(X) can be seen either by Theorem 1 for c = 1, n = 0 or by the following calculation

\mathrm{pr}\{w(X) \le \alpha\} = \sum_{x} \mathbb{1}\{P(x) \le \alpha Q(x)\}\, P(x) \le \sum_{x} \alpha Q(x) = \alpha \qquad (7)

for discrete settings that easily generalizes to more abstract settings. Testing w(X) = P(X)/Q(X) ⩽ α is simply a likelihood ratio test of P versus Q, although using the critical value α will give a test of size smaller than α. The test can be strongly conservative, because the bound in (7) can be far from tight. So, p̃1 is an interpolation between a conservative likelihood ratio test of P versus Q and an importance sampling Monte Carlo approximation of the target p-value. If the effect of w(X) on p̃1 has not disappeared, such as for small n, then p̃1 is a p-value for the original null hypothesis, P, versus a new alternative hypothesis that is some amalgamation of the original alternative hypothesis and the proposal distribution Q. If Q is not in the alternative hypothesis, as it often will not be, this modified alternative hypothesis is important to keep in mind when interpreting any rejections.

The effect of Q on interpreting rejections is especially important in multiple testing situations. In these settings, importance sampling is often used to accelerate the approximation of tiny p-values via p1. If p1 ≈ 0, however, then p̃1 ≈ w(x)/(n + 1) and Q plays a critical role in determining which null hypotheses are rejected after correcting for multiple comparisons. The investigator can take advantage of this by ensuring that a rejection of P in favour of Q is sensible for the problems at hand, and by ensuring that events with small p-values will be heavily weighted by Q.

Turning to (6), we see that p̃2 is an interpolation between 1 and p2. Since p2 ⩽ 1, we always have p̃2 ⩾ p2, so p̃2 leads to a more conservative test than p2. Its conservatism is controlled by the ratio w(x)/∑j w(yj). If the ratio is small, then the correction ensures validity with little loss of power. If it is large, there may be a substantial loss of power when using the correction. The ratio will be approximately 1/n if P and Q are similar, which is often the design criterion when using p2.

An important caveat for the main results is that Q is not allowed to depend on the observed value of X. This caveat precludes one of the classical uses of importance sampling, namely using the observed value of t(X) to design a Q that heavily weights the event {x : t(x) ⩾ t(X)}. In many cases, however, since the functional form of t is known, one can a priori design a Q that will heavily weight the event {x : t(x) ⩾ t(X)} whenever the p-value would be small. For example, given a family of proposal distributions {Qℓ}, each of which might be useful for a limited range of observed values of t(X), one can use finite mixture distributions of the form Q = ∑_{ℓ=1}^{L} λℓ Qℓ to obtain more robust performance. For similar reasons, mixture distributions are also useful for creating Monte Carlo confidence intervals in which the same proposal distribution is used for a family of target distributions. The example in § 4.2 has more details, and Hesterberg (1995) contains a more general discussion of the utility of mixture distributions for importance sampling. This caveat about Q does not preclude conditional inference, in which case P is the null conditional distribution of X given some appropriate A(X), and all of the analysis takes place after conditioning on A(X). The choice of Q and t can thus depend on A(X), but not on additional details about X.
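A minimal sketch of such a defensive mixture proposal follows; the component samplers, component densities and target density are supplied as hypothetical callbacks, and the weight of each draw is evaluated against the full mixture density.

import numpy as np

rng = np.random.default_rng(0)

def sample_from_mixture(samplers, pdfs, target_pdf, lam, n):
    # samplers[l]() draws from Q_l; pdfs[l](y) evaluates the Q_l density;
    # lam are the mixture weights lambda_l, assumed to sum to 1
    lam = np.asarray(lam, dtype=float)
    draws, weights = [], []
    for _ in range(n):
        l = rng.choice(len(lam), p=lam)                        # pick a component
        y = samplers[l]()
        q = sum(lam[j] * pdfs[j](y) for j in range(len(lam)))  # mixture density Q(y)
        draws.append(y)
        weights.append(target_pdf(y) / q)                      # w(y) = P(y)/Q(y)
    return draws, np.array(weights)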

4. Applications

4.1. Accelerating multiple permutation tests

Consider a collection of N datasets. The ith dataset Xi = (Vi, Li) contains a sample of values Vi = (Vi_1, …, Vi_{m_i}) and corresponding labels Li = (Li_1, …, Li_{m_i}). The distributions over values and labels are unknown and perhaps unrelated across datasets. We are interested in identifying which datasets show dependence between the values and the labels. From a multiple hypothesis testing perspective, the ith null hypothesis is that Vi and Li are independent, or more generally, that the values of Li are exchangeable conditioned on the values of Vi. Given a test statistic t(x) = t(υ, ℓ), a permutation test p-value for the ith null hypothesis is

p(X^i) = (m_i!)^{-1} \sum_{\pi} \mathbb{1}\bigl[t\{V^i, (L^i)(\pi)\} \ge t(V^i, L^i)\bigr], \qquad (8)

where the sum is over all permutations π = (π1, …, πmi) of (1, …, mi), and where the notation ℓ(π) = (ℓπ1, …, ℓπmi) denotes a permuted version of the elements of ℓ using the permutation π. Inference is conditioned on the values and the labels, but not their pairings. The p-value in (8) agrees with (1) by taking the ith null distribution, Pi, to be uniform over permutations of the labels. The Bonferroni correction can be used to adjust for multiple tests (Lehmann & Romano, 2005). In particular, rejecting null hypothesis i whenever p(Xi) ⩽ α/N ensures that the probability of even a single false rejection is no more than α. Although computing p exactly is often prohibitive, Monte Carlo samples from Pi are readily available for approximating p, and importance sampling can be used to accelerate the approximation of tiny p-values. The Bonferroni correction is still sensible, provided the Monte Carlo approximate p-values are valid p-values, which is ensured by using the corrected p-value approximations.

Consider N = 10^4 independent datasets, each with mi = 100 scalar values and corresponding binary labels. In each case, the 100 values are mutually independent and there are 40 labels of 1 and 60 labels of 0. In 9990 datasets, the values have a Cauchy distribution with median 0 and scale 1 irrespective of the labels, i.e., the null hypothesis is true. In 10 of the datasets, the value associated with label ℓ ∈ {0, 1} has a Cauchy distribution with median 2ℓ and scale 1, i.e., there is a relationship between values and labels, and the null hypothesis is false. This example was chosen because standard tests like the two-sample t-test and the Wilcoxon rank-sum test perform poorly. For the alternative hypothesis where label 1 tends to have larger values than label 0, a sensible test statistic is the difference in medians between values associated to label 1 and values associated to label 0, namely t(x) = t(υ, ℓ) = median(υj : ℓj = 1) − median(υj : ℓj = 0).
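The following sketch reproduces one alternative-hypothesis dataset of this design and computes the direct-sampling corrected p-value p̃P with the median-difference statistic; the importance sampling versions p1 and p̃1 additionally use the tilted proposal described in the Appendix. The code is illustrative only.

import numpy as np

rng = np.random.default_rng(1)

def median_diff(values, labels):
    # t(v, l) = median(v_j : l_j = 1) - median(v_j : l_j = 0)
    return np.median(values[labels == 1]) - np.median(values[labels == 0])

def p_tilde_direct(values, labels, n=999):
    # corrected p-value for direct sampling (Q = P, w = 1): uniform relabelling
    t_obs = median_diff(values, labels)
    exceed = sum(median_diff(values, rng.permutation(labels)) >= t_obs for _ in range(n))
    return (1 + exceed) / (1 + n)

labels = np.repeat([1, 0], [40, 60])
values = 2.0 * labels + rng.standard_cauchy(100)   # Cauchy values, medians 2 and 0, scale 1
print(p_tilde_direct(values, labels))              # compare with alpha / N = 0.05 / 10**4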

Table 1 shows the α = 0.05 Bonferroni-corrected performance of several different Monte Carlo approximations as n increases. The approximations pP and p̃P refer to p1 and p̃1, respectively, for the special case of Q = P and w ≡ 1, i.e., direct sampling from the uniform distribution over permutations. Using either pP or p̃P would be the standard method of approximating p. The approximations p1 and p̃1 are based on a proposal distribution that prefers permutations that pair large values with label 1, as described in the Appendix. In each case, we reject when the approximate p-value does not exceed α/N. The final approximation q1 refers to a Wald 1 − α/(2N) upper confidence limit for p based on p1, using the same importance samples to approximate a standard error. For q1, we reject whenever q1 ⩽ α/(2N), which, if the confidence limits were valid, would provide the same guarantees as the Bonferroni correction.

Table 1.

Bonferroni-corrected performance versus Monte Carlo sample size for the simulation in § 4.1

           No. of correct rejections (out of 10)      No. of incorrect rejections (out of 9990)
  n        10^1    10^2    10^3    10^4               10^1    10^2    10^3    10^4
  pP         10      10      10      10                908      87      12       1
  p̃P          0       0       0       0                  0       0       0       0
  p1          9       8       7       5               7579    1903       2       0
  p̃1          7       6       6       5                  0       0       0       0
  q1          6       5       4       3               4545     585       0       0

Only p̃1 works well. The approximations that are not guaranteed to be valid p-values, namely pP, p1 and q1, require excessively large n before the false detection rate drops to the Bonferroni target of zero errors. The cases n = 10^3 and n = 10^4 are computationally burdensome, and the situation only worsens when N increases. Furthermore, in a real problem, the investigator has no way of determining that n is large enough. The confidence limit procedure q1 similarly fails because the standard error approximations are inaccurate for small n. The valid p-values, namely p̃P and p̃1, ensure that the Bonferroni correction works regardless of n, but p̃P ⩾ 1/(n + 1), so it is not useful after a Bonferroni correction unless n is extremely large. The good performance of p̃1 requires the combination of importance sampling, which allows tiny p-values to be approximated with small n, and the validity correction introduced in this paper, which allows multiple testing adjustments to work properly.

4.2. Exact inference for covariate effects in Rasch models

Let X = (Xij : i = 1, …, M; j = 1, …, N) be a binary M × N matrix. Consider the following logistic regression model for X. The Xij are independent Bernoulli (pij), where

\log \frac{p_{ij}}{1 - p_{ij}} = \kappa + \alpha_i + \beta_j + \theta \upsilon_{ij}, \qquad (9)

for unknown coefficients κ, α = (α1, …, αM), β = (β1, …, βN) and θ, and known covariates υ = (υij : i = 1, …, M; j = 1, …, N). The special case θ = 0 is called the Rasch model and is commonly used to model the response of M subjects to N binary questions (Rasch, 1961). In this example, we discuss inference about θ when κ, α, β are treated as nuisance parameters.

Consider first the problem of testing the null hypothesis of θ = θ′ versus the alternative hypothesis of θ ≠ θ′. Conditioning on the row and column sums, say r = (r1, …, rM) and c = (c1, …, cN), removes the nuisance parameters (Cox, 1958; Mehta & Patel, 1995), and the original composite null hypothesis reduces to the simple, conditional, null hypothesis of X ∼ Pθ′, where Pθ′ is the conditional distribution of X given the margins r, c for the model in (9) with θ = θ′, namely,

P_{\theta}(x) \propto \exp\Bigl(\theta \sum_{ij} x_{ij} \nu_{ij}\Bigr) \Bigl\{\prod_{i} \mathbb{1}\Bigl(\sum_{j} x_{ij} = r_i\Bigr)\Bigr\} \Bigl\{\prod_{j} \mathbb{1}\Bigl(\sum_{i} x_{ij} = c_j\Bigr)\Bigr\}. \qquad (10)

A sensible test statistic is the minimal sufficient statistic for θ under the conditional distribution Pθ defined in (10), namely t(x) = ∑ij xijυij. Since t (X) has power for detecting θ>θ′ and −t(X) has power for detecting θ < θ′, it is common to combine upper- and lower-tailed p-values into a single p-value, p±(θ′, X), defined by

p^{+}(\theta, x) = \mathrm{pr}_{\theta}\{t(X) \ge t(x)\}, \qquad p^{-}(\theta, x) = \mathrm{pr}_{\theta}\{t(X) \le t(x)\}, \qquad p^{\pm}(\theta, x) = \min[1,\, 2\min\{p^{+}(\theta, x),\, p^{-}(\theta, x)\}], \qquad (11)

where prθ uses X ∼ Pθ. There are no practical algorithms for computing p±(θ′, X) nor for direct sampling from Pθ′. We can, however, create a proposal distribution, say Qθ, for any Pθ, as described in the Appendix. Although Qθ(x) can be evaluated, Pθ(x) is known only up to a normalization constant, so we must use p2 and p̃2. To approximate p±, we use two Monte Carlo p-value approximations, one for each of t and −t, and combine them as in (11). Validity of the underlying p-values ensures validity of the combined p-value.
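A sketch of this combination for p̃2, assuming arrays of simulated statistics t(Yi) and possibly unnormalized weights, with the observed statistic and its unnormalized weight supplied separately; the function and argument names are illustrative.

import numpy as np

def p2_tilde_two_sided(t_obs, w_obs, t_sim, w_sim):
    # upper tail uses +t, lower tail uses -t; the weights may be unnormalized
    # because p~_2 is self-normalizing
    t_sim = np.asarray(t_sim, dtype=float)
    w_sim = np.asarray(w_sim, dtype=float)
    denom = w_obs + w_sim.sum()
    p_plus = (w_obs + (w_sim * (t_sim >= t_obs)).sum()) / denom
    p_minus = (w_obs + (w_sim * (t_sim <= t_obs)).sum()) / denom
    return min(1.0, 2.0 * min(p_plus, p_minus))    # combined p-value as in (11)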

Now consider the problem of creating confidence intervals for θ by inverting a family of Monte Carlo approximations of p± as in (4). Inference is still conditioned on the row and column sums, giving a unique conditional distribution Pθ for each θ. Unlike hypothesis testing, where the test of θ = θ′ can use a different proposal distribution for each θ′, the use of (4) requires a common proposal distribution for all θ. To create a common proposal distribution that might work well in practice, we use a mixture of proposals designed for specific θ. In particular, define Q = L^{-1} ∑_{ℓ=1}^{L} Qθℓ, where (θ1, …, θL) are fixed. In the example below, we use L = 601 and (θ1, …, θL) = (−6.00, −5.98, −5.96, …, 6.00). For any θ, the importance weights are

w(\theta, x) = \frac{P_{\theta}(x)}{Q(x)} = \frac{c_{\theta}\, \exp\{\theta t(x)\}}{Q(x)}

for any binary matrix x with the correct row and column sums, where cθ is an unknown constant that is not needed for computing p2 and p̃2. We will use p2+(θ, z) and p2−(θ, z) to denote p2(z) when computed using the importance weights w(θ, ·) and the test statistics +t and −t, respectively, where z = (x, y1, …, yn). We will use p2± = min{1, 2 min(p2+, p2−)} as in (11). Similarly, define p̃2+, p̃2− and p̃2± for p̃2. Inverting these p-values gives the confidence sets Cα(z) = {θ ∊ ℝ : p2±(θ, z) > α} and C̃α(z) = {θ ∊ ℝ : p̃2±(θ, z) > α}. For fixed z, the approximate p-values are well-behaved functions of θ, so it is straightforward to numerically approximate the confidence sets, which will typically be intervals. From Theorem 1 and the discussion in § 2, we see that C̃α maintains the nominal significance level, whereas Cα might not.

Table 2 describes the results of a simulation experiment that investigated the coverage properties of Cα and C̃α. In that experiment, we took M = 200, N = 10, and fixed the parameters κ, θ, α, β and the covariates υ. We set θ = 2 and generated 1000 datasets and importance samples in order to approximate the true coverage probabilities and the median interval lengths of various confidence intervals. Confidence intervals based on the corrected p-values maintain a coverage probability of at least 1 − α without becoming excessively long, while those based on uncorrected p-values can have much lower coverage probabilities. As the number of importance samples increases, the two confidence intervals begin to agree. They also roughly agree with the confidence intervals given by the approximate conditional inference methods reviewed in Brazzale & Davison (2008) and implemented in the cond R package (Brazzale, 2005), which had coverage probabilities for this simulation in the range of 94.5–94.7% and median lengths of 1.15–1.16, depending on the method of approximation. The conservatism of exact inference for discrete problems (Brazzale & Davison, 2008) is not an issue here, because the typical dataset in this simulation gives a conditional distribution of t(X) with ∼ 10^200 distinct values; for all practical purposes, the distribution is continuous.

Table 2.

Performance of 95% confidence intervals versus Monte Carlo sample size for the simulation in § 4.2

           % coverage probability                     Median length
  n        10      50      100     500               10      50      100     500
  Cα     27.8    77.7    85.7    93.3              0.34    0.96    1.08    1.16
  C̃α     99.6    98.1    96.9    95.2              2.09    1.52    1.40    1.22

Comparisons with other exact methods are not available, because existing enumeration software (StataCorp, 2009; Cytel, 2010) is unable to enumerate the state space or generate direct samples with a few gigabytes of memory, and the corresponding Markov chain Monte Carlo methods (Zamar et al., 2007) often fail to adequately sample the state space, even after many hours of computing. Classical, unconditional, asymptotic confidence intervals also behave poorly, because of the large number of nuisance parameters. For this simulation, 95% Wald intervals had only 82.2% coverage probability with a median length of 1.24, emphasizing the utility of conditional inference.

5. Discussion

The practical benefits of using p̃c over pc are clear. Valid p-values are always crucial for multiple testing adjustments. But even for individual tests, p̃c protects against false rejections resulting from high and possibly undiagnosed variability in the importance sampling approximations. There is almost no computational penalty for using the corrections, and little or no loss of power for well-behaved importance sampling algorithms. These advantages extend to confidence intervals constructed by inverting p̃c.

The corrections are designed to improve the interpretability of tests and intervals, but they are not designed to improve the accuracy of the p-value approximation. If one is interested in approximating the probability of the event in (1), but this probability will not be interpreted as a p-value, then pc may be preferable to p̃c. The corrections can play a diagnostic role, since large differences between p̃c and pc indicate convergence failure for at least one of the approximations, but close agreement between pc and p̃c need not indicate convergence. All four p-value approximations can be in close agreement but still far from p.

In many cases, it may be sensible to report both p̃c and pc. Reporting p̃c allows for hypothesis tests that respect the nominal level and approximate the power characteristics of p. Reporting pc, along with further approximations of the Monte Carlo standard error, provides information about the numerical value of p. The poor performance of q1 in § 4.1 shows that pc cannot be meaningfully corrected by relying on approximate standard errors. To ensure the nominal level of tests and intervals, p̃c is required.

Acknowledgments

This work was supported in part by the U.S. National Science Foundation and the U.S. National Institutes of Health, partly while the author was at the Department of Statistics at Carnegie Mellon University. The author thanks Stuart Geman, Jeffrey Miller, Lee Newberg, an associate editor and a referee for helpful comments.

Appendix

Proposal distributions

The simulation example in § 4.1 uses conditional inference, so the target and proposal distributions for each dataset i can depend on the observed values V = (V1, …, Vm) and the fact that there are r one-labels and m − r zero-labels, but cannot depend on the observed pairing of labels and values. We suppress the dependence on i in the notation. Let J = (J1, …, Jm) be a permutation that makes V nonincreasing, i.e., VJ1 ⩾ ⋯ ⩾ VJm. Choose a random permutation Π = (Π1, …, Πm) according to the distribution

\mathrm{pr}(\Pi = \pi) = \frac{\exp\{\theta \sum_{i=1}^{r} \mathbb{1}(\pi_i \le r)\}}{r!\,(m-r)!\, \sum_{k=0}^{m} \binom{r}{k} \binom{m-r}{r-k} \exp(\theta k)} \qquad (\theta \in \mathbb{R};\ r = 0, \ldots, m), \qquad \binom{a}{b} = \frac{a!}{b!\,(a-b)!},

where the binomial coefficients are defined to be zero if a < 0, b < 0 or a < b. Leaving V in the original observed order and permuting L so that L_{JΠ1} = ⋯ = L_{JΠr} = 1 and L_{JΠr+1} = ⋯ = L_{JΠm} = 0 gives a random pairing of values with labels. The case θ = 0 is the uniform distribution over permutations, i.e., the target null conditional distribution. We used θ = 3 for the proposal distribution, which assigns higher probability to those permutations that tend to match the label one with larger values.
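One way to implement this proposal is to draw the number K of the first r positions of Π that hold a value at most r from its marginal distribution, and then place the labels uniformly at random given K; the importance weight P(π)/Q(π), with P uniform over permutations, then depends only on K. The sketch below follows this route; it is an illustration written for this description, not the author's code.

import numpy as np
from math import comb, exp

rng = np.random.default_rng(2)

def sample_tilted_labels(values, r, theta=3.0):
    m = len(values)
    order = np.argsort(-np.asarray(values))                  # J: ranks of the values, largest first
    ks = np.arange(max(0, 2 * r - m), r + 1)                 # possible values of K
    probs = np.array([comb(r, int(k)) * comb(m - r, r - int(k)) * exp(theta * k) for k in ks])
    Z = probs.sum()                                          # sum_k C(r,k) C(m-r,r-k) exp(theta k)
    k = int(rng.choice(ks, p=probs / Z))
    top = rng.choice(r, size=k, replace=False)               # label-one ranks among {0,...,r-1}
    rest = r + rng.choice(m - r, size=r - k, replace=False)  # label-one ranks among {r,...,m-1}
    labels = np.zeros(m, dtype=int)
    labels[order[np.concatenate([top, rest])]] = 1
    weight = Z * exp(-theta * k) / comb(m, r)                # P(pi)/Q(pi) = r!(m-r)! Z / {m! exp(theta K)}
    return labels, weight

# With theta = 0 the draw is a uniform relabelling and the weight is identically 1,
# recovering direct sampling from the null conditional distribution.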

The simulation example in § 4.2 also uses conditional inference, so the target and proposal can depend on the observed margins, r and c, of the matrix X. For the target distribution Pθ defined in (10), we create the proposal Qθ by modifying the importance sampling algorithm described in the author’s 2009 unpublished preprint, arXiv:0906.1004v1, which is designed for the uniform distribution, i.e., θ = 0. Specifically, we use equation (3.12a) of that preprint, and we include the additional weights {exp(θνij)λij}bi inside the product in equation (3.6) of that preprint when sampling from the jth column, where λ is a solution to the system of equations

\lambda_{ij} = (N - j) \Big/ \Bigl[ M \sum_{\ell=j+1}^{N} \exp(\theta \nu_{i\ell}) \Bigl\{ \sum_{k=1}^{M} \lambda_{kj}\, \exp(\theta \nu_{k\ell}) \Bigr\}^{-1} \Bigr] \qquad (i = 1, \ldots, M;\ j = 1, \ldots, N-1).

Proofs

We prove Theorem 1 with a succession of lemmas. The notation and conventions follow § 2 with the following additions. We use E to denote expectation. For any sequence y = (y0, …, yn) and any k ∊ {0, …, n}, let y^{(k)} = (yk, y1, …, yk−1, y0, yk+1, …, yn) denote the sequence obtained by swapping the 0th element and the kth element in the original sequence y. Let Π denote a random permutation chosen uniformly from 𝒨 and chosen independently of Z. Let U = Z(Π). If π ∊ 𝒨 is a fixed permutation, then Π and Π(π) have the same distribution, which means U and U(π) have the same distribution. In particular, U and U^{(k)} have the same distribution. Lemma A1 is the key algebraic inequality.

Lemma A1. For all t0, …, tn ∊ [−∞, ∞] and all α, w0, …, wn ∊ [0, ∞],

\sum_{k=0}^{n} w_k\, \mathbb{1}\Bigl\{\sum_{i=0}^{n} w_i\, \mathbb{1}(t_i \ge t_k) \le \alpha\Bigr\} \le \alpha.

Proof. Let H denote the left side of the desired inequality. We can assume that H > 0, since the statement is trivial otherwise. The pairs (t0, w0), …, (tn, wn) can be reordered without affecting the value of H, so we can assume that t0 ⩾ ⋯ ⩾ tn. This implies that ∑_{i=0}^{n} wi 𝟙(ti ⩾ tk) is increasing in k and that there exists a k*, defined as the largest k for which ∑_{i=0}^{n} wi 𝟙(ti ⩾ tk) ⩽ α. So H = ∑_{k=0}^{k*} wk = ∑_{i=0}^{k*} wi 𝟙(ti ⩾ tk*) ⩽ ∑_{i=0}^{n} wi 𝟙(ti ⩾ tk*) ⩽ α.

Lemma A2. For any nonnegative, measurable function f on Sn+1,

E\{f(Z)\} = E\Bigl\{(n+1)^{-1} \sum_{k=0}^{n} w(Y_k)\, f(Y^{(k)})\Bigr\}.

Proof. Recall that Zi = Yi ∼ Q for i ≠ 0 and that Z0 = X ∼ P. A change of variables from X to Y0 gives E{f(Z)} = E{w(Y0) f(Y)}. Since the distribution of Y is invariant to permutations, we have E{w(Y0) f(Y)} = E{w(Yk) f(Y^{(k)})} for each k = 0, …, n, which means that E{f(Z)} = E{w(Y0) f(Y)} = (n + 1)^{-1} ∑_{k=0}^{n} E{w(Yk) f(Y^{(k)})}. Moving the sum inside the expectation completes the proof.

Lemma A3. Let PZ and PU denote the distributions of Z and U, respectively, over Sn+1. Then PZ ≪ PU and

\frac{dP_Z}{dP_U}(u) = \frac{(n+1)\, w(u_0)}{\sum_{j=0}^{n} w(u_j)}

almost surely.

Proof. Let g be any nonnegative, measurable function on Sn+1. It is enough to show that

E\{g(Z)\} = E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, g(U)\Bigr\}. \qquad (A1)

For any π ∊ 𝒨, there is a unique inverse permutation π^{-1} ∊ 𝒨 with π_{π^{-1}_i} = π^{-1}_{π_i} = i for each i = 0, …, n. Comparing with the proof of Lemma A2, for any nonnegative, measurable function f, we have

E\{f(Z(\pi))\} = E\{f(Y(\pi))\, w(Y_0)\} = E\{f(Y)\, w(Y_{\pi^{-1}_0})\}, \qquad (A2)

where the first equality is a change of variables from Z to Y and the second equality follows from the fact that the distribution of Y is permutation invariant, so, in particular, Y and Y(π−1) have the same distribution. Using (A2) gives

E\{f(U)\} = \frac{1}{(n+1)!} \sum_{\pi \in \mathcal{M}} E\{f(U) \mid \Pi = \pi\}
= \frac{1}{(n+1)!} \sum_{\pi \in \mathcal{M}} E\{f(Z(\pi))\}
= \frac{1}{(n+1)!} \sum_{\pi \in \mathcal{M}} E\{f(Y)\, w(Y_{\pi^{-1}_0})\}
= \frac{1}{(n+1)!} \sum_{j=0}^{n} \sum_{\pi \in \mathcal{M} :\, \pi^{-1}_0 = j} E\{f(Y)\, w(Y_j)\}
= \frac{1}{(n+1)!} \sum_{j=0}^{n} n!\, E\{f(Y)\, w(Y_j)\}
= E\Bigl\{\frac{\sum_{j=0}^{n} w(Y_j)}{n+1}\, f(Y)\Bigr\}. \qquad (A3)

Applying (A3) to the function f(u) = (n + 1) g(u) w(u0)/∑_{j=0}^{n} w(uj) gives

E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, g(U)\Bigr\}
= E\Bigl\{\frac{\sum_{j=0}^{n} w(Y_j)}{n+1}\, \frac{(n+1)\, w(Y_0)}{\sum_{j=0}^{n} w(Y_j)}\, g(Y)\Bigr\}
= E\{w(Y_0)\, g(Y)\} = E\{g(Z)\},

where the last equality is a change of variables as in (A2). This gives (A1) and completes the proof.

Lemma A4. For any nonnegative, measurable function f on Sn+1,

E\{f(Z)\} = E\Bigl\{\sum_{k=0}^{n} \frac{w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, f(U^{(k)})\Bigr\}.

Proof. Changing variables from Z to U and using Lemma A3 gives

E\{f(Z)\} = E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, f(U)\Bigr\}. \qquad (A4)

Since the distribution of U is invariant to permutations, we have

E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, f(U)\Bigr\} = E\Bigl\{\frac{(n+1)\, w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, f(U^{(k)})\Bigr\} \qquad (k = 0, \ldots, n). \qquad (A5)

Combining (A4) and (A5) and averaging over k gives

E\{f(Z)\} = E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, f(U)\Bigr\} = \frac{1}{n+1} \sum_{k=0}^{n} E\Bigl\{\frac{(n+1)\, w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, f(U^{(k)})\Bigr\}.

Moving the sum inside the expectation and cancelling the (n + 1) terms completes the proof.

Proof of Theorem 1 for c = 1. Applying Lemma A2 to the function f(z) = 𝟙{p̃1(z) ⩽ α} gives

\mathrm{pr}\{\tilde{p}_1(Z) \le \alpha\} = E[\mathbb{1}\{\tilde{p}_1(Z) \le \alpha\}]
= E\Bigl[(n+1)^{-1} \sum_{k=0}^{n} w(Y_k)\, \mathbb{1}\{\tilde{p}_1(Y^{(k)}) \le \alpha\}\Bigr]
= E\Bigl((n+1)^{-1} \sum_{k=0}^{n} w(Y_k)\, \mathbb{1}\Bigl[(n+1)^{-1} \sum_{i=0}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i, Y) \ge t(Y_k, Y)\} \le \alpha\Bigr]\Bigr).

The quantity inside the final expectation is always at most α, which follows from Lemma A1 by taking tℓ = t(Yℓ, Y) and wℓ = w(Yℓ)/(n + 1) for each ℓ = 0, …, n.

Proof of Theorem 1 for c = 2. Applying Lemma A4 to the function f(z) = 𝟙{p̃2(z) ⩽ α} gives

\mathrm{pr}\{\tilde{p}_2(Z) \le \alpha\} = E[\mathbb{1}\{\tilde{p}_2(Z) \le \alpha\}]
= E\Bigl[\sum_{k=0}^{n} \frac{w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, \mathbb{1}\{\tilde{p}_2(U^{(k)}) \le \alpha\}\Bigr]
= E\Bigl(\sum_{k=0}^{n} \frac{w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, \mathbb{1}\Bigl[\sum_{i=0}^{n} \frac{w(U_i)}{\sum_{j=0}^{n} w(U_j)}\, \mathbb{1}\{t(U_i, U) \ge t(U_k, U)\} \le \alpha\Bigr]\Bigr).

The quantity inside the final expectation is always at most α, which follows from Lemma A1 by taking tℓ = t(Uℓ, U) and wℓ = w(Uℓ)/{∑_{j=0}^{n} w(Uj)} for each ℓ = 0, …, n.

References

1. Besag J, Clifford P. Generalized Monte Carlo significance tests. Biometrika. 1989;76:633–42.
2. Bolviken E, Skovlund E. Confidence intervals from Monte Carlo tests. J. Am. Statist. Assoc. 1996;91:1071–8.
3. Brazzale AR. hoa: an R package bundle for higher order likelihood inference. R News. 2005;5:20–7. ISSN 1609-3631, ftp://cran.r-project.org/doc/Rnews/Rnews_2005-1.pdf.
4. Brazzale AR, Davison AC. Accurate parametric inference for small samples. Statist. Sci. 2008;23:465–84.
5. Cox DR. The regression analysis of binary sequences. J. R. Statist. Soc. B. 1958;20:215–42.
6. Cytel. LogXact 9. Cambridge, MA: Cytel Inc.; 2010.
7. Davison AC, Hinkley DV. Bootstrap Methods and Their Application. Cambridge, UK: Cambridge University Press; 1997.
8. Garthwaite PH, Buckland ST. Generating Monte Carlo confidence intervals by the Robbins–Monro process. J. R. Statist. Soc. C. 1992;41:159–71.
9. Green PJ. Discussion of the paper by Geyer and Thompson. J. R. Statist. Soc. B. 1992;54:683–4.
10. Hesterberg T. Weighted average importance sampling and defensive mixture distributions. Technometrics. 1995;37:185–94.
11. Lehmann EL, Romano JP. Testing Statistical Hypotheses. 3rd edn. New York: Springer; 2005.
12. Liu JS. Monte Carlo Strategies in Scientific Computing. New York: Springer; 2001.
13. Mehta CR, Patel NR. Exact logistic regression: theory and examples. Statist. Med. 1995;14:2143–60. doi: 10.1002/sim.4780141908.
14. Rasch G. On general laws and the meaning of measurement in psychology. In: Neyman J, editor. Proc. 4th Berkeley Symp. Math. Statist. Prob. Vol. 4. Berkeley, CA: University of California Press; 1961. pp. 321–34.
15. StataCorp. Stata Statistical Software: Release 11. College Station, TX: StataCorp LP; 2009.
16. Westfall PH, Young SS. Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. New York: Wiley-Interscience; 1993.
17. Zamar D, McNeney B, Graham J. elrm: software implementing exact-like inference for logistic regression models. J. Statist. Software. 2007;21:1–18.
