Biometrika. 2012 Feb 7;99(1):57–69. doi: 10.1093/biomet/asr079

Conservative hypothesis tests and confidence intervals using importance sampling

Matthew T Harrison 1

Summary

Importance sampling is a common technique for Monte Carlo approximation, including that of p-values. Here it is shown that a simple correction of the usual importance sampling p-values provides valid p-values, meaning that a hypothesis test created by rejecting the null hypothesis when the p-value is at most α will also have a Type I error rate of at most α. This correction uses the importance weight of the original observation, which gives valuable diagnostic information under the null hypothesis. Using the corrected p-values can be crucial for multiple testing and also in problems where evaluating the accuracy of importance sampling approximations is difficult. Inverting the corrected p-values provides a useful way to create Monte Carlo confidence intervals that maintain the nominal significance level and use only a single Monte Carlo sample.

Some key words: Exact inference, Monte Carlo simulation, Multiple testing, p-value, Rasch model

1. Introduction

Importance sampling is a common technique for Monte Carlo approximation. Besides its use in situations where efficient direct-sampling algorithms are unavailable, it can be used to accelerate the approximation of small p-values as needed for multiple hypothesis testing. Importance sampling can also be used for Monte Carlo approximation of confidence intervals using a single Monte Carlo sample by inverting a family of hypothesis tests. The practicality of these two uses of importance sampling, accelerated multiple testing and Monte Carlo approximation of confidence intervals, would seem to be limited by the fact that each places strong requirements on the importance sampling procedure. Multiple testing controls can be sensitive to small absolute errors but large relative errors in small p-values, which would seem to demand either excessively large Monte Carlo sample sizes or unrealistically accurate importance sampling proposal distributions in order to reduce absolute errors to tolerable levels. The confidence interval procedure uses a single proposal distribution to approximate probabilities under a large family of target distributions, which again would seem to demand large sample sizes in order to overcome the high variability that is frequently encountered when the proposal distribution is not tailored to a specific target distribution. Nevertheless, as shown here, simple corrections of the usual p-value approximations can be used to overcome these difficulties, making importance sampling a practical choice for accelerated multiple testing and for constructing Monte Carlo confidence intervals.

The p-value corrections that we introduce require negligible additional computation and still converge to the target p-value, but are valid p-values, meaning the probability of rejecting a null hypothesis does not exceed α for any specified level α. Hypothesis tests and confidence intervals constructed from the corrected p-value approximations are guaranteed to be conservative, regardless of the Monte Carlo sample size, while still behaving like the target tests and intervals for sufficiently large Monte Carlo sample sizes. The combination of being both conservative and consistent turns out to be crucial in many applications where the importance sampling variability cannot be adequately controlled with practical amounts of computing, including multiple testing, confidence intervals and any hypothesis testing situation where the true variability of the importance sampling algorithm is either large or unknown.

Let X denote the observed data. Assume that the null hypothesis specifies a known probability distribution P for X. For a specified test statistic t(X), the goal is to compute the p-value p(X) defined by

p(x) = \mathrm{pr}\{t(X) \ge t(x)\}, \qquad (1)

where the probability is computed under the null hypothesis. Importance sampling can be used to approximate p(X) if the p-value cannot be determined analytically. Let Y = (Y1, …, Yn) be a random sample from the proposal distribution Q whose support includes the support of the target distribution P. Then p(X) can be approximated with either of

p_1(X, Y) = \frac{1}{n} \sum_{i=1}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i) \ge t(X)\}, \qquad
p_2(X, Y) = \frac{\sum_{i=1}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i) \ge t(X)\}}{\sum_{j=1}^{n} w(Y_j)},

where 𝟙 is the indicator function, and the importance weights are defined to be the ratio of probability mass functions in the discrete case, i.e., w(x) = P(x)/Q(x), and the ratio of densities in the continuous case. Each pc is a consistent approximation of p(X) as the Monte Carlo sample size n increases. The approximation p1 is unbiased and is especially useful for accelerating the approximation of extremely small p-values. The approximation p2 can be evaluated even if the importance weights are known only up to a constant of proportionality. The reader is referred to Liu (2001) for details and references concerning importance sampling and to Lehmann & Romano (2005) for hypothesis testing.
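For concreteness, the following minimal Python sketch computes p1 and p2 from a Monte Carlo sample; the arrays are assumed to hold the simulated test statistics t(Y1), …, t(Yn) and the importance weights w(Y1), …, w(Yn), however they were produced, and the code is only an illustration of the two formulas, not the software used in this paper.

import numpy as np

def uncorrected_pvalues(t_obs, t_sim, w_sim):
    # t_obs: observed statistic t(X); t_sim, w_sim: t(Y_i) and w(Y_i) for i = 1, ..., n
    t_sim = np.asarray(t_sim, dtype=float)
    w_sim = np.asarray(w_sim, dtype=float)
    hits = w_sim * (t_sim >= t_obs)                  # w(Y_i) 1{t(Y_i) >= t(X)}
    p1 = hits.sum() / len(t_sim)                     # unbiased approximation p_1
    wsum = w_sim.sum()
    p2 = hits.sum() / wsum if wsum > 0 else 0.0      # self-normalized approximation p_2
    return p1, p2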

Here we propose the following simple corrections of pc that use the importance weight of the original observation, namely

\tilde{p}_1(X, Y) = \frac{w(X) + \sum_{i=1}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i) \ge t(X)\}}{1 + n}, \qquad
\tilde{p}_2(X, Y) = \frac{w(X) + \sum_{i=1}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i) \ge t(X)\}}{w(X) + \sum_{j=1}^{n} w(Y_j)}. \qquad (2)

We show that the corrected p-value approximations, while clearly still consistent approximations of the target p-value p(X), are also themselves valid, meaning

\mathrm{pr}\{\tilde{p}_c(X, Y) \le \alpha\} \le \alpha \qquad (0 \le \alpha \le 1;\ n = 0, 1, \ldots;\ c = 1, 2),

under the null hypothesis, where the probability is with respect to the joint distribution of data and Monte Carlo sample.
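In code, the correction simply treats the observed data point as one extra draw whose weight is w(X); the sketch below, a companion to the function above, is illustrative only and not the paper's software.

import numpy as np

def corrected_pvalues(t_obs, w_obs, t_sim, w_sim):
    # w_obs: importance weight w(X) of the original observation
    t_sim = np.asarray(t_sim, dtype=float)
    w_sim = np.asarray(w_sim, dtype=float)
    num = w_obs + (w_sim * (t_sim >= t_obs)).sum()   # observed point always counts, since t(X) >= t(X)
    p1_tilde = num / (1 + len(t_sim))                # corrected p~_1 in (2)
    p2_tilde = num / (w_obs + w_sim.sum())           # corrected p~_2 in (2)
    return p1_tilde, p2_tilde

# With Q = P and w identically 1, p1_tilde reduces to (1 + #{t(Y_i) >= t(X)}) / (1 + n),
# the familiar direct-sampling Monte Carlo p-value.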

For the special case of direct sampling, i.e., Q = P and w ≡ 1, the validity of p̃c(X, Y) = (1 + n)^{-1}[1 + ∑_{i=1}^{n} 𝟙{t(Yi) ⩾ t(X)}] is well known and is often recommended over pc for use as a p-value (Davison & Hinkley, 1997, Ch. 4). This paper was motivated by trying to extend the special case of direct sampling to the more general case of importance sampling. For Markov chain Monte Carlo approaches, Besag & Clifford (1989) demonstrate how to generate valid p-value approximations using techniques that are unrelated to the ones here. It would be interesting to know whether the two approaches could be combined so that the importance samples could also be generated by a Markov chain.

2. Main results

The main result is that each p̃c is a valid p-value. We generalize the introductory discussion in two directions. First, we allow arbitrary distributions, so the importance weights become the Radon–Nikodym derivative dP/dQ. In the discrete case, this simplifies to the ratio of probability mass functions and, in the continuous case, to the ratio of probability density functions. Secondly, we allow the choice of test statistic to depend on (X, Y1, …, Yn) as long as the choice is invariant to permutations of (X, Y1, …, Yn). We express this mathematically by writing the test statistic, t, as a function of two arguments: the first is the same as before, but the second argument takes the entire sequence (X, Y1, …, Yn), although we require that t is invariant to permutations in the second argument. For example, we may want to transform the sequence (X, Y1, …, Yn) in some way, either before or after applying a test statistic to the individual entries. As long as the transformation procedure is permutation invariant, such as centring and scaling, or converting to ranks, the main result still holds. Transformations are often desirable in multiple testing contexts for improving balance (Westfall & Young, 1993).

We begin with the precise notation and assumptions for the main result. Let P and Q denote probability distributions defined on the same measurable space (S, 𝒮) with P ≪ Q, meaning that sets with positive P probability also have positive Q probability, and let w denote a fixed, nonnegative version of the Radon–Nikodym derivative dP/dQ. The set 𝒨 denotes all (n + 1)! permutations π = (π0, …, πn) of (0, …, n). For π ∊ 𝒨 and z = (z0, …, zn), we define z(π) = (zπ0, …, zπn). Assume that t : S × Sn+1 ↦ [−∞, ∞] is a measurable function satisfying t(a, z) = t(a, z(π)) for all a ∊ S, z ∊ Sn+1 and π ∊ 𝒨. For z ∊ Sn+1, define

\tilde{p}_1(z) = \frac{\sum_{i=0}^{n} w(z_i)\, \mathbb{1}\{t(z_i, z) \ge t(z_0, z)\}}{n + 1}, \qquad
\tilde{p}_2(z) = \frac{\sum_{i=0}^{n} w(z_i)\, \mathbb{1}\{t(z_i, z) \ge t(z_0, z)\}}{\sum_{j=0}^{n} w(z_j)}, \qquad (3)

where we use the convention 0/0 = 0. Let X have distribution P and Y0, Y1, …, Yn be independent and identically distributed with common distribution Q, independent of X. For notational convenience, define Z = (Z0, Z1, …, Zn) by Z0 = X, Z1 = Y1, …, Zn = Yn, so that the corrected p-values defined in (2) can be written as p̃c(Z) from (3). Then, we obtain the following theorem.

Theorem 1. For each c = 1, 2, we have pr{p̃c(Z) ⩽ α} ⩽ α for all α ∊ [0, 1].

Proof. See the Appendix.

The theorem does not require any special relationships among P, Q, t or n. For example, in parametric settings, Q does not need to be in the same model class as the null and/or alternative hypotheses. Validity of the corrected p-values is ensured even for unusual cases such as n = 0 or importance weights with infinite variance. We discuss some practical considerations for choosing Q in the next section; presumably, P, t and n are dictated by the application.

The theorem continues to hold if each Yk is chosen from a different Q, say Qk, as long as the sequence Q0, Q1, …, Qn is itself independent and identically distributed from some distribution over proposal distributions. The kth importance weight is now a fixed version of the Radon–Nikodym derivative dP/dQk. This generalization can be useful in practice when each Yk is generated hierarchically by first choosing an auxiliary variable Vk and then, given Vk, choosing Yk from some distribution QVk that depends on Vk. If the marginal distribution Q of Yk is not available, one can use importance weights based on the conditional distribution QVk of Yk given Vk. This is equivalent to using random proposal distributions as described above. The drawback of combining this approach with the p-value corrections is that the importance weight of the original observation is evaluated using dP/dQ0, where Q0 is a randomly chosen proposal distribution. This further increases the Monte Carlo variation inherent in the p-value approximations. Generalizing the theorem to allow for random proposals requires no additional work. If ν is the joint distribution of each (Qk, Yk) and νQ is the marginal distribution of each Qk, simply apply the theorem with νQ × P in place of P, ν in place of Q, (Q0, X) in place of X and (Qk, Yk) in place of Yk. If one uses a regular conditional distribution to define the importance weight {d(νQ × P)/dν}(Qk, x), then it will simplify to a version of (dP/dQk)(x).

The theorem concerns p-values for the simple hypothesis test that X ∼ P, but valid p-values for simple tests can be used to generate valid p-values for composite tests by maximizing over all distributions in the null hypothesis. Consider a collection of distributions, {Pθ : θ ∊ Θ}, corresponding test statistics, {tθ : θ ∊ Θ}, and corresponding importance weights, wθ = dPθ/dQ, where the proposal distribution, Q, satisfies Pθ ≪ Q for all θ ∊ Θ. For each θ, define the corrected p-value approximations p̃c(θ, z) as in (3) using tθ and wθ. The theorem shows that each p̃c(θ0, Z) is a valid p-value for the simple null hypothesis that X ∼ Pθ for θ = θ0. An immediate corollary is that supθ∊Θ0 p̃c(θ, Z) is a valid p-value for the composite null hypothesis that X ∼ Pθ for some θ ∊ Θ0, where Θ0 ⊂ Θ. Only a single Monte Carlo sample is required.

Finally, we recall the well-known fact that any valid family of p-values can be inverted to give valid confidence intervals. For α ∊ [0, 1] and c = 1, 2, define the random set

\tilde{C}_c^{\alpha}(Z) = \{\theta \in \Theta : \tilde{p}_c(\theta, Z) > \alpha\}. \qquad (4)

Then C̃cα is a 1 − α confidence set for θ because prθ{C̃cα(Z) ∋ θ} = 1 − prθ{p̃c(θ, Z) ⩽ α} ⩾ 1 − α, where prθ is calculated under the null hypothesis of Pθ. Only a single Monte Carlo sample is required. Section 4.2 illustrates this application. The idea of using importance sampling to construct confidence intervals from a single Monte Carlo sample was pointed out in Green (1992). See Bolviken & Skovlund (1996) and Garthwaite & Buckland (1992) for examples of other ways to create Monte Carlo confidence intervals.
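As a sketch of this inversion, suppose a user-supplied function p_tilde(theta) returns p̃c(θ, Z) computed from the single Monte Carlo sample Z; then the confidence set (4) and the composite-null p-value of § 2 can both be read off a grid of θ values. The callback and the grid are hypothetical placeholders.

import numpy as np

def confidence_set(theta_grid, p_tilde, alpha=0.05):
    # keep the grid points whose corrected p-value exceeds alpha, as in (4)
    pvals = np.array([p_tilde(th) for th in theta_grid])
    return np.asarray(theta_grid)[pvals > alpha]

def composite_pvalue(theta_null_grid, p_tilde):
    # valid p-value for the composite null {P_theta : theta in Theta_0},
    # approximated by maximizing over a grid covering Theta_0
    return max(p_tilde(th) for th in theta_null_grid)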

3. Practical considerations

Validity of the corrected p-values holds for any n, but validity alone is not a useful property. The p-value approximations must also be close to the target p-value in order to have the appropriate power characteristics. Convergence to the target p-value typically requires large n, especially for importance sampling algorithms with high variance. Furthermore, for the special case of direct sampling, i.e., Q = P, excessively small n can adversely affect power (Davison & Hinkley, 1997, § 4.2.5). Presumably, a similar effect holds for importance sampling, although the effect is more nuanced, because the power characteristics of the p-value approximations can be strongly influenced by the choice of Q for small n, and the resulting hypothesis tests need not be uniformly less powerful than the target test.

It is clear that each of the corrected p-values defined in (2) has the same asymptotic behaviour as its uncorrected counterpart, so for sufficiently large n they are essentially equivalent. But, in practice, n will often be too small to ensure this equivalence. The concern for the investigator is that the corrected p-values may behave much worse than the uncorrected ones for practical choices of n. In particular, power suffers when the null hypothesis is false, the target p-value p is small, the uncorrected approximation pc is also small, but the corrected approximation p̃c is large. Careful choice of Q can lower the chances of this.

Each of the corrected p-values can be expressed as an interpolation between their respective uncorrected versions and another valid p-value:

\tilde{p}_1(z) = \frac{1}{n+1}\, w(x) + \Bigl(1 - \frac{1}{n+1}\Bigr) p_1(z), \qquad (5)
\tilde{p}_2(z) = \frac{w(x)}{w(x) + \sum_{j=1}^{n} w(y_j)} + \Bigl\{1 - \frac{w(x)}{w(x) + \sum_{j=1}^{n} w(y_j)}\Bigr\} p_2(z), \qquad (6)

where we write z = (x, y1, …, yn). In both cases, power suffers when X comes from a distribution in the alternative hypothesis, and Q is such that w(X) is typically large relative to w(Y1), …, w(Yn). Since w(Yk) always has mean 1, problems might arise for alternative distributions that tend to make w(X) much larger than 1. This problem can be avoided by choosing Q so that it gives more weight than does P to regions of the sample space that are more typical under alternative distributions than they are under P. The problem can also be avoided by choosing Q to be similar to P so that the weights are close to 1 throughout the sample space. Most proposal distributions are designed with one of these two goals in mind, so the corrected p-values should behave well for well-designed proposal distributions. In practice, however, proposal distributions can be quite bad, and it can be helpful to look more closely at how the proposal affects the power of the corrected p-values when they have not yet converged to the target p-value.

From (5) we see that p̃1 is an interpolation between p1 and w(x), the latter being a valid p-value. Validity of w(X) can be seen either by Theorem 1 for c = 1, n = 0 or by the following calculation

\mathrm{pr}\{w(X) \le \alpha\} = \sum_{x} \mathbb{1}\{P(x) \le \alpha Q(x)\}\, P(x) \le \sum_{x} \alpha Q(x) = \alpha \qquad (7)

for discrete settings that easily generalizes to more abstract settings. Testing w(X) = P(X)/Q(X) ⩽ α is simply a likelihood ratio test of P versus Q, although using the critical value α will give a test of size smaller than α. The test can be strongly conservative, because the bound in (7) can be far from tight. So, p̃1 is an interpolation between a conservative likelihood ratio test of P versus Q and an importance sampling Monte Carlo approximation of the target p-value. If the effect of w(X) on p̃1 has not disappeared, such as for small n, then p̃1 is a p-value for the original null hypothesis, P, versus a new alternative hypothesis that is some amalgamation of the original alternative hypothesis and the proposal distribution Q. If Q is not in the alternative hypothesis, as it often will not be, this modified alternative hypothesis is important to keep in mind when interpreting any rejections.

The effect of Q on interpreting rejections is especially important in multiple testing situations. In these settings, importance sampling is often used to accelerate the approximation of tiny p-values via p1. If p1 ≈ 0, however, then p̃1 ≈ w(x)/(n + 1) and Q plays a critical role in determining which null hypotheses are rejected after correcting for multiple comparisons. The investigator can take advantage of this by ensuring that a rejection of P in favour of Q is sensible for the problems at hand, and by ensuring that events with small p-values will be heavily weighted by Q.

Turning to (6), we see that p̃2 is an interpolation between 1 and p2. Since p2 ⩽ 1, we always have p̃2 ⩾ p2, so p̃2 leads to a more conservative test than p2. Its conservatism is controlled by the ratio w(x)/∑j w(yj). If the ratio is small, then the correction ensures validity with little loss of power. If it is large, there may be a substantial loss of power when using the correction. The ratio will be approximately 1/n if P and Q are similar, which is often the design criterion when using p2.

An important caveat for the main results is that Q is not allowed to depend on the observed value of X. This caveat precludes one of the classical uses of importance sampling, namely using the observed value of t(X) to design a Q that heavily weights the event {x : t(x) ⩾ t(X)}. In many cases, however, since the functional form of t is known, one can a priori design a Q that will heavily weight the event {x : t(x) ⩾ t(X)} whenever the p-value would be small. For example, given a family of proposal distributions {Qℓ}, each of which might be useful for a limited range of observed values of t(X), one can use finite mixture distributions of the form Q = ∑_{ℓ=1}^{L} λℓ Qℓ to obtain more robust performance. For similar reasons, mixture distributions are also useful for creating Monte Carlo confidence intervals in which the same proposal distribution is used for a family of target distributions. The example in § 4.2 has more details, and Hesterberg (1995) contains a more general discussion of the utility of mixture distributions for importance sampling. This caveat about Q does not preclude conditional inference, in which case P is the null conditional distribution of X given some appropriate A(X), and all of the analysis takes place after conditioning on A(X). The choice of Q and t can thus depend on A(X), but not on additional details about X.
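A minimal sketch of such a defensive mixture proposal follows; the component samplers, component densities and target density are supplied as hypothetical callbacks, and the weight of each draw is evaluated against the full mixture density.

import numpy as np

rng = np.random.default_rng(0)

def sample_from_mixture(samplers, pdfs, target_pdf, lam, n):
    # samplers[l]() draws from Q_l; pdfs[l](y) evaluates the Q_l density;
    # lam are the mixture weights lambda_l, assumed to sum to 1
    lam = np.asarray(lam, dtype=float)
    draws, weights = [], []
    for _ in range(n):
        l = rng.choice(len(lam), p=lam)                        # pick a component
        y = samplers[l]()
        q = sum(lam[j] * pdfs[j](y) for j in range(len(lam)))  # mixture density Q(y)
        draws.append(y)
        weights.append(target_pdf(y) / q)                      # w(y) = P(y)/Q(y)
    return draws, np.array(weights)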

4. Applications

4.1. Accelerating multiple permutation tests

Consider a collection of N datasets. The ith dataset Xi = (Vi, Li) contains a sample of values Vi = (Vi_1, …, Vi_{m_i}) and corresponding labels Li = (Li_1, …, Li_{m_i}). The distributions over values and labels are unknown and perhaps unrelated across datasets. We are interested in identifying which datasets show dependence between the values and the labels. From a multiple hypothesis testing perspective, the ith null hypothesis is that Vi and Li are independent, or more generally, that the values of Li are exchangeable conditioned on the values of Vi. Given a test statistic t(x) = t(υ, ℓ), a permutation test p-value for the ith null hypothesis is

p(X^i) = (m_i!)^{-1} \sum_{\pi} \mathbb{1}\bigl[t\{V^i, (L^i)(\pi)\} \ge t(V^i, L^i)\bigr], \qquad (8)

where the sum is over all permutations π = (π1, …, πmi) of (1, …, mi), and where the notation ℓ(π) = (ℓπ1, …, ℓπmi) denotes a permuted version of the elements of ℓ using the permutation π. Inference is conditioned on the values and the labels, but not their pairings. The p-value in (8) agrees with (1) by taking the ith null distribution, Pi, to be uniform over permutations of the labels. The Bonferroni correction can be used to adjust for multiple tests (Lehmann & Romano, 2005). In particular, rejecting null hypothesis i whenever p(Xi) ⩽ α/N ensures that the probability of even a single false rejection is no more than α. Although computing p exactly is often prohibitive, Monte Carlo samples from Pi are readily available for approximating p, and importance sampling can be used to accelerate the approximation of tiny p-values. The Bonferroni correction is still sensible, provided the Monte Carlo approximate p-values are valid p-values, which is ensured by using the corrected p-value approximations.

Consider N = 10^4 independent datasets, each with mi = 100 scalar values and corresponding binary labels. In each case, the 100 values are mutually independent and there are 40 labels of 1 and 60 labels of 0. In 9990 datasets, the values have a Cauchy distribution with median 0 and scale 1 irrespective of the labels, i.e., the null hypothesis is true. In 10 of the datasets, the value associated with label ℓ ∈ {0, 1} has a Cauchy distribution with median 2ℓ and scale 1, i.e., there is a relationship between values and labels, and the null hypothesis is false. This example was chosen because standard tests like the two-sample t-test and the Wilcoxon rank-sum test perform poorly. For the alternative hypothesis where label 1 tends to have larger values than label 0, a sensible test statistic is the difference in medians between values associated to label 1 and values associated to label 0, namely t(x) = t(υ, ℓ) = median(υj : ℓj = 1) − median(υj : ℓj = 0).
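The following sketch reproduces one alternative-hypothesis dataset of this design and computes the direct-sampling corrected p-value p̃P with the median-difference statistic; the importance sampling versions p1 and p̃1 additionally use the tilted proposal described in the Appendix. The code is illustrative only.

import numpy as np

rng = np.random.default_rng(1)

def median_diff(values, labels):
    # t(v, l) = median(v_j : l_j = 1) - median(v_j : l_j = 0)
    return np.median(values[labels == 1]) - np.median(values[labels == 0])

def p_tilde_direct(values, labels, n=999):
    # corrected p-value for direct sampling (Q = P, w = 1): uniform relabelling
    t_obs = median_diff(values, labels)
    exceed = sum(median_diff(values, rng.permutation(labels)) >= t_obs for _ in range(n))
    return (1 + exceed) / (1 + n)

labels = np.repeat([1, 0], [40, 60])
values = 2.0 * labels + rng.standard_cauchy(100)   # Cauchy values, medians 2 and 0, scale 1
print(p_tilde_direct(values, labels))              # compare with alpha / N = 0.05 / 10**4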

Table 1 shows the α = 0.05 Bonferroni-corrected performance of several different Monte Carlo approximations as n increases. The approximations pP and p̃P refer to p1 and p̃1, respectively, for the special case of Q = P and w ≡ 1, i.e., direct sampling from the uniform distribution over permutations. Using either pP or p̃P would be the standard method of approximating p. The approximations p1 and p̃1 are based on a proposal distribution that prefers permutations that pair large values with label 1, as described in the Appendix. In each case, we reject when the approximate p-value does not exceed α/N. The final approximation q1 refers to a Wald 1 − α/(2N) upper confidence limit for p based on p1, using the same importance samples to approximate a standard error. For q1, we reject whenever q1 ⩽ α/(2N), which, if the confidence limits were valid, would provide the same guarantees as the Bonferroni correction.

Table 1.

Bonferroni-corrected performance versus Monte Carlo sample size for the simulation in § 4.1

           No. of correct rejections (out of 10)      No. of incorrect rejections (out of 9990)
  n        10^1    10^2    10^3    10^4               10^1    10^2    10^3    10^4
  pP         10      10      10      10                908      87      12       1
  p̃P          0       0       0       0                  0       0       0       0
  p1          9       8       7       5               7579    1903       2       0
  p̃1          7       6       6       5                  0       0       0       0
  q1          6       5       4       3               4545     585       0       0

Only p̃1 works well. The approximations that are not guaranteed to be valid p-values, namely pP, p1 and q1, require excessively large n before the false detection rate drops to the Bonferroni target of zero errors. The cases n = 10^3 and n = 10^4 are computationally burdensome, and the situation only worsens when N increases. Furthermore, in a real problem, the investigator has no way of determining that n is large enough. The confidence limit procedure q1 similarly fails because the standard error approximations are inaccurate for small n. The valid p-values, namely p̃P and p̃1, ensure that the Bonferroni correction works regardless of n, but p̃P ⩾ 1/(n + 1), so it is not useful after a Bonferroni correction unless n is extremely large. The good performance of p̃1 requires the combination of importance sampling, which allows tiny p-values to be approximated with small n, and the validity correction introduced in this paper, which allows multiple testing adjustments to work properly.

4.2. Exact inference for covariate effects in Rasch models

Let X = (Xij : i = 1, …, M; j = 1, …, N) be a binary M × N matrix. Consider the following logistic regression model for X. The Xij are independent Bernoulli (pij), where

\log \frac{p_{ij}}{1 - p_{ij}} = \kappa + \alpha_i + \beta_j + \theta \upsilon_{ij}, \qquad (9)

for unknown coefficients κ, α = (α1, …, αM), β = (β1, …, βN) and θ, and known covariates υ = (υij : i = 1, …, M; j = 1, …, N). The special case θ = 0 is called the Rasch model and is commonly used to model the response of M subjects to N binary questions (Rasch, 1961). In this example, we discuss inference about θ when κ, α, β are treated as nuisance parameters.

Consider first the problem of testing the null hypothesis of θ = θ′ versus the alternative hypothesis of θ ≠ θ′. Conditioning on the row and column sums, say r = (r1, …, rM) and c = (c1, …, cN), removes the nuisance parameters (Cox, 1958; Mehta & Patel, 1995), and the original composite null hypothesis reduces to the simple, conditional, null hypothesis of X ∼ Pθ′, where Pθ′ is the conditional distribution of X given the margins r, c for the model in (9) with θ = θ′, namely,

P_{\theta}(x) \propto \exp\Bigl(\theta \sum_{ij} x_{ij} \nu_{ij}\Bigr) \Bigl\{\prod_{i} \mathbb{1}\Bigl(\sum_{j} x_{ij} = r_i\Bigr)\Bigr\} \Bigl\{\prod_{j} \mathbb{1}\Bigl(\sum_{i} x_{ij} = c_j\Bigr)\Bigr\}. \qquad (10)

A sensible test statistic is the minimal sufficient statistic for θ under the conditional distribution Pθ defined in (10), namely t(x) = ∑ij xijυij. Since t (X) has power for detecting θ>θ′ and −t(X) has power for detecting θ < θ′, it is common to combine upper- and lower-tailed p-values into a single p-value, p±(θ′, X), defined by

p^{+}(\theta, x) = \mathrm{pr}_{\theta}\{t(X) \ge t(x)\}, \qquad p^{-}(\theta, x) = \mathrm{pr}_{\theta}\{t(X) \le t(x)\}, \qquad p^{\pm}(\theta, x) = \min[1,\, 2\min\{p^{+}(\theta, x),\, p^{-}(\theta, x)\}], \qquad (11)

where prθ uses X ∼ Pθ. There are no practical algorithms for computing p±(θ′, X) nor for direct sampling from Pθ′. We can, however, create a proposal distribution, say Qθ, for any Pθ, as described in the Appendix. Although Qθ(x) can be evaluated, Pθ(x) is known only up to a normalization constant, so we must use p2 and p̃2. To approximate p±, we use two Monte Carlo p-value approximations, one for each of t and −t, and combine them as in (11). Validity of the underlying p-values ensures validity of the combined p-value.
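A sketch of this combination for p̃2, assuming arrays of simulated statistics t(Yi) and possibly unnormalized weights, with the observed statistic and its unnormalized weight supplied separately; the function and argument names are illustrative.

import numpy as np

def p2_tilde_two_sided(t_obs, w_obs, t_sim, w_sim):
    # upper tail uses +t, lower tail uses -t; the weights may be unnormalized
    # because p~_2 is self-normalizing
    t_sim = np.asarray(t_sim, dtype=float)
    w_sim = np.asarray(w_sim, dtype=float)
    denom = w_obs + w_sim.sum()
    p_plus = (w_obs + (w_sim * (t_sim >= t_obs)).sum()) / denom
    p_minus = (w_obs + (w_sim * (t_sim <= t_obs)).sum()) / denom
    return min(1.0, 2.0 * min(p_plus, p_minus))    # combined p-value as in (11)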

Now consider the problem of creating confidence intervals for θ by inverting a family of Monte Carlo approximations of p± as in (4). Inference is still conditioned on the row and column sums, giving a unique conditional distribution Pθ for each θ. Unlike hypothesis testing, where the test of θ = θ′ can use a different proposal distribution for each θ′, the use of (4) requires a common proposal distribution for all θ. To create a common proposal distribution that might work well in practice, we use a mixture of proposals designed for specific θ. In particular, define Q = L^{-1} ∑_{ℓ=1}^{L} Qθℓ, where (θ1, …, θL) are fixed. In the example below, we use L = 601 and (θ1, …, θL) = (−6.00, −5.98, −5.96, …, 6.00). For any θ, the importance weights are

w(\theta, x) = \frac{P_{\theta}(x)}{Q(x)} = \frac{c_{\theta}\, \exp\{\theta t(x)\}}{Q(x)}

for any binary matrix x with the correct row and column sums, where cθ is an unknown constant that is not needed for computing p2 and p̃2. We will use p2+(θ, z) and p2−(θ, z) to denote p2(z) when computed using the importance weights w(θ, ·) and the test statistics +t and −t, respectively, where z = (x, y1, …, yn). We will use p2± = min{1, 2 min(p2+, p2−)} as in (11). Similarly, define p̃2+, p̃2− and p̃2± for p̃2. Inverting these p-values gives the confidence sets Cα(z) = {θ ∊ ℝ : p2±(θ, z) > α} and C̃α(z) = {θ ∊ ℝ : p̃2±(θ, z) > α}. For fixed z, the approximate p-values are well-behaved functions of θ, so it is straightforward to numerically approximate the confidence sets, which will typically be intervals. From Theorem 1 and the discussion in § 2, we see that C̃α maintains the nominal significance level, whereas Cα might not.

Table 2 describes the results of a simulation experiment that investigated the coverage properties of Cα and C̃α. In that experiment, we took M = 200, N = 10, and fixed the parameters κ, θ, α, β and the covariates υ. We set θ = 2 and generated 1000 datasets and importance samples in order to approximate the true coverage probabilities and the median interval lengths of various confidence intervals. Confidence intervals based on the corrected p-values maintain a coverage probability of at least 1 − α without becoming excessively long, while those based on uncorrected p-values can have much lower coverage probabilities. As the number of importance samples increases, the two confidence intervals begin to agree. They also roughly agree with the confidence intervals given by the approximate conditional inference methods reviewed in Brazzale & Davison (2008) and implemented in the cond R package (Brazzale, 2005), which had coverage probabilities for this simulation in the range of 94.5–94.7% and median lengths of 1.15–1.16, depending on the method of approximation. The conservatism of exact inference for discrete problems (Brazzale & Davison, 2008) is not an issue here, because the typical dataset in this simulation gives a conditional distribution of t(X) with ∼ 10^200 distinct values; for all practical purposes, the distribution is continuous.

Table 2.

Performance of 95% confidence intervals versus Monte Carlo sample size for the simulation in § 4.2

           % coverage probability                     Median length
  n        10      50      100     500               10      50      100     500
  Cα     27.8    77.7    85.7    93.3              0.34    0.96    1.08    1.16
  C̃α     99.6    98.1    96.9    95.2              2.09    1.52    1.40    1.22

Comparisons with other exact methods are not available, because existing enumeration software (StataCorp, 2009; Cytel, 2010) is unable to enumerate the state space or generate direct samples with a few gigabytes of memory, and the corresponding Markov chain Monte Carlo methods (Zamar et al., 2007) often fail to adequately sample the state space, even after many hours of computing. Classical, unconditional, asymptotic confidence intervals also behave poorly, because of the large number of nuisance parameters. For this simulation, 95% Wald intervals had only 82.2% coverage probability with a median length of 1.24, emphasizing the utility of conditional inference.

5. Discussion

The practical benefits of using p̃c over pc are clear. Valid p-values are always crucial for multiple testing adjustments. But even for individual tests, p̃c protects against false rejections resulting from high and possibly undiagnosed variability in the importance sampling approximations. There is almost no computational penalty for using the corrections, and little or no loss of power for well-behaved importance sampling algorithms. These advantages extend to confidence intervals constructed by inverting p̃c.

The corrections are designed to improve the interpretability of tests and intervals, but they are not designed to improve the accuracy of the p-value approximation. If one is interested in approximating the probability of the event in (1), but this probability will not be interpreted as a p-value, then pc may be preferable to p̃c. The corrections can play a diagnostic role, since large differences between p̃c and pc indicate convergence failure for at least one of the approximations, but close agreement between pc and p̃c need not indicate convergence. All four p-value approximations can be in close agreement but still far from p.

In many cases, it may be sensible to report both p̃c and pc. Reporting p̃c allows for hypothesis tests that respect the nominal level and approximate the power characteristics of p. Reporting pc, along with further approximations of the Monte Carlo standard error, provides information about the numerical value of p. The poor performance of q1 in § 4.1 shows that pc cannot be meaningfully corrected by relying on approximate standard errors. To ensure the nominal level of tests and intervals, p̃c is required.

Acknowledgments

This work was supported in part by the U.S. National Science Foundation and the U.S. National Institutes of Health, partly while the author was at the Department of Statistics at Carnegie Mellon University. The author thanks Stuart Geman, Jeffrey Miller, Lee Newberg, an associate editor and a referee for helpful comments.

Appendix

Proposal distributions

The simulation example in § 4.1 uses conditional inference, so the target and proposal distributions for each dataset i can depend on the observed values V = (V1, …, Vm) and the fact that there are r one-labels and m − r zero-labels, but cannot depend on the observed pairing of labels and values. We suppress the dependence on i in the notation. Let J = (J1, …, Jm) be a permutation that makes V nonincreasing, i.e., VJ1 ⩾ ⋯ ⩾ VJm. Choose a random permutation Π = (Π1, …, Πm) according to the distribution

\mathrm{pr}(\Pi = \pi) = \frac{\exp\{\theta \sum_{i=1}^{r} \mathbb{1}(\pi_i \le r)\}}{r!\,(m-r)!\, \sum_{k=0}^{m} \binom{r}{k} \binom{m-r}{r-k} \exp(\theta k)} \qquad (\theta \in \mathbb{R};\ r = 0, \ldots, m), \qquad \binom{a}{b} = \frac{a!}{b!\,(a-b)!},

where the binomial coefficients are defined to be zero if a < 0, b < 0 or a < b. Leaving V in the original observed order and permuting L so that L_{JΠ1} = ⋯ = L_{JΠr} = 1 and L_{JΠr+1} = ⋯ = L_{JΠm} = 0 gives a random pairing of values with labels. The case θ = 0 is the uniform distribution over permutations, i.e., the target null conditional distribution. We used θ = 3 for the proposal distribution, which assigns higher probability to those permutations that tend to match the label one with larger values.
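One way to implement this proposal is to draw the number K of the first r positions of Π that hold a value at most r from its marginal distribution, and then place the labels uniformly at random given K; the importance weight P(π)/Q(π), with P uniform over permutations, then depends only on K. The sketch below follows this route; it is an illustration written for this description, not the author's code.

import numpy as np
from math import comb, exp

rng = np.random.default_rng(2)

def sample_tilted_labels(values, r, theta=3.0):
    m = len(values)
    order = np.argsort(-np.asarray(values))                  # J: ranks of the values, largest first
    ks = np.arange(max(0, 2 * r - m), r + 1)                 # possible values of K
    probs = np.array([comb(r, int(k)) * comb(m - r, r - int(k)) * exp(theta * k) for k in ks])
    Z = probs.sum()                                          # sum_k C(r,k) C(m-r,r-k) exp(theta k)
    k = int(rng.choice(ks, p=probs / Z))
    top = rng.choice(r, size=k, replace=False)               # label-one ranks among {0,...,r-1}
    rest = r + rng.choice(m - r, size=r - k, replace=False)  # label-one ranks among {r,...,m-1}
    labels = np.zeros(m, dtype=int)
    labels[order[np.concatenate([top, rest])]] = 1
    weight = Z * exp(-theta * k) / comb(m, r)                # P(pi)/Q(pi) = r!(m-r)! Z / {m! exp(theta K)}
    return labels, weight

# With theta = 0 the draw is a uniform relabelling and the weight is identically 1,
# recovering direct sampling from the null conditional distribution.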

The simulation example in § 4.2 also uses conditional inference, so the target and proposal can depend on the observed margins, r and c, of the matrix X. For the target distribution Pθ defined in (10), we create the proposal Qθ by modifying the importance sampling algorithm described in the author’s 2009 unpublished preprint, arXiv:0906.1004v1, which is designed for the uniform distribution, i.e., θ = 0. Specifically, we use equation (3.12a) of that preprint, and we include the additional weights {exp(θνij)λij}bi inside the product in equation (3.6) of that preprint when sampling from the jth column, where λ is a solution to the system of equations

\lambda_{ij} = (N - j) \Big/ \Bigl[ M \sum_{\ell=j+1}^{N} \exp(\theta \nu_{i\ell}) \Bigl\{ \sum_{k=1}^{M} \lambda_{kj}\, \exp(\theta \nu_{k\ell}) \Bigr\}^{-1} \Bigr] \qquad (i = 1, \ldots, M;\ j = 1, \ldots, N-1).

Proofs

We prove Theorem 1 with a succession of lemmas. The notation and conventions follow § 2 with the following additions. We use E to denote expectation. For any sequence y = (y0, …, yn) and any k ∊ {0, …, n}, let y^{(k)} = (yk, y1, …, yk−1, y0, yk+1, …, yn) denote the sequence obtained by swapping the 0th element and the kth element in the original sequence y. Let Π denote a random permutation chosen uniformly from 𝒨 and chosen independently of Z. Let U = Z(Π). If π ∊ 𝒨 is a fixed permutation, then Π and Π(π) have the same distribution, which means U and U(π) have the same distribution. In particular, U and U^{(k)} have the same distribution. Lemma A1 is the key algebraic inequality.

Lemma A1. For all t0, …, tn ∊ [−∞, ∞] and all α, w0, …, wn ∊ [0, ∞],

\sum_{k=0}^{n} w_k\, \mathbb{1}\Bigl\{\sum_{i=0}^{n} w_i\, \mathbb{1}(t_i \ge t_k) \le \alpha\Bigr\} \le \alpha.

Proof. Let H denote the left side of the desired inequality. We can assume that H > 0, since the statement is trivial otherwise. The pairs (t0, w0), …, (tn, wn) can be reordered without affecting the value of H, so we can assume that t0 ⩾ ⋯ ⩾ tn. This implies that ∑_{i=0}^{n} wi 𝟙(ti ⩾ tk) is increasing in k and that there exists a k*, defined as the largest k for which ∑_{i=0}^{n} wi 𝟙(ti ⩾ tk) ⩽ α. So H = ∑_{k=0}^{k*} wk = ∑_{i=0}^{k*} wi 𝟙(ti ⩾ tk*) ⩽ ∑_{i=0}^{n} wi 𝟙(ti ⩾ tk*) ⩽ α.

Lemma A2. For any nonnegative, measurable function f on Sn+1,

E\{f(Z)\} = E\Bigl\{(n+1)^{-1} \sum_{k=0}^{n} w(Y_k)\, f(Y^{(k)})\Bigr\}.

Proof. Recall that Zi = Yi ∼ Q for i ≠ 0 and that Z0 = X ∼ P. A change of variables from X to Y0 gives E{f(Z)} = E{w(Y0) f(Y)}. Since the distribution of Y is invariant to permutations, we have E{w(Y0) f(Y)} = E{w(Yk) f(Y^{(k)})} for each k = 0, …, n, which means that E{f(Z)} = E{w(Y0) f(Y)} = (n + 1)^{-1} ∑_{k=0}^{n} E{w(Yk) f(Y^{(k)})}. Moving the sum inside the expectation completes the proof.

Lemma A3. Let PZ and PU denote the distributions of Z and U, respectively, over Sn+1. Then PZ ≪ PU and

\frac{dP_Z}{dP_U}(u) = \frac{(n+1)\, w(u_0)}{\sum_{j=0}^{n} w(u_j)}

almost surely.

Proof. Let g be any nonnegative, measurable function on Sn+1. It is enough to show that

E\{g(Z)\} = E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, g(U)\Bigr\}. \qquad (A1)

For any π ∊ 𝒨, there is a unique inverse permutation π^{-1} ∊ 𝒨 with π_{π^{-1}_i} = π^{-1}_{π_i} = i for each i = 0, …, n. Comparing with the proof of Lemma A2, for any nonnegative, measurable function f, we have

E\{f(Z(\pi))\} = E\{f(Y(\pi))\, w(Y_0)\} = E\{f(Y)\, w(Y_{\pi^{-1}_0})\}, \qquad (A2)

where the first equality is a change of variables from Z to Y and the second equality follows from the fact that the distribution of Y is permutation invariant, so, in particular, Y and Y(π−1) have the same distribution. Using (A2) gives

E\{f(U)\} = \frac{1}{(n+1)!} \sum_{\pi \in \mathcal{M}} E\{f(U) \mid \Pi = \pi\}
= \frac{1}{(n+1)!} \sum_{\pi \in \mathcal{M}} E\{f(Z(\pi))\}
= \frac{1}{(n+1)!} \sum_{\pi \in \mathcal{M}} E\{f(Y)\, w(Y_{\pi^{-1}_0})\}
= \frac{1}{(n+1)!} \sum_{j=0}^{n} \sum_{\pi \in \mathcal{M} :\, \pi^{-1}_0 = j} E\{f(Y)\, w(Y_j)\}
= \frac{1}{(n+1)!} \sum_{j=0}^{n} n!\, E\{f(Y)\, w(Y_j)\}
= E\Bigl\{\frac{\sum_{j=0}^{n} w(Y_j)}{n+1}\, f(Y)\Bigr\}. \qquad (A3)

Applying (A3) to the function f(u) = (n + 1) g(u) w(u0)/∑_{j=0}^{n} w(uj) gives

E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, g(U)\Bigr\}
= E\Bigl\{\frac{\sum_{j=0}^{n} w(Y_j)}{n+1}\, \frac{(n+1)\, w(Y_0)}{\sum_{j=0}^{n} w(Y_j)}\, g(Y)\Bigr\}
= E\{w(Y_0)\, g(Y)\} = E\{g(Z)\},

where the last equality is a change of variables as in (A2). This gives (A1) and completes the proof.

Lemma A4. For any nonnegative, measurable function f on Sn+1,

E\{f(Z)\} = E\Bigl\{\sum_{k=0}^{n} \frac{w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, f(U^{(k)})\Bigr\}.

Proof. Changing variables from Z to U and using Lemma A3 gives

E\{f(Z)\} = E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, f(U)\Bigr\}. \qquad (A4)

Since the distribution of U is invariant to permutations, we have

E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, f(U)\Bigr\} = E\Bigl\{\frac{(n+1)\, w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, f(U^{(k)})\Bigr\} \qquad (k = 0, \ldots, n). \qquad (A5)

Combining (A4) and (A5) and averaging over k gives

E\{f(Z)\} = E\Bigl\{\frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, f(U)\Bigr\} = \frac{1}{n+1} \sum_{k=0}^{n} E\Bigl\{\frac{(n+1)\, w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, f(U^{(k)})\Bigr\}.

Moving the sum inside the expectation and cancelling the (n + 1) terms completes the proof.

Proof of Theorem 1 for c = 1. Applying Lemma A2 to the function f(z) = 𝟙{p̃1(z) ⩽ α} gives

\mathrm{pr}\{\tilde{p}_1(Z) \le \alpha\} = E[\mathbb{1}\{\tilde{p}_1(Z) \le \alpha\}]
= E\Bigl[(n+1)^{-1} \sum_{k=0}^{n} w(Y_k)\, \mathbb{1}\{\tilde{p}_1(Y^{(k)}) \le \alpha\}\Bigr]
= E\Bigl((n+1)^{-1} \sum_{k=0}^{n} w(Y_k)\, \mathbb{1}\Bigl[(n+1)^{-1} \sum_{i=0}^{n} w(Y_i)\, \mathbb{1}\{t(Y_i, Y) \ge t(Y_k, Y)\} \le \alpha\Bigr]\Bigr).

The quantity inside the final expectation is always at most α, which follows from Lemma A1 by taking tℓ = t(Yℓ, Y) and wℓ = w(Yℓ)/(n + 1) for each ℓ = 0, …, n.

Proof of Theorem 1 for c = 2. Applying Lemma A4 to the function f(z) = 𝟙{p̃2(z) ⩽ α} gives

\mathrm{pr}\{\tilde{p}_2(Z) \le \alpha\} = E[\mathbb{1}\{\tilde{p}_2(Z) \le \alpha\}]
= E\Bigl[\sum_{k=0}^{n} \frac{w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, \mathbb{1}\{\tilde{p}_2(U^{(k)}) \le \alpha\}\Bigr]
= E\Bigl(\sum_{k=0}^{n} \frac{w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, \mathbb{1}\Bigl[\sum_{i=0}^{n} \frac{w(U_i)}{\sum_{j=0}^{n} w(U_j)}\, \mathbb{1}\{t(U_i, U) \ge t(U_k, U)\} \le \alpha\Bigr]\Bigr).

The quantity inside the final expectation is always at most α, which follows from Lemma A1 by taking tℓ = t(Uℓ, U) and wℓ = w(Uℓ)/{∑_{j=0}^{n} w(Uj)} for each ℓ = 0, …, n.

References

1. Besag J, Clifford P. Generalized Monte Carlo significance tests. Biometrika. 1989;76:633–42.
2. Bolviken E, Skovlund E. Confidence intervals from Monte Carlo tests. J. Am. Statist. Assoc. 1996;91:1071–8.
3. Brazzale AR. hoa: an R package bundle for higher order likelihood inference. R News. 2005;5:20–7. ISSN 1609-3631, ftp://cran.r-project.org/doc/Rnews/Rnews_2005-1.pdf.
4. Brazzale AR, Davison AC. Accurate parametric inference for small samples. Statist. Sci. 2008;23:465–84.
5. Cox DR. The regression analysis of binary sequences. J. R. Statist. Soc. B. 1958;20:215–42.
6. Cytel. LogXact 9. Cambridge, MA: Cytel Inc.; 2010.
7. Davison AC, Hinkley DV. Bootstrap Methods and Their Application. Cambridge, UK: Cambridge University Press; 1997.
8. Garthwaite PH, Buckland ST. Generating Monte Carlo confidence intervals by the Robbins–Monro process. J. R. Statist. Soc. C. 1992;41:159–71.
9. Green PJ. Discussion of the paper by Geyer and Thompson. J. R. Statist. Soc. B. 1992;54:683–4.
10. Hesterberg T. Weighted average importance sampling and defensive mixture distributions. Technometrics. 1995;37:185–94.
11. Lehmann EL, Romano JP. Testing Statistical Hypotheses. 3rd edn. New York: Springer; 2005.
12. Liu JS. Monte Carlo Strategies in Scientific Computing. New York: Springer; 2001.
13. Mehta CR, Patel NR. Exact logistic regression: theory and examples. Statist. Med. 1995;14:2143–60. doi: 10.1002/sim.4780141908.
14. Rasch G. On general laws and the meaning of measurement in psychology. In: Neyman J, editor. Proc. 4th Berkeley Symp. Math. Statist. Prob. Vol. 4. Berkeley, CA: University of California Press; 1961. pp. 321–34.
15. StataCorp. Stata Statistical Software: Release 11. College Station, TX: StataCorp LP; 2009.
16. Westfall PH, Young SS. Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. New York: Wiley-Interscience; 1993.
17. Zamar D, McNeney B, Graham J. elrm: software implementing exact-like inference for logistic regression models. J. Statist. Software. 2007;21:1–18.
