Abstract
We provide a simple and accurate approximation to the power of the unconditional test for two correlated binary variables. Suissa and Shuster (1991) described the exact unconditional test. The most commonly used statistical test in this setting, McNemar’s test, is exact conditional on the sum of the discordant pairs. Although asymptotically the conditional and unconditional versions coincide, a long-standing debate surrounds the choice between them. Several power approximations have been studied for both methods (Miettinen, 1968; Bennett and Underwood, 1970; Connett, Smith, and McHugh, 1987; Connor, 1987; Suissa and Shuster, 1991; Lachenbruch, 1992; Lachin, 1992). For the unconditional approach most existing power approximations use the Gaussian distribution, while the accurate (“exact”) method is computationally burdensome.
A new approximation uses the F statistic corresponding to a paired-data T test computed from the difference scores of the binary outcomes. Enumeration of all possible 2 × 2 tables for small sample sizes allowed evaluation of both test size and power. The new approximation compares favorably to others due to the combination of ease of use and accuracy.
Keywords: 2 × 2 table, McNemar’s Test
1. INTRODUCTION
1.1 Motivation
In clinical trials one often faces the question of whether a binomial probability has changed due to treatment. If two binomial samples represent repeated measures then the resulting binomials are correlated. This situation is often referred to as the unconditional case in that neither row nor column marginal frequencies of the corresponding 2 × 2 table are fixed. Let N indicate the total number of pairs of observations. Table I summarizes notation for outcome frequencies, while Table II summarizes notation for outcome probabilities. Assuming ε(c/N) = π10 and ε(b/N) = π01 leads to testing H0: π10 = π01. Suissa and Shuster (1991) described how to compute p-values and power exactly for the unconditional test for two correlated binary variables.
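TABLE I.
Outcome Counts
| Pair type | (Outcome 1, Outcome 2) | Count |
|---|---|---|
| Discordant | (1, 0) | c |
| Discordant | (0, 1) | b |
| Concordant | (1, 1) or (0, 0) | a + d |
| Total | | N |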
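TABLE II.
Outcome Probabilities
| | Outcome 2 = 1 | Outcome 2 = 0 | Total |
|---|---|---|---|
| Outcome 1 = 1 | π11 | π10 | π1· |
| Outcome 1 = 0 | π01 | π00 | π0· |
| Total | π·1 | π·0 | 1 |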
Most of the work with 2 × 2 tables of correlated pairs has depended on assuming fixed marginal counts, and hence has been referred to as a conditional approach. As with the unconditional approach one tests H0: π10 = π01, but under the assumptions that ε[c/(b + c)] = π10 and ε[b/(b + c)] = π01.
A long and sometimes rancorous debate has surrounded the choice of analysis for 2 × 2 tables. The conditional approach requires only the counts of the discordant pairs, which allows two or more distinct configurations to provide the same statistic and same rejection regions. For example, should we treat two tables equivalently that differ only in total sample size? We believe that the debate reduces to a choice of assumptions, and that the choice should match the test to the sampling scheme used in the study. As Dozier and Muller stated (1993) “Conditional tests arise from defining the sample space in terms of the data being analyzed, while unconditional tests arise from defining the sample space in terms of hypothetical replicates of the experiment that generated the data being analyzed.”
We seek a simple and accurate method for power analysis with the unconditional approach. Our interest arises from two sources. First, the approach appears appropriate for a wide range of data, including many applications in medical and behavioral science. Second, the extensive calculations needed for “exact” computations discourage more widespread use of the method.
In the unconditional case, we follow the lead of Suissa and Shuster (1991) and consider the McNemar statistic, which is the exact statistic for the conditional case. The difference between the conditional and unconditional settings lies in the description of the associated probabilities, and the corresponding computational difficulties. In the noncentral case the complexity arises due to the need to find the optimum of a function, with each value depending on a significant enumeration. See Suissa and Shuster (1991) for details.
1.2 Specification of the Problem
For correlated binary outcomes one usually seeks to test the difference between two proportions, each proportion representing the probability for which the matched pairs disagree. Often the two dichotomous outcomes differ only by recording time, such as pre- and post-treatment measurements. The hypothesis of interest centers on the probabilities of discordant pairs. Under H0, ε(c/N) = π10 and ε(b/N) = π01. Notation in Tables I and II allows stating H0: π10 − π01 = δ = 0, or H0: π1· − π·1 = δ = 0. The second form arises because π1· = π10 + π11 and π·1 = π01 + π11.
1.3 Related Work
McNemar’s test, the exact conditional test for binary correlated pairs, converges asymptotically to the χ² test. The exact statistic depends only on the discordant pairs: Qm = (b − c)²/(b + c). The conditional method requires a larger sample size to achieve a fixed power for a fixed difference than does the unconditional method. Hence applying the conditional method to the unconditional setting yields a conservative test.
Suissa and Shuster (1985) studied the test size and power of unconditional tests. They suggested applying a maximization method to a conditional test to provide a least conservative test. Maximizing the null power function over the domain of a nuisance parameter gives the worst possible configuration, and yields a test which is never liberal. Suissa and Shuster (1991) supported Frisen’s (1980) recommendation that the exact unconditional test be used, based on some unappealing properties of power for the exact conditional test. Frisen noted that under the conditional assumptions, the null hypothesis can be stated in terms of equivalent marginal probabilities (H0: π1· = π·1) or diagonals (H0: π10 = π01). However, regarding power, “… influence of π1·, π·1 under H0 on conditional power indicates that conditional power is not a suitable measure.”
Most existing power approximations for the unconditional approach depend on the Gaussian distribution (Connett, Smith, and McHugh, 1987; Connor, 1987; Lachenbruch, 1992; Lachin, 1992; Miettinen, 1968; Suissa and Shuster, 1991). Bennett and Underwood (1970) used the χ² goodness-of-fit statistic in terms of the multinomial likelihood. In addition, many unconditional sample size approximations are based on the conditional distribution under the null, and the unconditional under the alternative (Bennett and Underwood, 1970; Connett, Smith, and McHugh, 1987; Connor, 1987; Lachenbruch, 1992; Lachin, 1992; Miettinen, 1968). In contrast, both our approach (described in §2) and that of Suissa and Shuster (1991) use the unconditional distribution for both cases.
Suissa and Shuster (1991) described the exact unconditional test and computed p-values and power. As typically happens when starting with discrete random variables, the “exact” test merely guarantees test size no higher than the desired level. The test usually does not reach the target test size of exactly α. Achieving a test with size as close to α as possible requires substantial computations. Naturally, power computations increase the burden. Hence power approximations have great appeal. The power of the best asymptotic method examined by Suissa and Shuster (1991) fluctuated as much as 14% in either direction from their computed “exact” power.
A similar problem occurs in comparing two independent binomial variables. The same conditional/unconditional distinction holds, and the same computational complexity arises. D’Agostino, Chase, and Belanger (1988) demonstrated that computing a T test on the outcomes coded as 1’s and 0’s leads to a very accurate approximation of the unconditional test in small samples. Dozier and Muller (1993) extended the results to the noncentral case by demonstrating similar excellent performance for power approximation. Asymptotically the exact test and approximations converge to the same test. In the same spirit, Lachenbruch (1992) mentioned using a paired-data T for testing correlated binomials.
2. A NEW METHOD
2.1 A New Approach for Power
We suggest approximating power of the unconditional test of correlated binary outcomes by using an appropriate paired-data T test. We do not recommend analyzing data in this way. The approach parallels that of Dozier and Muller (1993), and has the same two part motivation.
First, consider the contrast between using a Z test and a T test for the hypothesis of equality of Gaussian means. Asymptotically the two coincide. The T uses an additional parameter, the error degrees of freedom, to account for the varying impact of a finite sample. For any particular design, the critical value for the T will always be larger than for the Z test. In turn, the power of the T will never be more than for the Z. Hence using a T rather than a Z will lead to a more conservative choice for sample size. In most applications a method with modest conservatism and rare optimism would be preferred over a method that balances optimistic and pessimistic values.
Second, the approximation suggested here shares the same desirable asymptotic features as existing methods. See Suissa and Shuster (1991) for a discussion of asymptotic properties of McNemar’s test, the unconditional test, and approximations. Standard arguments about multivariate linear models with independent and identically distributed observation vectors apply. The central limit theorem applies to the proportions interpreted as the sample means. Consideration of the alternative hypothesis involves examining a sequence of local alternatives (Sen and Singer, 1993, p. 238). The availability of simple asymptotically accurate approximations of power of the unconditional test leads to examining only small sample performance.
2.2 Calculation of the Test Statistic
Additional notation allows writing simple expressions for the statistic of interest. Let Eij ∈ {0, 1} indicate the outcome for the ith subject and jth outcome. Define Di = Ei1 − Ei2, with Di ∈ {−1, 0, 1}. N1 = c equals the number of positive discordant pairs and Pr{Di = 1} = Pr{(Ei1 = 1) ∩ (Ei2 = 0)}. N−1 = b equals the number of negative discordant pairs, with Pr{Di = −1} = Pr{(Ei1 = 0) ∩ (Ei2 = 1)}. N0 = a + d equals the number of concordant pairs, with Pr{Di = 0} = Pr{[(Ei1 = 1) ∩ (Ei2 = 1)] ∪ [(Ei1 = 0) ∩ (Ei2 = 0)]}. We need not distinguish between a and d. Note that N1 + N−1 equals the number of discordant pairs and N1 + N−1 + N0 = N, the total number of pairs. Also note the following expressions for the sample mean and variance of D:
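$$\bar{D} = \frac{1}{N}\sum_{i=1}^{N} D_i = \frac{N_1 - N_{-1}}{N} \qquad (2.1)$$
and
$$s_D^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(D_i - \bar{D}\right)^2 = \frac{(N_1 + N_{-1}) - N\bar{D}^2}{N-1} \qquad (2.2)$$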
The new approach for a power approximation corresponds to the simple notion of performing a paired-data T test of the difference in outcome variables coded as 1’s and 0’s. It will be convenient to describe the test in the equivalent form of a one-sample F test of mean difference. If $s_D^2 > 0$, then express the observed statistic of interest, corresponding to the usual least squares and Gaussian theory test, as
$$F_{obs} = \frac{N\bar{D}^2}{s_D^2} \qquad (2.3)$$
Only two special cases lead to $s_D^2 = 0$. The case with a + d = N (and b = c = 0) yields no discordant pairs and $\bar{D} = 0$. Set Fobs = 0 with p-value of 1, and do not reject the null hypothesis. The case with one discordant cell count of N (either b = N or c = N) yields all discordant pairs. Set Fobs = ∞, with p-value of zero, and reject the null hypothesis.
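The calculation in this subsection can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' program: the function names `paired_f_statistic` and `mcnemar_q` are hypothetical, and the sample variance is assumed to use the usual N − 1 divisor of a paired-data T test. The sketch also handles the two zero-variance special cases as described above.

```python
import math


def paired_f_statistic(b, c, n):
    """One-sample F statistic for difference scores D_i in {-1, 0, 1}.

    b = number of pairs with D_i = -1, c = number with D_i = +1,
    n = total number of pairs (concordant pairs contribute D_i = 0).
    Assumes the sample variance uses the N - 1 divisor.
    """
    if b < 0 or c < 0 or b + c > n or n < 2:
        raise ValueError("need b >= 0, c >= 0, b + c <= n and n >= 2")
    if b + c == 0:
        return 0.0          # no discordant pairs: do not reject
    if b == n or c == n:
        return math.inf     # all pairs discordant in one direction: reject
    d_bar = (c - b) / n                          # sample mean of D
    s2 = ((b + c) - n * d_bar ** 2) / (n - 1)    # sample variance of D
    return n * d_bar ** 2 / s2


def mcnemar_q(b, c):
    """McNemar statistic Qm = (b - c)^2 / (b + c); requires b + c > 0."""
    return (b - c) ** 2 / (b + c)


if __name__ == "__main__":
    b, c, n = 2, 8, 20
    f_obs = paired_f_statistic(b, c, n)
    q = mcnemar_q(b, c)
    # Under the N - 1 divisor, algebra gives F = (N - 1) Qm / (N - Qm).
    print(f_obs, (n - 1) * q / (n - q))
```

Under the stated assumption, the F statistic is a monotone function of the McNemar statistic for fixed N, so the two statistics order the tables identically and differ only in the reference distribution used for the critical value.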
2.3 Approximating Power
The test just described allows approximating power very easily. With δ = π10 − π01 and ψ = π10 + π01 define
$$\omega = \frac{N\delta^2}{\psi - \delta^2} \qquad (2.4)$$
Let $F_F(f; \nu_1, \nu_2, \omega)$ indicate the cumulative distribution function of a noncentral F random variable with degrees of freedom ν1 for the numerator, ν2 for the denominator, and noncentrality ω. In turn let $F_F^{-1}(1 - \alpha; \nu_1, \nu_2, 0) = f_{crit}$ indicate the (1 − α) central quantile. Approximate power of a two-sided test with
$$\text{Power} \approx 1 - F_F(f_{crit}; 1, N - 1, \omega) \qquad (2.5)$$
For a one-sided test replace (1 − α) by (1 − 2α) in calculating fcrit.
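For readers without access to a noncentral F routine, the approximation can be sketched with the Python standard library alone. The sketch below is illustrative and rests on stated assumptions: the noncentrality is taken as ω = Nδ²/(ψ − δ²), the noncentral F distribution function is computed as a Poisson mixture of regularized incomplete beta functions, and the central quantile is found by bisection. All function names (`reg_inc_beta`, `ncf_cdf`, `central_f_quantile`, `approx_power`, `min_pairs`) are hypothetical.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        steps = (m * (b - m) * x / ((qam + m2) * (a + m2)),
                 -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2)))
        for aa in steps:
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            step = d * c
            h *= step
        if abs(step - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def ncf_cdf(f, df1, df2, omega=0.0):
    """Noncentral F CDF as a Poisson(omega / 2) mixture of incomplete betas."""
    if f <= 0.0:
        return 0.0
    x = df1 * f / (df1 * f + df2)
    lam = omega / 2.0
    weight, total, j = math.exp(-lam), 0.0, 0
    while True:
        total += weight * reg_inc_beta(df1 / 2.0 + j, df2 / 2.0, x)
        j += 1
        weight *= lam / j
        if (j > lam and weight < 1e-16) or j > 2000:
            return total


def central_f_quantile(p, df1, df2):
    """Central F quantile by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while ncf_cdf(hi, df1, df2) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if ncf_cdf(mid, df1, df2) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def approx_power(n, delta, psi, alpha, one_sided=True):
    """F approximation to power; one-sided tests use the 1 - 2*alpha quantile."""
    omega = n * delta ** 2 / (psi - delta ** 2)
    level = 1.0 - (2.0 * alpha if one_sided else alpha)
    fcrit = central_f_quantile(level, 1, n - 1)
    return 1.0 - ncf_cdf(fcrit, 1, n - 1, omega)


def min_pairs(delta, psi, alpha, target=0.80, max_n=200):
    """Smallest N in 2..max_n with approximate power >= target, else None."""
    for n in range(2, max_n + 1):
        if approx_power(n, delta, psi, alpha) >= target:
            return n
    return None


if __name__ == "__main__":
    # Inputs of the worked SAS example in Section 4.2 (one-sided test).
    print(round(approx_power(91, 0.20, 0.45, 0.05), 4))
```

With the inputs of the Section 4.2 example (δ = 0.20, ψ = 0.45, α = 0.05, N = 91, one-sided), this sketch should agree with the values reported there up to rounding. The linear search in `min_pairs` is a convenience for small problems; it is not claimed to reproduce any tabled sample size.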
3. ENUMERATION STUDIES
3.1 Enumeration Methods
Here we describe the enumeration of small sample behavior of the proposed statistic under the null and alternative hypothesis. Enumeration was preferred to simulation because it produces exact results. We calculated probabilities based on a trinomial distribution, for every possible 2 × 2 configuration given a number of total pairs (N), for a range of π10 and π01, for both the null and non-null cases.
Using a one-sided test leads to observing only whether N1 ≥ N−1. Critical values for the nominal α from the F distribution were used to evaluate each configuration for significance. The probabilities of the significant configurations were summed to give the attained test size (Table III) and sample size approximations (Table IV). Write the probability of a particular configuration as
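$$\Pr\{N_1 = n_1, N_{-1} = n_{-1}, N_0 = n_0\} = \frac{N!}{n_1!\, n_{-1}!\, n_0!}\, \pi_{10}^{\,n_1}\, \pi_{01}^{\,n_{-1}}\, (1 - \pi_{10} - \pi_{01})^{n_0} \qquad (3.1)$$
The special case of the null hypothesis has π10 = π01 = π and reduces (3.1) to
$$\Pr\{N_1 = n_1, N_{-1} = n_{-1}, N_0 = n_0\} = \frac{N!}{n_1!\, n_{-1}!\, n_0!}\, \pi^{\,n_1 + n_{-1}}\, (1 - 2\pi)^{n_0} \qquad (3.2)$$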
Note that when one assumes the null hypothesis is true, the distribution is symmetric. Send e-mail to the first author (GSelicat@Quintiles.Com) for a copy of the SAS® (version 6.08) program used for the enumerations.
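The enumeration described above can be sketched as follows. This is a minimal illustration, not the authors' SAS program: the function names are hypothetical, the variance is assumed to use the N − 1 divisor, a table is counted as significant when N1 > N−1 and the F statistic is at least the critical value, and the critical value is supplied by the caller (here as the square of a T quantile taken from a standard table) rather than computed.

```python
import math


def f_statistic(n_pos, n_neg, n):
    """F statistic of the difference scores; n_pos = N1, n_neg = N-1."""
    if n_pos + n_neg == 0:
        return 0.0
    if n_pos == n or n_neg == n:
        return math.inf
    d_bar = (n_pos - n_neg) / n
    s2 = ((n_pos + n_neg) - n * d_bar ** 2) / (n - 1)
    return n * d_bar ** 2 / s2


def config_probability(n_pos, n_neg, n, pi10, pi01):
    """Trinomial probability of one (N1, N-1, N0) configuration."""
    n_zero = n - n_pos - n_neg
    coef = math.factorial(n) // (
        math.factorial(n_pos) * math.factorial(n_neg) * math.factorial(n_zero)
    )
    return coef * pi10 ** n_pos * pi01 ** n_neg * (1.0 - pi10 - pi01) ** n_zero


def rejection_probability(n, pi10, pi01, fcrit):
    """Sum the probabilities of all one-sided significant configurations.

    With pi10 == pi01 this is the attained test size; otherwise it is the
    enumerated power of the F test at that alternative.
    """
    total = 0.0
    for n_pos in range(n + 1):
        for n_neg in range(n + 1 - n_pos):
            if n_pos > n_neg and f_statistic(n_pos, n_neg, n) >= fcrit:
                total += config_probability(n_pos, n_neg, n, pi10, pi01)
    return total


if __name__ == "__main__":
    # One-sided alpha = .05 with N = 10: critical value is the square of the
    # .95 quantile of T with 9 degrees of freedom (1.8331 from a T table).
    fcrit = 1.8331 ** 2
    print(round(rejection_probability(10, 0.463, 0.463, fcrit), 4))
```

Taking the supremum over π then amounts to repeating the call over a grid of null values. For N = 10 the double loop visits only 66 configurations, which is why enumeration is practical for the small sample sizes studied here.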
Table III.
Maximum Attained Test Size of the F Test Compared to “Exact” Test Actual Size from Suissa and Shuster (1991)
| N | π (α = .01) | F (α = .01) | “Exact” (α = .01) | π (α = .025) | F (α = .025) | “Exact” (α = .025) | π (α = .05) | F (α = .05) | “Exact” (α = .05) |
|---|---|---|---|---|---|---|---|---|---|
| 10 | .292 | .0132 | .0099 | .241 | .0265 | .0208 | .463 | .0652 | .0265 |
| 20 | .314 | .0119 | .0071 | .347 | .0287 | .0246 | .498 | .0557 | .0396 |
| 40 | .161 | .0116 | .0080 | .030 | .0269 | .0250 | .127 | .0527 | .0500 |
| 80 | .081 | .0115 | .0094 | .151 | .0267 | .0234 | .064 | .0522 | .0499 |
Table IV.
Minimum Sample Sizes Needed for Power of .80
| δ | ψ | NS (α = .01) | NF (α = .01) | NS (α = .025) | NF (α = .025) | NS (α = .05) | NF (α = .05) |
|---|---|---|---|---|---|---|---|
| 0.10 | 0.15 | 142 | 144 | 107 | 112 | 86 | 88 |
| | 0.20 | 196 | 194 | 152 | 152 | 118 | 119 |
| | 0.30 | | | | | 185 | 181 |
| 0.20 | 0.25 | 57 | 56 | 42 | 44 | 36 | 34 |
| | 0.30 | 69 | 68 | 53 | 53 | 43 | 42 |
| | 0.40 | 98 | 94 | 76 | 73 | 63 | 58 |
| | 0.50 | 126 | 119 | 97 | 93 | 76 | 73 |
| | 0.60 | 151 | 144 | 118 | 112 | 93 | 88 |
| | 0.70 | 177 | 169 | 136 | 132 | 108 | 104 |
| | 0.80 | | 194 | 159 | 152 | 124 | 119 |
| | 0.90 | | | 179 | 171 | 139 | 135 |
| 0.30 | 0.35 | 34 | 32 | 26 | 25 | 21 | 20 |
| | 0.40 | 41 | 38 | 31 | 30 | 26 | 23 |
| | 0.50 | 53 | 49 | 42 | 38 | 34 | 30 |
| | 0.60 | 65 | 60 | 51 | 47 | 40 | 37 |
| | 0.70 | 76 | 71 | 60 | 56 | 49 | 44 |
| | 0.80 | 89 | 82 | 69 | 64 | 55 | 51 |
| | 0.90 | 101 | 94 | 78 | 73 | 65 | 58 |
| 0.40 | 0.45 | 23 | 21 | 18 | 17 | 15 | 13 |
| | 0.50 | 28 | 25 | 21 | 19 | 17 | 15 |
| | 0.60 | 36 | 31 | 26 | 24 | 23 | 19 |
| | 0.70 | 41 | 37 | 32 | 29 | 27 | 23 |
| | 0.80 | 49 | 43 | 38 | 34 | 32 | 27 |
| | 0.90 | 54 | 50 | 43 | 39 | 35 | 30 |
| 0.50 | 0.55 | 17 | 15 | 15 | 12 | 11 | 9 |
| | 0.60 | 21 | 17 | 16 | 14 | 13 | 11 |
| | 0.70 | 25 | 21 | 20 | 17 | 17 | 13 |
| | 0.80 | 30 | 25 | 23 | 20 | 19 | 16 |
| | 0.90 | 35 | 29 | 27 | 23 | 22 | 18 |
| 0.60 | 0.65 | 14 | 11 | 11 | 9 | 10 | 7 |
| | 0.70 | 15 | 13 | 12 | 10 | 11 | 8 |
| | 0.80 | 20 | 16 | 15 | 12 | 13 | 10 |
| | 0.90 | 23 | 18 | 19 | 14 | 16 | 11 |
3.2 Null Case Results
The test size attained by using the approximate unconditional statistic was compared with Suissa and Shuster’s (1991) results. Table III contains results for a one-sided test with α ∈ {.01, .025, .05}. Since symmetry holds under the null case, a two-sided test with these values can be applied for a test with α ∈ {0.02, .05, .10}. Values of π very near .50 represent “the highly unlikely and practically impossible scenario of the most negative correlation between [the two outcomes]” (Suissa and Shuster, 1991). Table III contains the attained test size with a supremum of π selected from the interval (0, .995), with a precision of .001, the same method as Suissa and Shuster (1991). The columns labeled F and “Exact” contain results for the approximation proposed here and for the test described by Suissa and Shuster. Table III also contains the π at which the supremum occurred. For computational convenience, .498 was used as the π upper limit rather than the .4975 as used by Suissa and Shuster. The attained test size of the approximate test was a bit liberal, relative to the target α. Some liberality remains even with N = 80. The discrete nature of the data leads to the “exact” test usually being somewhat conservative. The test sizes of both tests grow closer to the nominal value as N increases.
3.3 Alternative Case Results
Table IV contains approximate sample sizes for the minimal number of pairs needed to attain power of at least .80. For a one-sided test, a nominal significance level of .01, .025, or .05 was used. The column labeled NF contains sample sizes suggested by the approximate F approach, while the column labeled NS contains sample sizes for the exact unconditional approach (Suissa and Shuster, 1991). Missing sample sizes have N > 200. Table IV results have NF ≤ NS (with the exception of some results for ψ < 0.25), which agrees with the test size results in Table III. The approximation usually provides a sample size smaller than NS, by 2–3 units on average. The new approximation appears to have less optimism than the suggestion of Miettinen (1968), and fluctuated less from the exact value than the approximation studied by Connett, Smith, and McHugh (1987) and Connor (1987). The exact conditional sample size proves very conservative as an approximation for the unconditional case (Suissa and Shuster, 1991). Overall, the accuracy of the power approximation improves with sample size and as power increases from .80 to .90 (results were computed but not tabled for .90).
4. DISCUSSION
4.1 Discussion of Results
Overall we saw a maximum of 8 and an average of 2–3 units of optimism in sample size approximations, as compared with the exact unconditional method (Suissa and Shuster, 1991). Hence we recommend merely increasing the approximate sample size by 3 units. The power optimism stems from the test size optimism. Consequently another approach would be to reduce the nominal test size, such as by using α · [1 − (2/N)]. Further research would be needed to develop and evaluate any such modification.
The algorithm used for the enumeration studies could be used to compute an exact version of the approximate test described here. The computational burden would be modest for current desktop personal computers. The test would be exact in a similar sense as the test described by Suissa and Shuster: test size would be guaranteed to be no more than the nominal size, although usually less. Perhaps more importantly, the approach might be generalized to allow two or more groups with two correlated binary responses, or three or more repeated binary response measures. Note that Agresti (1991) reported a method due to O’Brien for approximating power for general categorical data models.
4.2 Using the Power Approximation
The SAS® code below uses the F statistic to find approximate power.
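DATA POWER;
DELTA = 0.20; /* pi10 - pi01 */
PSI = 0.45; /* pi10 + pi01 */
ALPHA = 0.05;
N = 91; /* total number of pairs */
FCRIT = FINV(1-2*ALPHA, 1, N-1); /* one-sided critical value */
OMEGA = (N*(DELTA**2))/(PSI - DELTA**2); /* noncentrality */
POWER = 1 - PROBF(FCRIT, 1, N-1, OMEGA);
PUT FCRIT= OMEGA= POWER=; /* write results to the log */
RUN;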
The code assumes a one-sided test. For a two-sided test replace “2*ALPHA” with “ALPHA”. The example code gives FCRIT = 2.7621, OMEGA = 8.8781, and POWER = .9053. The computational efficiency of the approximation allows conveniently examining a wide range of scenarios. For example, a plot of power for a range of alternatives provides a very informative display and one extremely well received by scientists.
4.3 Conclusions and Recommendations
When demanded by the sampling situation, we suggest testing the hypothesis for correlated binary pairs with the exact unconditional test. The F approximation described here provides a convenient and reasonably accurate method of approximating the corresponding power, with adjustments as noted above.
Acknowledgments
An earlier version of this paper was submitted by the first author in partial fulfillment of the requirements for the M. S. degree in Biostatistics. Muller’s work was supported in part by NCI program project grant P01 CA47 982-04, NIH Clinical Research Center grant M01 RR000-46-33, and NIEHS grant N01-ES-35356. The authors gratefully acknowledge helpful comments on an earlier draft of this paper by Lisa M. LaVange and Susan Kenny.
Contributor Information
Grace R. Selicato, Statistical Programming Systems, Quintiles, Inc., PO Box 13979, RTP, North Carolina, 27709-3979
Keith E. Muller, Dept. of Biostatistics, CB#7400, University of North Carolina, Chapel Hill, North Carolina, 27599-7400
BIBLIOGRAPHY
- Agresti A. Categorical Data Analysis. New York: Wiley; 1991.
- Bennett BM, Underwood RE. On McNemar’s test for the 2 × 2 table and its power function. Biometrics. 1970;26:339–343.
- Connett JE, Smith JA, McHugh RB. Sample size and power for pair-matched case-control studies. Statistics in Medicine. 1987;6:53–59. doi: 10.1002/sim.4780060107.
- Connor RJ. Size for testing differences in proportions for the paired sample design. Biometrics. 1987;43:207–211.
- D’Agostino RB, Chase W, Belanger A. The appropriateness of some common procedures for testing the equality of two independent binomial populations. The American Statistician. 1988;42:198–202.
- Dozier WG, Muller KE. Small-sample power of uncorrected and Satterthwaite corrected t tests for comparing binomial proportions. Communications in Statistics: Simulation and Computation. 1993;22:245–264.
- Frisen M. Consequences of the use of conditional inference in the analysis of a correlated contingency table. Biometrika. 1980;67:23–30.
- Lachenbruch PA. On the sample size for studies based upon McNemar’s test. Statistics in Medicine. 1992;11:1521–1525. doi: 10.1002/sim.4780111110.
- Lachin JM. Power and sample size evaluation for the McNemar test with application to matched case-control studies. Statistics in Medicine. 1992;11:1239–1251. doi: 10.1002/sim.4780110909.
- Miettinen OS. The matched-pairs design in the case of all-or-none responses. Biometrics. 1968;24:339–352.
- Suissa S, Shuster JJ. Exact unconditional sample sizes for the 2 × 2 binomial trial. Journal of the Royal Statistical Society, Series A. 1985;148:317–327.
- Suissa S, Shuster JJ. The 2 × 2 matched pairs trial: exact unconditional design and analysis. Biometrics. 1991;47:361–372.
- Sen PK, Singer JM. Large Sample Methods in Statistics: An Introduction with Applications. New York: Chapman and Hall; 1993.
