Abstract
We consider the non-inferiority (or equivalence) test of the odds ratio (OR) in a crossover study with binary outcomes to evaluate the treatment effects of two drugs. To solve this problem, Lui and Chang (2011) proposed both an asymptotic method and a conditional method based on a random effects logit model. Kenward and Jones (1987) proposed a likelihood ratio test (LRTM) based on a log linear model. These existing methods are all subject to model misspecification. In this paper, we propose a likelihood ratio test (LRT) and a score test that are independent of model specification. Monte Carlo simulation studies show that, in scenarios considered in this paper, both the LRT and the score test have higher power than the asymptotic and conditional methods for the non-inferiority test; the LRT, score and asymptotic methods have similar power and they all have higher power than the conditional method for the equivalence test. When data can be well described by a log linear model, the LRTM has the highest power among all the five methods (LRTM, LRT, score, asymptotic and conditional) for both non-inferiority and equivalence tests. However, in scenarios for which a log linear model does not describe the data well, the LRTM has the lowest power for the non-inferiority test and has inflated type I error rates for the equivalence test. We provide an example from a clinical trial that illustrates our methods.
Keywords: crossover study, equivalence test, likelihood ratio test, non-inferiority test, score test
1. Introduction
The crossover study has a long history in clinical trials and has been widely used to compare the effects of a new treatment and an existing treatment, particularly for relatively stable chronic diseases such as asthma and hypertension. Unlike a conventional parallel-group trial, in a crossover study each patient serves as his or her own control. Thus, crossover studies avoid the need to control for confounding variables (e.g. age and sex) and increase the efficiency of the study. Crossover studies also require fewer subjects than the corresponding parallel-group study and thus reduce recruitment costs, especially when subjects are scarce or expensive to obtain. In the two-period crossover study, subjects are randomly assigned to one of two sequences: (1) treatment A followed by treatment B, or (2) treatment B followed by treatment A. There is usually a washout period between the two treatments to reduce the chance that the effect of the first treatment carries over to the second. The treatment effect of the drugs, the period effect and the carryover effect are of interest. In this paper, our primary focus is the testing problem for the treatment effect, and we assume there is no carryover effect. We hypothesize that the efficacy of the new treatment is not inferior to, or is equivalent to, the efficacy of the standard treatment and that the new treatment has other advantages (for example, lower toxicity, lower cost or easier administration).
The binary outcome crossover study is considered here. The complete data that would result from such a study can be summarized in the 2 by 4 table given in Table 1. The subjects who receive the treatments in the order AB are denoted as group 1 and those in the order BA as group 2. In the table, each column name gives the ordered pair of responses to the two treatments. Here “1” indicates a positive response and “0” indicates a negative response. For example, the pair (1,0) in group 1 indicates that the response is positive for treatment A and negative for treatment B. The entry $n_{10}^{(1)}$, for example, is the number of patients in group 1 who had a (1,0) response. The other entries in the table are defined in a similar way, and the sizes of the two groups are given by the marginal totals N1 and N2. Associated with each entry is the corresponding probability that the patient has that outcome.
Table 1.
Treatment group | Ordered responses (0,0) | (0,1) | (1,0) | (1,1) | Total |
---|---|---|---|---|---|
1 (AB) | $n_{00}^{(1)}$ | $n_{01}^{(1)}$ | $n_{10}^{(1)}$ | $n_{11}^{(1)}$ | N1 |
2 (BA) | $n_{00}^{(2)}$ | $n_{01}^{(2)}$ | $n_{10}^{(2)}$ | $n_{11}^{(2)}$ | N2 |
The risk difference (RD) between the two treatments has been used to test the non-inferiority (or equivalence) of the new treatment versus the standard treatment in this setting [1, 2]. However, the non-inferiority (or equivalence) margin for the RD depends heavily on the response rate for the standard treatment, which makes it difficult to select a fixed constant non-inferiority (or equivalence) margin. To alleviate this concern, the odds ratio (OR) has been recommended by Lui and Chang [3] and Gart and Thomas [4] as an alternative measure for tests of non-inferiority (or equivalence) for binary outcomes.
Different models have been proposed to provide statistical inference for the crossover study. These include the random effects logit model (Ezzet and Whitehead [5]) and the log linear model (Kenward and Jones [6]). Interestingly, in both models, the treatment effect can be estimated by the OR from the 2 by 4 table (see Table 1) without model assumptions about the entry proportions. Lui and Chang [3] proposed both an asymptotic method and a conditional method to test non-inferiority (or equivalence) of the OR based on the random effects logit model of Ezzet and Whitehead [5]. Kenward and Jones [6] provided a likelihood ratio test (LRTM) for the OR based on the log linear model, which is equivalent to the logit model [5] when the random effects are ignored. All of these methods are subject to model misspecification. In this paper, we propose a likelihood ratio test (LRT) and a score test to evaluate the non-inferiority (or equivalence) of the OR without these model assumptions. Our methods are model free, and thus are more robust to model misspecification and provide extra efficiency for tests of the OR.
We introduce these non-inferiority and equivalence tests for the OR in Section 1. In Section 2, we provide the statistical framework and introduce the methods used in Lui and Chang [3] and Kenward and Jones [6]. Then we introduce our proposed LRT and score test methods in Section 3. We compare the type I error rates and power for all of these methods using Monte Carlo simulation in Section 4 and provide the sample size calculation in Section 5. An example is given in Section 6. Finally, we discuss the results and provide some recommendations in Section 7.
2. Statistical Framework and Model-Based Methods [3, 6]
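For the first row (AB) of Table 1, the four random cell counts with sum N1 are assumed to have a multinomial distribution with probabilities $(p_{00}^{(1)}, p_{01}^{(1)}, p_{10}^{(1)}, p_{11}^{(1)})$. Similarly, the second-row (BA) cell counts with sum N2 are assumed to have a multinomial distribution with probabilities $(p_{00}^{(2)}, p_{01}^{(2)}, p_{10}^{(2)}, p_{11}^{(2)})$. Let ϕ be the OR of a positive response for B over A. Note that Lui and Chang [3] define the OR as the square root of the OR defined here. In this paper, we consider the following non-inferiority test (1) and equivalence test (2) [3]:

$$H_0: \phi \le \phi_l \quad \text{versus} \quad H_1: \phi > \phi_l \tag{1}$$

$$H_0: \phi \le \phi_l \ \text{or} \ \phi \ge \phi_u \quad \text{versus} \quad H_1: \phi_l < \phi < \phi_u \tag{2}$$

where 0 < ϕl < 1 and we set ϕl = 0.5 [3]. We also set ϕu = 1/ϕl in this paper.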
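To make the quantity under test concrete, the following minimal sketch estimates ϕ from a 2 by 4 table. It rests on an assumption that the text above does not spell out: that the response pairs are ordered by period, so that ϕ (B over A) equals $(p_{01}^{(1)} p_{10}^{(2)})/(p_{10}^{(1)} p_{01}^{(2)})$ and depends only on the discordant cells. The function name and the counts are hypothetical and serve only as an illustration.

```python
import math


def odds_ratio_b_over_a(group1, group2):
    """Estimate phi (B over A) from counts ordered as (0,0), (0,1), (1,0), (1,1).

    Assumption: pairs are ordered by period, so in group 1 (AB) the cell (0,1)
    favours B, while in group 2 (BA) the cell (1,0) favours B.
    """
    _, n01_1, n10_1, _ = group1
    _, n01_2, n10_2, _ = group2
    return (n01_1 * n10_2) / (n10_1 * n01_2)


# Hypothetical counts, for illustration only.
g1 = (40, 20, 10, 30)
g2 = (35, 12, 24, 29)

phi_hat = odds_ratio_b_over_a(g1, g2)  # (20 * 24) / (10 * 12) = 4.0
phi_l = 0.5
phi_u = 1 / phi_l

print(phi_hat)                 # estimate on the scale used in this paper
print(math.sqrt(phi_hat))      # same estimate on the square-root scale of Lui and Chang
print(phi_l < phi_hat < phi_u) # is the point estimate inside the equivalence region?
```

A point estimate inside or outside the interval (ϕl, ϕu) is of course not a test; it only shows which cells drive the hypotheses (1) and (2).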
To test (1) and (2), Lui and Chang [3] used a random effects logit model, whereas Kenward and Jones [6] used a log linear model. When the random effect terms are ignored in Lui and Chang’s logit model [3], the logit model is equivalent to Kenward and Jones’s log linear model [6]. For convenience, we consider only the log linear model [6] in this paper.
In the section below on simulation, we describe some scenarios for which the log linear model does not adequately describe the data. In such scenarios, methods based on the log linear model lose power for tests of the OR due to loss of efficiency for the non-inferiority test and have high type I error inflation for the equivalence test. In Section 3, we provide a likelihood ratio test (LRT) and a score test which do not depend on any model assumptions.
3. Test Statistics
3.1. Likelihood Ratio Test (LRT) Statistic
We first consider the LRT statistic for non-inferiority test (1).
Suppose we have a 2 × 2 binary outcome sample $n_{ij}^{(k)}$; i, j = 0, 1; k = 1, 2 with $\sum_{i,j} n_{ij}^{(k)} = N_k$, k = 1, 2 as in Table 1. Assume $0 < p_{ij}^{(k)} < 1$; i, j = 0, 1; k = 1, 2 with natural constraints $\sum_{i,j} p_{ij}^{(k)} = 1$; k = 1, 2.
Let , the likelihood function is given as:
where , k = 1, 2 are two constraints for parameters, and . When we take the logarithm on both sides, we have
With the following reparameterization
(3) |
and
we have
This is a function of six independent parameters where ϕ is parameter of interest.
Let and . And A = −a + b + c; B = a * m2 − 2cm2; . Then the restricted maximum likelihood estimate (RMLE) of under ϕ = ϕl is the smaller root of the quadratic equation . The RMLE’s of the other parameters are given by . The unrestricted maximum likelihood estimates (MLE’s) are ; i, j = 0, 1; k = 1, 2.
Consider the following form of LRT statistic:
that can be calculated by using the above estimates of RMLE’s and unrestricted MLE’s.
If ϕ < ϕl, then LRT → 0; if ϕ = ϕl, “the asymptotic distribution of LRT is that of a chance variable which is zero half the time and which behaves like χ2 with one degree of freedom the other half of the time” [7]. Denote δ(0) as the distribution of the random variable with probability mass 1 at point zero. Then the random variable with distribution is non-negative. To be conservative, will be used to calculate p-values for the LRT.
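The LRT for the non-inferiority test can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes ϕ = $(p_{01}^{(1)} p_{10}^{(2)})/(p_{10}^{(1)} p_{01}^{(2)})$, in which case the likelihood factorizes and the restricted fit only involves the two binomials formed by the discordant pairs in each group. The quadratic solved below is the standard restricted-likelihood equation for two binomials with a fixed odds ratio and is not claimed to match the coefficients A, B and C above term by term. The p-value uses the half-and-half mixture of a point mass at zero and χ² with one degree of freedom described in the quoted result [7]; all function names are hypothetical, and all four discordant cells are assumed to be nonzero.

```python
import math


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binom_loglik(x, m, p):
    """Binomial log-likelihood kernel for x successes out of m."""
    return xlogy(x, p) + xlogy(m - x, 1.0 - p)


def restricted_mle(x1, m1, x2, m2, psi):
    """Maximise the two-binomial likelihood subject to odds(pi1) / odds(pi2) = psi."""
    s = x1 + x2
    a = m2 * (psi - 1.0)
    b = m1 * psi + m2 - s * (psi - 1.0)
    c = -s
    root = math.sqrt(b * b - 4.0 * a * c)
    pi2 = 2.0 * s / (b + root)  # admissible root, written to stay stable near psi = 1
    pi1 = psi * pi2 / (1.0 + (psi - 1.0) * pi2)
    return pi1, pi2


def lrt_noninferiority(group1, group2, phi_l):
    """Return (statistic, p-value) for H0: phi <= phi_l versus H1: phi > phi_l."""
    _, n01_1, n10_1, _ = group1
    _, n01_2, n10_2, _ = group2
    x1, m1 = n01_1, n01_1 + n10_1  # group 1 discordant pairs favouring B
    x2, m2 = n01_2, n01_2 + n10_2  # group 2 discordant pairs favouring A
    phi_hat = (n01_1 * n10_2) / (n10_1 * n01_2)
    if phi_hat <= phi_l:
        # The unrestricted MLE already lies in the null region.
        return 0.0, 1.0
    loglik_free = binom_loglik(x1, m1, x1 / m1) + binom_loglik(x2, m2, x2 / m2)
    pi1, pi2 = restricted_mle(x1, m1, x2, m2, phi_l)
    loglik_null = binom_loglik(x1, m1, pi1) + binom_loglik(x2, m2, pi2)
    stat = max(2.0 * (loglik_free - loglik_null), 0.0)
    # P(chi-square_1 > stat) = erfc(sqrt(stat / 2)); the mixture halves this tail.
    p_value = 0.5 * math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts ordered as (0,0), (0,1), (1,0), (1,1).
stat, p_value = lrt_noninferiority((40, 20, 10, 30), (35, 12, 24, 29), phi_l=0.5)
print(stat, p_value)
```

Doubling the returned p-value gives the more conservative χ² reference with one degree of freedom.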
To carry out the equivalence test (2), we use the two one-sided tests procedure [8] and conduct two non-inferiority tests: Ha0 : ϕ ≤ ϕl versus Ha1 : ϕ > ϕl, and Hb0 : ϕ ≥ 1/ϕl versus Hb1 : ϕ < 1/ϕl. Equivalence is concluded at level α only when both null hypotheses are rejected at level α.
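The two one-sided tests procedure can be sketched as below. For brevity the one-sided p-values come from a simple Wald-type test on log ϕ, which is an assumption made for this illustration; the same wrapper applies to any one-sided test, including the LRT. The function names, the formula for ϕ in terms of the discordant cells, and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def wald_one_sided_p(group1, group2, margin, alternative):
    """One-sided Wald p-value on the log scale.

    alternative="greater" tests H1: phi > margin,
    alternative="less" tests H1: phi < margin.
    Counts are ordered as (0,0), (0,1), (1,0), (1,1); discordant cells must be nonzero.
    """
    _, n01_1, n10_1, _ = group1
    _, n01_2, n10_2, _ = group2
    log_phi = math.log((n01_1 * n10_2) / (n10_1 * n01_2))
    se = math.sqrt(1 / n01_1 + 1 / n10_1 + 1 / n01_2 + 1 / n10_2)
    z = (log_phi - math.log(margin)) / se
    if alternative == "greater":
        return NormalDist().cdf(-z)  # upper tail
    return NormalDist().cdf(z)       # lower tail


def tost_equivalence(group1, group2, phi_l, alpha=0.05):
    """Equivalence is concluded only if both one-sided nulls are rejected."""
    p_lower = wald_one_sided_p(group1, group2, phi_l, "greater")
    p_upper = wald_one_sided_p(group1, group2, 1.0 / phi_l, "less")
    p_tost = max(p_lower, p_upper)
    return p_tost, p_tost <= alpha


# Hypothetical balanced table with point estimate phi_hat = 1.
p_tost, equivalent = tost_equivalence((50, 25, 25, 50), (50, 25, 25, 50), phi_l=0.5)
print(p_tost, equivalent)
```

Taking the maximum of the two p-values is what makes the procedure reject only when both component tests reject.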
3.2. Score Test Statistic
We first consider the non-inferiority score test. To derive this test, we first need the information matrix (see the detailed calculation in the Appendix). The information matrix can be partitioned as
(4) |
where the elements is a scalar, is a 5 × 1 matrix, is a 5×5 symmetric matrix.
Let β̂T be the RMLE under the null hypothesis. Then the general score test for testing H0 can be computed as
where the score vector is given by:
From equation (4), the inverse of the Fisher information matrix for ϕ is given by:
Under the null hypothesis, the asymptotic distribution of the score statistic is chi-squared with one degree of freedom.
As with the LRT, we use the two one-sided tests procedure [8] to conduct the equivalence score test (2).
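A compact numerical version of the score test is sketched below under the same assumption as before, namely that ϕ = $(p_{01}^{(1)} p_{10}^{(2)})/(p_{10}^{(1)} p_{01}^{(2)})$, so that only the discordant pairs carry information about ϕ. In that reduced two-binomial problem the score for log ϕ at the restricted fit is $x_1 - m_1\tilde{\pi}_1$, and the relevant element of the inverse information is $1/v_1 + 1/v_2$ with $v_k = m_k\tilde{\pi}_k(1-\tilde{\pi}_k)$. These symbols and the function name are introduced here for illustration and do not appear in the text. Reporting a one-sided p-value from the signed root of the statistic is also an assumption; the text states only the chi-squared reference distribution.

```python
import math


def score_noninferiority(group1, group2, phi_l):
    """Score test of H0: phi <= phi_l versus H1: phi > phi_l.

    Counts are ordered as (0,0), (0,1), (1,0), (1,1); discordant cells must be nonzero.
    Returns (chi-square statistic, signed root z, one-sided normal p-value).
    """
    _, n01_1, n10_1, _ = group1
    _, n01_2, n10_2, _ = group2
    x1, m1 = n01_1, n01_1 + n10_1
    x2, m2 = n01_2, n01_2 + n10_2

    # Restricted MLE under phi = phi_l (two binomials with a fixed odds ratio).
    s = x1 + x2
    a = m2 * (phi_l - 1.0)
    b = m1 * phi_l + m2 - s * (phi_l - 1.0)
    root = math.sqrt(b * b + 4.0 * a * s)
    pi2 = 2.0 * s / (b + root)
    pi1 = phi_l * pi2 / (1.0 + (phi_l - 1.0) * pi2)

    # Score for log(phi) and the matching element of the inverse information.
    u = x1 - m1 * pi1
    v1 = m1 * pi1 * (1.0 - pi1)
    v2 = m2 * pi2 * (1.0 - pi2)
    inv_info = 1.0 / v1 + 1.0 / v2

    stat = u * u * inv_info
    z = u * math.sqrt(inv_info)
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z)
    return stat, z, p_one_sided


# Hypothetical counts, for illustration only.
print(score_noninferiority((40, 20, 10, 30), (35, 12, 24, 29), phi_l=0.5))
```

Because the score test is invariant to reparameterization, working with log ϕ rather than ϕ does not change the statistic.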
4. Monte Carlo Simulation
We conducted a simulation study to examine the type I error rates and power of the proposed LRT and score test, the existing asymptotic method [3], conditional method [3] and LRTM [6] under the following three different scenarios. In Scenario 1), data comes from log linear model [6] with the basic probability of success set to be 0.2 and the period effect set to be 0.5; in Scenario 2), we set ; in Scenario 3), . We evaluated the fit of the log linear model [6] in scenarios 2 and 3 by deviance goodness of fit tests. In the scenarios considered, the log linear model did not adequately describe the data.
We take Scenario 1 (non-inferiority test) as an example to illustrate our simulation procedure. For a given sample size (N1, N2) and a true odds ratio, we generated 10,000 repeated samples from the log linear model with basic probability of success 0.2 and period effect 0.5. We calculated the theoretical p-values for LRT and score test using the asymptotic distribution under the null derived in Section 3. We used as the asymptotic null distribution for the LRT and used as the asymptotic null distribution for the score test. Then, by computing the proportion of times for which the null hypothesis was rejected (p ≤ 0.05), we obtained the estimated type I error rate when the true ϕ ≤ ϕl and power when true ϕ > ϕl for all the five methods. Then, similar procedures were used for data generated from Scenarios 2 and 3. Finally, we summarized the type I error rates and power for all methods based on scenario 1 in Figure 1 and those based on Scenarios 2 and 3 in Figure 2. Similar procedures were conducted in the equivalence test. The simulation results for equivalence test based on Scenario 1 are shown in Figure 3 and those based on Scenarios 2 and 3 are shown in Figure 4. This simulation study was conducted using R software.
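The simulation loop described above can be sketched as follows. The study itself was run in R with 10,000 replicates; this Python sketch uses fewer replicates for speed, hypothetical cell probabilities (not those of Scenarios 1 to 3), and a simple Wald-type test on log ϕ as a stand-in for the five methods compared. The formula for ϕ in terms of the discordant cells and the 0.5 adjustment for empty cells are likewise assumptions of this illustration.

```python
import math
import random
from statistics import NormalDist


def draw_table(rng, n, probs):
    """Draw multinomial counts for the cells (0,0), (0,1), (1,0), (1,1)."""
    counts = [0, 0, 0, 0]
    for cell in rng.choices(range(4), weights=probs, k=n):
        counts[cell] += 1
    return counts


def wald_p_value(group1, group2, phi_l):
    """One-sided Wald p-value for H1: phi > phi_l."""
    cells = [group1[1], group1[2], group2[1], group2[2]]
    if min(cells) == 0:
        cells = [c + 0.5 for c in cells]  # assumed adjustment for empty cells
    n01_1, n10_1, n01_2, n10_2 = cells
    log_phi = math.log((n01_1 * n10_2) / (n10_1 * n01_2))
    se = math.sqrt(sum(1.0 / c for c in cells))
    z = (log_phi - math.log(phi_l)) / se
    return NormalDist().cdf(-z)


def rejection_rate(phi, phi_l=0.5, n=200, reps=1000, alpha=0.05, seed=1):
    """Proportion of simulated trials with p <= alpha at a given true phi."""
    rng = random.Random(seed)
    p01 = 0.4 * phi / (1.0 + phi)        # hypothetical: 40% discordant pairs
    probs1 = (0.3, p01, 0.4 - p01, 0.3)  # group 1 (AB)
    probs2 = (0.3, 0.2, 0.2, 0.3)        # group 2 (BA), no preference
    rejections = 0
    for _ in range(reps):
        g1 = draw_table(rng, n, probs1)
        g2 = draw_table(rng, n, probs2)
        if wald_p_value(g1, g2, phi_l) <= alpha:
            rejections += 1
    return rejections / reps


print("type I error at phi = phi_l:", rejection_rate(0.5))
print("power at phi = 1.5:", rejection_rate(1.5))
```

Evaluating the rejection rate at ϕ = ϕl estimates the type I error rate on the boundary of the null, and evaluating it at ϕ > ϕl estimates power.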
Figures 1 and 2 show that, for the non-inferiority test, all methods maintain the nominal type I error rate in these three scenarios. Figures 1 and 2 also show that our LRT and score test methods achieve greater power than the asymptotic method of Lui and Chang [3] (not to be confused with the asymptotic distribution of the LRT and score test). The larger the sample size, the closer the power of the asymptotic method is to the power of our methods. Our LRT and score test methods and Lui and Chang’s asymptotic method always have greater power than the conditional method. It is well known that the conditional test is conservative and hence loses power. The LRT, score and asymptotic methods are generally more efficient than the conditional method. The most interesting observation is the behavior of the LRTM. When the data can be described by a log linear model, the LRTM has greater power than all other methods (see Figure 1 for Scenario 1). However, when the data cannot be described by the log linear model, the LRTM loses power. In particular, its power is even lower than that of the asymptotic and conditional methods, as shown in Figure 2 for Scenarios 2 and 3.
We ran additional simulations to investigate the relationship between the goodness of fit of the log linear model and the power loss of the LRTM relative to the LRT, defined as (power of LRT − power of LRTM)/power of LRTM, when a log linear model does not describe the data well. In Scenario 2, for a given true odds ratio, we simulated 10,000 repeated samples, fit a log linear model to each sample to obtain its deviance, and took the mean of these deviances as a measure of the average goodness of fit of the log linear model. For the same true odds ratio, we also calculated the power of the LRTM and of the LRT to obtain the power loss. We repeated these calculations for 50 true odds ratios with N1 = N2 = 50. A linear regression of the power losses on the corresponding mean deviances showed a significant increase in power loss with increasing deviance. That is, the worse the fit of the log linear model, the greater the power loss of the LRTM relative to our proposed LRT for the non-inferiority test. Since our score test has power similar to that of the LRT, we expect the same relationship to hold for the power loss of the LRTM relative to the score test.
Figures 3 and 4 show the simulation results for the equivalence test. The LRT, score and asymptotic methods have similar power in all scenarios considered, and they all outperform the conditional test. In Scenario 1, when the data can be described by a log linear model, the LRTM has the greatest power, as in the non-inferiority test (see Figure 3). However, in Scenarios 2 and 3, for which the data cannot be described by a log linear model, the LRTM has highly inflated type I error rates (see Figure 4). Furthermore, the LRT, score, asymptotic and conditional methods all attain their highest power at true ϕ = 1, as expected, while the LRTM does not. This inconsistent behavior shows that the LRTM cannot be used in scenarios for which a log linear model does not describe the data well.
5. Sample Size Calculation
Sample sizes required for 80% power in Scenario 2 with are shown in Table 2 for non-inferiority test and Table 3 for equivalence test.
Table 2. Required sample sizes by test statistic (non-inferiority test).
True ϕ | LRT | Score | Asym | Conditional | LRTM |
---|---|---|---|---|---|
0.8 | 1362 | 1344 | 1376 | 1476 | 1804 |
0.9 | 907 | 902 | 904 | 1000 | 1312 |
1.0 | 676 | 676 | 684 | 760 | 1066 |
1.1 | 548 | 538 | 542 | 614 | 870 |
1.2 | 458 | 448 | 454 | 516 | 776 |
Table 3. Required sample sizes by test statistic (equivalence test).
True ϕ | LRT | Score | Asym | Conditional | LRTM |
---|---|---|---|---|---|
0.8 | 1370 | 1344 | 1366 | 1484 | 1812 |
0.9 | 1022 | 1005 | 1020 | 1090 | 1320 |
1.0 | 969 | 962 | 968 | 1064 | 1052 |
1.1 | 1092 | 1066 | 1085 | 1202 | 890 |
1.2 | 1410 | 1406 | 1410 | 1516 | 831 |
As expected, for the non-inferiority test, the required sample size decreases as ϕ increases. The sample sizes required by the LRTM are larger than those required by the other methods because, in this scenario, the log linear model does not fit the simulated data. The sample sizes for the LRT, score and asymptotic methods are smaller than those for the conditional method, which is conservative. Furthermore, the sample sizes for the LRT and score methods are comparable. For the equivalence test, the required sample size is smallest when the true ϕ = 1 for all methods except the LRTM. The LRT, score and asymptotic methods require similar sample sizes in all situations. The conditional method still requires a larger sample size than the LRT, score and asymptotic methods. The LRTM again behaves poorly, as it does in the Monte Carlo simulation study.
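The qualitative pattern for the non-inferiority test, with smaller sample sizes as the true ϕ moves away from the margin, can be illustrated with a normal-approximation formula for a Wald-type test on log ϕ. This is not the procedure used to produce Tables 2 and 3, and it will not reproduce their entries. The cell probabilities are hypothetical, equal group sizes are assumed, and ϕ is taken to be $(p_{01}^{(1)} p_{10}^{(2)})/(p_{10}^{(1)} p_{01}^{(2)})$.

```python
import math
from statistics import NormalDist


def wald_sample_size(probs1, probs2, phi_l, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a one-sided Wald test of phi > phi_l.

    probs1 and probs2 are cell probabilities ordered as (0,0), (0,1), (1,0), (1,1).
    """
    _, p01_1, p10_1, _ = probs1
    _, p01_2, p10_2, _ = probs2
    phi = (p01_1 * p10_2) / (p10_1 * p01_2)
    if phi <= phi_l:
        raise ValueError("the true phi must exceed the margin phi_l")
    # Delta method: N * Var(log phi_hat) for equal group sizes N.
    unit_var = 1 / p01_1 + 1 / p10_1 + 1 / p01_2 + 1 / p10_2
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    effect = math.log(phi) - math.log(phi_l)
    return math.ceil((z_alpha + z_beta) ** 2 * unit_var / effect ** 2)


def probs_for(phi):
    """Hypothetical group 1 probabilities with 40% discordant pairs."""
    p01 = 0.4 * phi / (1.0 + phi)
    return (0.3, p01, 0.4 - p01, 0.3)


group2_probs = (0.3, 0.2, 0.2, 0.3)  # hypothetical, no preference in group 2
for phi in (0.8, 0.9, 1.0, 1.1, 1.2):
    print(phi, wald_sample_size(probs_for(phi), group2_probs, phi_l=0.5))
```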
6. Clinical Trial Example
Consider the trial conducted by 3M-Riker and discussed in Lui and Chang [3]. This crossover study was designed to compare two inhalation devices (A and B) delivering salbutamol [5]. The 139 patients randomized to Group 1 used device A followed by device B, and the 140 patients in Group 2 used the devices in the reverse order. Patients were asked to evaluate the features of each device and to respond either “Yes” or “No” for each device. The patients’ responses are summarized in Table 4. A “1” represents a “Yes” response and a “0” represents a “No” response. We are interested in testing the non-inferiority (or equivalence) of device A versus device B with respect to the patient preference rate (instead of device B versus A) [3].
Table 4.
Treatment group | Ordered responses (0,0) | (0,1) | (1,0) | (1,1) | Total |
---|---|---|---|---|---|
1 (AB) | 57 | 15 | 41 | 26 | 139 |
2 (BA) | 54 | 32 | 16 | 38 | 140 |
Suppose we choose a clinically acceptable non-inferiority margin of 0.8 for the OR. When we conducted a non-inferiority test for the OR on this study, we obtained p-values of 1.67 × 10⁻⁶, 1.96 × 10⁻⁶, 4.39 × 10⁻⁶, 3.68 × 10⁻⁶ and 1.09 × 10⁻⁵ for the LRT, score, asymptotic, conditional and LRTM methods, respectively. All of these small p-values provide strong evidence that the patients’ preference rate for device A is non-inferior to that for device B. When we fit a log linear model to these data, the deviance goodness-of-fit test gave p < 0.001. Thus, a log linear model does not describe the data well, and for this reason the LRTM has a larger p-value than the LRT and score methods.
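As a rough cross-check of the direction and size of the effect, the point estimate of the OR for device A over device B can be computed directly from the discordant cells of Table 4. The sketch assumes that the response pairs are ordered by period, so that the OR for A over B is $(n_{10}^{(1)} n_{01}^{(2)})/(n_{01}^{(1)} n_{10}^{(2)})$. The Wald-type z statistic on the log scale is included only for orientation and is not expected to reproduce the p-values reported above.

```python
import math
from statistics import NormalDist

# Table 4 counts ordered as (0,0), (0,1), (1,0), (1,1).
group1 = (57, 15, 41, 26)  # AB
group2 = (54, 32, 16, 38)  # BA

n01_1, n10_1 = group1[1], group1[2]
n01_2, n10_2 = group2[1], group2[2]

# Assumed period ordering: (1,0) favours A in group 1, (0,1) favours A in group 2.
or_a_over_b = (n10_1 * n01_2) / (n01_1 * n10_2)

margin = 0.8
se_log = math.sqrt(1 / n01_1 + 1 / n10_1 + 1 / n01_2 + 1 / n10_2)
z = (math.log(or_a_over_b) - math.log(margin)) / se_log
p_wald = NormalDist().cdf(-z)  # one-sided p-value for H1: OR > margin

print(round(or_a_over_b, 3))  # 5.467
print(z, p_wald)
```

The estimate lies far above the margin of 0.8, which is consistent with the very small non-inferiority p-values reported for all five methods.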
For the equivalence test, none of the LRT, score, asymptotic and conditional methods rejects the null hypothesis that the patients’ preference rates for devices A and B are not equivalent. Because the LRTM has highly inflated type I error rates when a log linear model does not fit the data, it is not appropriate to apply the LRTM in this study.
7. Discussion
In this paper, we proposed a likelihood ratio test and a score test for the non-inferiority (or equivalence) testing problem for the odds ratio in a crossover study. Both methods are independent of model assumptions. We compared our tests with Lui and Chang’s asymptotic and conditional methods [3], which are based on a random effects model. For the non-inferiority test, our proposed LRT and score test achieve higher power than the asymptotic method [3], and the power of the three methods becomes closer as the sample size grows. For the equivalence test, the LRT, score and asymptotic methods have similar power. This occurs because the asymptotic method is actually a Wald test, and Engle [9] showed that the Wald, likelihood ratio and score tests are asymptotically equivalent, so their power converges as the sample size increases. We also compared the LRT and score test with Kenward and Jones’s LRTM, which is based on a log linear model assumption [6]. The LRTM achieves higher power than the LRT and score test when the log linear model holds, but behaves poorly when it does not. From the Neyman-Pearson Lemma, the LRTM is the most powerful test when the log linear model holds, but it loses this good behavior when the model assumption fails because of the loss of precision in the estimation of parameters.
We focused on treatment effects for a crossover study in our paper. If we use the which results from switching and in (3), we can extend our LRT and score methods to the non-inferiority (or equivalence) test to incorporate period effects.
The LRT and score test methods in this paper can only be used for crossover studies with two periods. Extending them to crossover studies with more than two periods is an interesting topic for further research.
Acknowledgments
Judith D. Goldberg, Sc.D was partially supported by the NYU CTSA Grant UL1TR000038 from the National Center for Advancing Translational Sciences (NCATS), NIH.
Appendix
Information Matrix for the Score Test
Denote
and
Then the elements of the information matrix are
References
1. Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. 3rd ed. New York: Wiley; 2003.
2. Fleiss JL. The Design and Analysis of Clinical Experiments. New York: Wiley; 1986.
3. Lui KJ, Chang KC. Test non-inferiority (and equivalence) based on the odds ratio under a simple crossover trial. Statistics in Medicine. 2011;30:1230–1242. doi: 10.1002/sim.4166.
4. Gart JJ, Thomas DG. Numerical results on approximate confidence limits for the odds ratio. Journal of the Royal Statistical Society B. 1972;34:441–447.
5. Ezzet F, Whitehead J. A random effects model for binary data from crossover clinical trials. Applied Statistics. 1992;41:117–126.
6. Kenward MG, Jones B. A log-linear model for binary cross-over data. Applied Statistics. 1987;36:192–204.
7. Chernoff H. On the distribution of the likelihood ratio. Annals of Mathematical Statistics. 1954;25:573–578.
8. Liu JP, Weng CS. Bias of two one-sided tests procedures in assessment of bioequivalence. Statistics in Medicine. 1995;14:853–861. doi: 10.1002/sim.4780140813.
9. Engle RF. Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics. In: Intriligator MD, Griliches Z, editors. Handbook of Econometrics II. Elsevier; 1983.