Abstract
The relative potency of one agent to another is commonly represented by the ratio of two quantal response parameters; for example, the LD50 of animals receiving a treatment to the LD50 of control animals, where LD50 is the dose of toxin that is lethal to 50% of animals. Though others have considered interval estimators of LD50, here we extend Bayesian, bootstrap, likelihood ratio, Fieller’s, and Wald’s methods to estimate intervals for relative potency in a parallel-line assay context. In addition to comparing their coverage probabilities, we also consider their power in two types of dose designs: one assigning treatment and control the same doses versus one choosing doses for treatment and control to achieve the same lethality targets. We explore these methods in realistic contexts of relative potency of radiation countermeasures. For larger experiments (e.g., ≥ 100 animals), the methods return similar results regardless of the interval estimation method or experiment design. For smaller experiments (e.g., < 60 animals), Wald’s method stands out among the others, producing intervals that hold closely to nominal levels and providing more power than the other methods in statistically efficient designs. Using this simple statistical method within a statistically efficient design, researchers can reduce animal numbers.
Keywords: power; sample size; parallel-line assay; dose reduction factor; 3Rs
1. INTRODUCTION
Regulatory agencies like the U.S. Food and Drug Administration (2002), the U.S. Department of Agriculture (2011), and the World Health Organization (Wilbur & Aubert, 1996) require measures of relative potency (RP) for certain pharmaceuticals like radiation countermeasures and vaccines. However, these agencies provide little guidance on statistical methods for producing the required measures, especially interval estimates of these measures. Consequently, research reports often fail to include measures of uncertainty, or even a statistical test, when reporting RP in subject-specific journals (see Landes, Lensing, Kodell, & Hauer-Jensen, 2013, for a review).
In radiation countermeasure research and in vaccine research and production, estimating RP often requires administering toxin (e.g., radiation or viral load) in increasing amounts to groups of control and treated animals, and then recording how many animals survive to the end of the experiment. RPs are typically ratios of treatment and control potency values. For binary outcomes analyzed with probit or logistic regression given log-dose, estimated RPs implicitly assume dose-responses for control and treatment groups are parallel. Hence, the experiment conducted to estimate RP is a parallel-line assay. For these types of experiments, we compare statistical properties of interval estimators of RP, and make recommendations on which estimators to use.
Though several researchers have developed methods to make statistical inference on RP, most have concentrated on special cases. A few, though, have considered more than one method. Williams (1988) developed an exact confidence interval method appropriate for multivariate normal response vectors coming from either parallel-line or slope-ratio models, and compared it to likelihood ratio intervals and a modified version of the likelihood ratio interval on a single data set; no comparison of statistical properties was provided. Iturria (2005) considered three different tests (but not interval estimates) of RP when the responses are correlated between the control and treatment group, and are modeled with nonlinear regression. Le, Grambsch, and Liu (2005) constructed a maximum likelihood estimator for a parallel-line assay in which there are multiple sources of variability and compared its bias and mean square error with the standard maximum likelihood estimator, which ignores the extra variability. Within the present context (quantal-response, parallel-line assays), we found no study that compared interval estimation of RP among methods.
Regarding experiment designs within the parallel-line assay context, often, control and treatment groups receive the same dose levels (equal-dose design). However, assigning doses that target the same response level between groups (equal-response design) provides more statistical power (Landes et al., 2013). We could not find literature investigating whether RP estimation methods depend on experiment design.
We note that RP estimation often depends on an estimated quantal response parameter, e.g., LD50, from reference animals and from test animals. Others, such as Faraggi et al. (2003), Kelly (2001a, 2001b), Sitter & Wu (1993), and Williams (1986), have studied interval estimation of LD50 extensively. Indeed, methods for producing interval estimates of LD50 can be adapted for RP, and we do that here. But the performance of these methods for RP remained unstudied until now. Because we consider equal-response designs, which imply different doses for the two groups, the performance of intervals for LD50 does not necessarily carry over to RP intervals.
In this work, we compare coverage probabilities and power among five interval estimators of RP. We also consider these properties under equal-response experiment designs (preferred) and under equal-dose experiment designs (commonly used). Finally, we make recommendations on which methods are preferred and under what conditions.
2. MODELS, METHODS, AND DESIGNS
We frame our model in terms of animal deaths, y, among a group of n animals receiving dose X* of a toxin. Defining LDP to be the dose lethal to P% of a population, we hypothesize that LDP for animals treated with an antitoxin is higher than the LDP for controls. In radiation countermeasure research, the preferred measure of RP is called a dose reduction factor and is the ratio of the LD50 of treated animals to the LD50 of control animals.
2.1. Model
Under treatment j with j = 0 or 1, njk animals receive dose X*jk with k = 1, 2, …, mj; of these, Yjk animals die. We model
Yjk ~ Binomial(njk, pjk), with pjk = Φ(α0 + α1Tj + βXjk),   (Display 1)
where Φ is the standard normal distribution function, Tj indicates receipt of treatment (1) or control (0), and Xjk is log10 of X*jk. This model is a probit regression.
Typically, researchers are interested in three parameters available from this model: the doses that cause death in 50% of control animals, LD50(0) and in 50% of treated animals, LD50(1), and the potency of dose in control animals, relative to treated animals, a.k.a. relative potency, RP = LD50(1) / LD50(0). Since researchers most often aim to estimate and test RP, and estimation and testing methods for LD50 are well studied, we focus on RP. From Display 1, the base-10 logarithm of RP is θ = −α1/β. We describe confidence interval estimation methods below in terms of θ.
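To make estimation of θ concrete, the probit model in Display 1 can be fit by maximum likelihood and θ̂ computed from the fitted coefficients. The sketch below uses NumPy/SciPy rather than the paper's supplementary R/WinBUGS code, and the doses, group sizes, and death counts are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical quantal-response data (not from the paper):
T = np.array([0, 0, 0, 1, 1, 1])                    # treatment indicator Tj
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])  # log10 doses Xjk
n = np.array([10, 10, 10, 10, 10, 10])              # animals per dose group
y = np.array([2, 5, 9, 1, 6, 8])                    # observed deaths Yjk

def negloglik(par):
    # Binomial negative log-likelihood under the probit model of Display 1
    a0, a1, b = par
    p = np.clip(norm.cdf(a0 + a1 * T + b * X), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

fit = minimize(negloglik, x0=[0.0, 0.0, 1.0], method="BFGS")
a0_hat, a1_hat, b_hat = fit.x
theta_hat = -a1_hat / b_hat    # log10 relative potency
rp_hat = 10 ** theta_hat       # estimated RP
```

Because the probit log-likelihood is concave in the parameters, a generic quasi-Newton optimizer suffices here.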
2.2. Methods
We consider five methods for constructing interval estimates: two use computer intensive methods (Bayes’ and bootstrap), one requires numerical maximization (inverted likelihood ratio test), and two use a closed-form formula (Fieller’s theorem and Wald’s method). Supplementary material contains R and WinBUGS code that implements each of the methods below on an example data set. For some of the methods below, power must be evaluated with simulation; hence, we use simulation to evaluate power for all of the methods.
2.2.1. Bayes’ Method
We adapt the method for estimating LD50 of a population presented in Gelman, Carlin, Stern & Rubin (1995). From Display 1, assign priors αj ~ N(Aj, V(α)j) for j = 0, 1, and β ~ N(B, V(β)), choosing the V(·) to be sufficiently diffuse. With these priors and the data model described by Display 1, obtain an empirical posterior distribution of θ|y via Markov chain Monte Carlo, where θ = −α1/β and y is the vector of observed deaths. The q/2 and 1 − q/2 quantiles form a (1 − q)100% credible interval (CI¹) for θ. The maximum likelihood estimates (MLEs) of (α0, α1, β), and values near them, provide useful starting values for the Markov chain Monte Carlo.
Regarding values for prior variances, V(·): Smaller, presumably more informative, values for V(α)j and V(β) shorten CIs for αj and β, as expected. However, for the model in Display 1, CIs for θ, a ratio of α1 and β, increase in width; this is especially true when the prior means, Aj and B, are relatively far from the MLEs of αj and β. Pragmatically, the priors will be sufficiently diffuse when V(α)j > |Aj − MLE(αj)|² and V(β) > |B − MLE(β)|².
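A minimal random-walk Metropolis sampler illustrates the Bayesian method under diffuse N(0, 1000) priors; the data set, starting values, and proposal scales below are hypothetical tuning choices, not values from this study:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

V_prior = 1000.0                          # diffuse N(0, 1000) priors

def log_post(par):
    # log posterior = binomial log-likelihood + normal log-priors (up to a constant)
    a0, a1, b = par
    p = np.clip(norm.cdf(a0 + a1 * T + b * X), 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (n - y) * np.log(1 - p))
    logprior = -0.5 * np.sum(par**2) / V_prior
    return loglik + logprior

# Random-walk Metropolis, started near plausible values (a tuning choice)
cur = np.array([-5.0, -1.0, 6.0])
cur_lp = log_post(cur)
draws = []
for _ in range(20000):
    prop = cur + rng.normal(scale=[0.8, 0.5, 0.9], size=3)
    prop_lp = log_post(prop)
    if np.log(rng.uniform()) < prop_lp - cur_lp:
        cur, cur_lp = prop, prop_lp
    draws.append(cur)

theta = np.array([-a1 / b for _, a1, b in draws[5000:]])  # discard burn-in
ci95 = np.quantile(theta, [0.025, 0.975])                  # 95% credible interval
```

In practice one would use dedicated MCMC software (such as the WinBUGS code in the supplementary material) with convergence diagnostics; this sketch only shows the mechanics.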
2.2.2. Bootstrap
We adapt the method for estimating LD50 of a population presented in Kelly (2001a). Obtain the MLEs of (α0, α1, β) from the original data set. Keeping the original (T, X, n) data, substitute the MLEs into the model described by Display 1, and generate a large number of bootstrap data sets of Y. Since the originally estimated parameters are used to generate the response data Y, this is a parametric bootstrap. We generated 1000 bootstrap data sets (Davison & Hinkley, 1997). From each bootstrap data set, obtain the MLE of (α0, α1, β), and compute the estimate of θ. The q/2 and 1 − q/2 quantiles of the bootstrap distribution form the (1 − q)100% confidence interval (CI) for θ.
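A sketch of the parametric bootstrap, again on hypothetical data; fit_probit is an illustrative helper, and bootstrap fits with a numerically zero slope are simply skipped:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

def fit_probit(deaths):
    """MLE of (alpha0, alpha1, beta) under the probit model of Display 1."""
    def nll(par):
        p = np.clip(norm.cdf(par[0] + par[1] * T + par[2] * X), 1e-12, 1 - 1e-12)
        return -np.sum(deaths * np.log(p) + (n - deaths) * np.log(1 - p))
    return minimize(nll, x0=[0.0, 0.0, 1.0], method="BFGS").x

a0, a1, b = fit_probit(y)
p_hat = norm.cdf(a0 + a1 * T + b * X)    # fitted death probabilities

thetas = []
for _ in range(1000):                    # 1000 parametric bootstrap data sets
    y_boot = rng.binomial(n, p_hat)      # resample deaths from the fitted model
    _, b1, bb = fit_probit(y_boot)
    if abs(bb) > 1e-8:                   # skip degenerate fits
        thetas.append(-b1 / bb)

ci95 = np.quantile(thetas, [0.025, 0.975])   # percentile bootstrap CI for theta
```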
2.2.3. Inverted Likelihood Ratio Test
We adapt the method for estimating LD50 of a population presented in Williams (1986). Parameterizing α0 + α1Tj + βXjk from Display 1 in terms of θ gives α0 + β(Xjk − θTj). The likelihood ratio test of θ = θ0 yields the χ²(1) test statistic L(θ0) − L(θ̂), where L(·) is −2×log-likelihood and θ̂ is the MLE. The values θ∗ that satisfy L(θ∗) − L(θ̂) = χ²(1),1−q form the endpoints of an approximate (1 − q)100% CI for θ. However, when a likelihood ratio test of β = 0 cannot be rejected at the q significance level, the (1 − q)100% CI for θ has infinite endpoints (Kelly, 2001a).
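The profile-likelihood computation can be sketched as follows; the data and the search brackets are hypothetical choices, and the interval endpoints are found by root-finding on the profile deviance:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar, brentq
from scipy.stats import norm, chi2

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

def deviance(theta):
    # -2 x log-likelihood, profiled over (alpha0, beta) at fixed theta,
    # using the reparameterized linear predictor alpha0 + beta*(X - theta*T)
    def nll(par):
        a0, b = par
        p = np.clip(norm.cdf(a0 + b * (X - theta * T)), 1e-12, 1 - 1e-12)
        return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))
    return 2 * minimize(nll, x0=[-5.0, 6.0], method="Nelder-Mead").fun

# MLE of theta minimizes the profile deviance (bounds are illustrative)
opt = minimize_scalar(deviance, bounds=(-1.0, 1.0), method="bounded")
theta_hat, dev_min = opt.x, opt.fun
crit = chi2.ppf(0.95, df=1)            # for a 95% CI

g = lambda t: deviance(t) - dev_min - crit   # roots are the CI endpoints
lower = brentq(g, theta_hat - 2.0, theta_hat)
upper = brentq(g, theta_hat, theta_hat + 2.0)
```

The brackets (θ̂ ± 2) assume the deviance has climbed past the χ² critical value within them; with a flat likelihood in β, no such roots exist and the CI is infinite, as noted above.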
2.2.4. Fieller’s Method
The upper and lower bounds of Fieller’s (1 − q)100% CI for θ are given by

[ θ̂ + g(v̂12/v̂22) ± (z1−q/2/β̂)√( v̂11 + 2θ̂v̂12 + θ̂²v̂22 − g(v̂11 − v̂12²/v̂22) ) ] / (1 − g),

where θ̂ = −α̂1/β̂; α̂1 and β̂ are the MLEs; z1−q/2 is the 1 − q/2 quantile from the standard normal distribution; g = z²1−q/2·v̂22/β̂²; and v̂11, v̂12, and v̂22 are, respectively, the estimated variance of α̂1, covariance of α̂1 and β̂, and variance of β̂ from the estimated variance-covariance matrix of (α̂1, β̂). When the denominator, 1 − g, is negative, the (1 − q)100% CI for θ will either be of infinite length or will be all points outside of a finite interval. This occurs when a size q Wald test of β = 0 is not significant. We note that, though β may statistically differ from 0, this does not guarantee that θ statistically differs from 0.
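As a numeric sketch (not this paper's supplementary code), Fieller bounds for θ = −α1/β can be computed from the MLEs and an estimated covariance matrix; here the covariance comes from a finite-difference observed-information matrix, and the data are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

def negloglik(par):
    p = np.clip(norm.cdf(par[0] + par[1] * T + par[2] * X), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

mle = minimize(negloglik, x0=[0.0, 0.0, 1.0], method="BFGS").x
a1_hat, b_hat = mle[1], mle[2]
theta_hat = -a1_hat / b_hat

def hessian(f, x, h=1e-4):
    # observed information via central finite differences
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = h
            ej = np.zeros(k); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

V = np.linalg.inv(hessian(negloglik, mle))     # asymptotic covariance matrix
v11, v12, v22 = V[1, 1], V[1, 2], V[2, 2]      # terms for (alpha1, beta)

z = norm.ppf(0.975)                            # 95% CI
g = z**2 * v22 / b_hat**2
disc = v11 + 2 * theta_hat * v12 + theta_hat**2 * v22 - g * (v11 - v12**2 / v22)
half = (z / abs(b_hat)) * np.sqrt(disc)
center = theta_hat + g * v12 / v22
lower, upper = sorted([(center - half) / (1 - g), (center + half) / (1 - g)])
```

When g ≥ 1 (a non-significant slope), the division by 1 − g produces the degenerate intervals described above; this sketch assumes g < 1.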
2.2.5. Wald’s Method
The upper and lower bounds of Wald’s (1 − q)100% CI for θ are given by

θ̂ ± (z1−q/2/|β̂|)√( v̂11 + 2θ̂v̂12 + θ̂²v̂22 ),

where θ̂, β̂, z1−q/2, v̂11, v̂12, and v̂22 are as in Fieller’s method.
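A corresponding sketch of Wald's method: the standard error of θ̂ follows from the delta method applied to θ = −α1/β. The data are hypothetical, and the covariance matrix again comes from a finite-difference observed-information matrix:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

def negloglik(par):
    p = np.clip(norm.cdf(par[0] + par[1] * T + par[2] * X), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

mle = minimize(negloglik, x0=[0.0, 0.0, 1.0], method="BFGS").x
a1_hat, b_hat = mle[1], mle[2]
theta_hat = -a1_hat / b_hat

def hessian(f, x, h=1e-4):
    # observed information via central finite differences
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = h
            ej = np.zeros(k); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

V = np.linalg.inv(hessian(negloglik, mle))    # asymptotic covariance matrix
v11, v12, v22 = V[1, 1], V[1, 2], V[2, 2]     # terms for (alpha1, beta)

# Delta method: Var(theta_hat) ~ (v11 + 2*theta*v12 + theta^2*v22) / beta^2
z = norm.ppf(0.975)                           # 95% CI
se = np.sqrt(v11 + 2 * theta_hat * v12 + theta_hat**2 * v22) / abs(b_hat)
lower, upper = theta_hat - z * se, theta_hat + z * se
```

Note that as g → 0 (a precisely estimated slope), Fieller's bounds converge to these Wald bounds.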
2.3. Designs
Two types of dose designs appear in relative potency literature: equal-dose and equal-response designs. In equal-dose designs, treatment and control dose groups are the same; that is, from Display 1, X*0k = X*1k for all k. Consequently, when RP > 1, the expected response probabilities for treated animals are smaller than those for control animals at any given X*. In equal-response designs, treatment and control dose groups are not equal; that is, X*0k ≠ X*1k for all k. Rather, treatment’s X*1k are chosen to achieve the same expected response probabilities as control’s X*0k. For example, a researcher plans control doses, X*0k, to achieve LD13.6, LD50, and LD86.4, and, based on a hypothesized RP > 1, chooses treatment doses, X*1k, that will achieve LD13.6, LD50, and LD86.4. The treatment doses will necessarily be higher than the corresponding control doses.
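The equal-response dose computation can be sketched by inverting the probit model at the target lethality levels. The planning values mirror the shallow-slope scenario used later in the simulations (α0 = 1.5714, β = 2.91); the hypothesized RP of 2.0 is arbitrary:

```python
import numpy as np
from scipy.stats import norm

beta, alpha0 = 2.91, 1.5714         # planning values (shallow-slope scenario)
rp_hyp = 2.0                        # hypothesized RP (illustrative)
theta = np.log10(rp_hyp)            # RP on the log10 scale
alpha1 = -theta * beta              # implied treatment intercept shift

P = np.array([0.136, 0.50, 0.864])  # target lethality levels (M = 3 design)
X0 = (norm.ppf(P) - alpha0) / beta            # control log10 doses
X1 = (norm.ppf(P) - alpha0 - alpha1) / beta   # treatment log10 doses
doses0, doses1 = 10 ** X0, 10 ** X1
# Each treatment dose equals the corresponding control dose times the
# hypothesized RP, since X1 - X0 = theta at every target level.
```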
3. SIMULATIONS FOR STUDYING COVERAGE PROBABILITIES, POWER, AND DESIGN ISSUES
We conducted two simulation studies. Common to both was the objective of comparing a treatment to a control. The first study was primarily to evaluate coverage probabilities of interval estimators (expressed throughout as percentages); we refer to this study as “Coverage-Probability.” Studying design issues and power was the main objective of a smaller study, referred to as “Design-Power.”
3.1. Coverage-Probability Study
We considered the number of doses (M), the number of animals per treatment×dose (Nd), slope (β) on log10 dose, and effect size (ES, defined below). Experimenters control M and Nd. We chose values for M and Nd that represent small, medium, and large numbers typical of those found in radiation countermeasure experiments (Table S1 in Landes et al., 2013). The M doses for control and treatment targeted the same expected response probabilities; i.e., equal-response design.
In practice, experimental data provide estimates of β and ES. We consider a steep and a shallow value of β, the same as used in Kodell et al. (2010). The steeper β is similar to those found in radiation countermeasure experiments (Table S1 in Landes et al., 2013). The shallow β comes from an insulin example in Finney (1978, Chapter 18). The values for ES also come from Kodell et al. We note that in the probit model (Display 1), the tolerance distribution of log10 dose is normal with standard deviation 1/β. Then ES is the relative potency on the log10 scale, divided by this standard deviation, or θ×β = −α1 (Display 1). For steep β, the relative potencies corresponding to the ES we consider are within the range of those typically found in radiation countermeasure experiments.
The Coverage-Probability study covers all 54 combinations of these factors {levels}: M {3, 5, 7} × Nd {4, 9, 16} × β {2.91, 23.25} × ES {0, 1, 1.5}. For each M, the targeted expected response probabilities, P, of LDP were {0.136, 0.50, 0.864} for M = 3; {0.05, 0.275, 0.50, 0.725, 0.95} for M = 5; and {0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95} for M = 7. Given β and P, αj is needed to compute the log10 doses, Xjk = ( Φ−1(Pk) − αj ) / β, where Φ−1 is the inverse c.d.f. of the standard normal distribution. We note that α0 is a location parameter and has no bearing on coverage probabilities or power; α1 = −θ×β is the negative of the effect size (ES). Our simulation used these (α0, β) pairs: (1.5714, 2.91) and (−23.25, 23.25). For each of the 54 combinations, we generated 1000 random data sets under the model in Display 1.
Regarding the Bayesian method, we assigned these priors: αj ~ N(0, 1000) for j = 0, 1, and β ~ N(0, 1000). And for the bootstrap method, we obtained 1000 bootstrap datasets for each simulated dataset.
3.2. Design-Power Study
We considered experiment size (M×Nd), dose design, and effect size (ES). Experiment sizes were small (3×4), medium (5×9), or large (7×16); total sample size is 2×M×Nd. Dose designs were equal-response or equal-dose. ES ranged from 0 to 1.79 as relative potencies ranged from 1 to 3.50 in 0.25 increments². Altogether there were 66 settings (3 experiment sizes × 2 dose designs × 11 ES). We used the shallow β = 2.91, with α0 = 1.5714, for all settings. Because of the large number of settings, we reduced the number of simulated data sets per setting. For small M and small Nd, sometimes there was no variation in the responses within each of the M dose groups; i.e., either all animals survived or all died within each dose group. In those instances, the Newton–Raphson maximization of the likelihood function failed to converge. Knowing this failure would likely occur for some M×Nd settings, we generated enough data sets (between 250 and 350) to obtain at least 200 simulated data sets for which the likelihood could be maximized³. Data generation was as in the Coverage-Probability study. We used the same priors for the Bayesian method, and the same number of bootstrap samples for the bootstrap method, as in the first simulation study.
3.3. Analyses of Simulated Data
From each simulated data set, we obtained interval estimates with each of the methods described in Section 2. Practitioners have control over M and Nd, but must estimate regression parameters, represented by β and ES in our simulation studies. For this reason, we report on statistical properties of estimators for each combination M×Nd, averaging over β×ES levels.
The Coverage-Probability study included ES>0, and the Design-Power study included equal-response designs. Hence, when estimating coverage probabilities and power of equal-response designs, we used relevant settings from both simulation studies. Statistical properties of the CI methods in equal-dose designs come only from the Design-Power study.
4. RESULTS OF SIMULATION STUDIES
For all methods, if the likelihood could not be maximized on a data set, then that data set was discarded. The M×Nd = 3×4 settings suffered the most, with 11.3% failed likelihood maximizations among 8750 equal-response data sets and 22.0% among 3850 equal-dose data sets. Likelihood maximization success rates for all other M×Nd settings exceeded 99.4%.
Both the Fieller and inverted likelihood ratio test (iLRT) methods can result in infinite confidence intervals. These infinite CIs were an issue for the M×Nd = 3×4 settings. For equal-response data sets, 6% returned infinite 95% CIs for each of the iLRT and Fieller methods. And for equal-dose data sets, infinite 95% CIs resulted for 11% and 16% of the data sets for iLRT and Fieller, respectively. The bootstrap method failed to provide at least 200 bootstrap estimates from which to compute a CI for 3 of the 60,000+ data sets across all sample sizes; and 1 data set resulted in an empirical posterior distribution with fewer than 1000 iterates for the Bayes CIs. Hence, when estimating coverage probabilities and power, we report on those data sets for which all methods produced a viable CI.
4.1. Coverage Probabilities of Interval Estimators of RP
Table 1 contains coverage error rates of the 5 methods when applied in equal-response design settings. Fieller’s CIs held the nominal 95% level most closely among all the methods, not deviating by more than 0.5 percentage points when sample sizes were ≥ 40; coverage was 97.1% for the smallest sample size of 24. Wald and iLRT CIs were next best overall, holding nominal levels for most sample sizes examined. At worst, confidence coefficients for Wald and iLRT CIs were estimated at 93.6% and 93.3%, respectively. CIs from the bootstrap and Bayes methods were too narrow (optimistic) except for the 2 highest sample sizes examined, Ns = 160 and 224. This pattern of results persisted for 90% CIs.
Table 1.
Percent error rates for nominally 95% and 90% confidence intervalsa when using equal-response designs; computed on data sets for which all methods returned a viable CI.
| Nominal CI Level | Total N | 24 | 40 | 54 | 56 | 90 | 96 | 126 | 160 | 224 | Range of | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M×Nd | 3×4 | 5×4 | 3×9 | 7×4 | 5×9 | 3×16 | 7×9 | 5×16 | 7×16 | Overall | Error ratesc | |
| 95% | Fieller | 2.9 | 5.1 | 4.9 | 5.0 | 5.4 | 5.1 | 5.0 | 4.5 | 5.1 | 4.8 | 1.0 – 8.4 |
| Wald | 6.3 | 6.6 | 5.1 | 5.4 | 5.7 | 5.3 | 5.1 | 4.7 | 5.1 | 5.5 | 2.4 – 8.8 | |
| iLRT | 5.4 | 6.7 | 5.4 | 5.9 | 5.9 | 5.5 | 5.4 | 4.8 | 5.3 | 5.6 | 2.4 – 8.8 | |
| Bootstrap | 7.6 | 7.8 | 6.2 | 6.8 | 6.1 | 5.9 | 5.7 | 5.0 | 5.5 | 6.3 | 3.2 – 9.9 | |
| Bayes | 8.4 | 8.2 | 6.3 | 6.7 | 6.3 | 5.7 | 5.7 | 5.3 | 5.5 | 6.4 | 2.8 – 10.9 | |
| Contributing data setsb | 7304 | 5977 | 5957 | 5999 | 8750 | 5998 | 6000 | 6000 | 8750 | 60735 | ||
| 90% | Fieller | 9.9 | 11.2 | 10.1 | 11.2 | 10.7 | 10.6 | 10.3 | 9.4 | 10.2 | 10.4 | 6.8 – 15.6 |
| Wald | 11.9 | 12.4 | 10.5 | 11.3 | 11.1 | 10.6 | 10.3 | 9.6 | 10.2 | 10.9 | 6.8 – 15.6 | |
| iLRT | 12.4 | 12.7 | 11.0 | 11.9 | 11.2 | 10.8 | 10.5 | 9.5 | 10.3 | 11.1 | 6.8 – 15.7 | |
| Bootstrap | 13.4 | 13.4 | 11.7 | 12.3 | 11.7 | 11.0 | 11.0 | 10.0 | 10.5 | 11.7 | 6.8 – 16.8 | |
| Bayes | 14.4 | 14.6 | 12.3 | 12.8 | 12.2 | 11.6 | 11.2 | 10.1 | 10.6 | 12.2 | 7.6 – 17.6 | |
| Contributing data setsb | 7352 | 5976 | 5956 | 5992 | 8747 | 5992 | 5999 | 5991 | 8745 | 60750 | ||
a. Credible intervals for the Bayes method.
b. Monte Carlo error was no greater than [√(0.918×0.082/5977)]×100% = 0.35 percentage points for 95% CIs and [√(0.854×0.146/5976)]×100% = 0.46 percentage points for 90% CIs.
c. Error rates over 84 M×Nd×β×ES combinations of equal-response designs in both the Coverage-Probability and Design-Power simulations. Combinations in the Coverage-Probability study had 1000 planned data sets, and those in Design-Power had 200 planned data sets.
Though we focus on equal-response designs because they are more efficient than equal-dose designs, we also estimated coverage probabilities from data simulated under equal-dose designs. Equal-dose designs were considered in the Design-Power study, for sample sizes of 3×4, 5×9, and 7×16 in a shallow-slope scenario; see Table 2. Overall, Wald CIs held most closely to nominal level. The iLRT method was second closest, but was too wide (conservative) for the smallest sample size. Fieller CIs tended to be too wide and Bayes CIs too narrow (optimistic).
Table 2.
Percent error rates for nominally 95% and 90% confidence intervalsa when using equal-dose designs; computed on data sets for which all methods returned a viable CI.
| Nominal CI Level | Total N | 24 | 90 | 224 | Range of | |
|---|---|---|---|---|---|---|
| M×Nd | 3×4 | 5×9 | 7×16 | Overall | Error rates | |
| 95% | Fieller | 2.0 | 4.7 | 5.3 | 4.0 | 0.6 – 7.6 |
| Wald | 5.4 | 5.0 | 4.8 | 5.1 | 2.0 – 12.4 | |
| iLRT | 4.1 | 5.2 | 5.4 | 4.9 | 1.2 – 9.2 | |
| Bootstrap | 5.4 | 5.8 | 5.7 | 5.6 | 1.8 – 10.0 | |
| Bayes | 5.9 | 5.9 | 5.7 | 5.9 | 1.2 – 9.6 | |
| Contributing data setsb | 2538 | 2749 | 2750 | 8037 | ||
| 90% | Fieller | 7.1 | 9.8 | 9.9 | 8.9 | 1.9 – 14.1 |
| Wald | 9.4 | 10.5 | 10.0 | 10.0 | 6.4 – 13.3 | |
| iLRT | 9.4 | 10.4 | 10.0 | 9.9 | 3.4 – 14.1 | |
| Bootstrap | 10.0 | 11.4 | 10.5 | 10.6 | 5.8 – 15.3 | |
| Bayes | 11.8 | 11.5 | 10.2 | 11.1 | 6.4 – 15.2 | |
| Contributing data setsb | 2798 | 2749 | 2749 | 8296 | ||
a. Credible intervals for the Bayes method.
b. Monte Carlo error was no greater than [√(0.941×0.059/2538)]×100% = 0.47 percentage points for 95% CIs and [√(0.885×0.115/2749)]×100% = 0.61 percentage points for 90% CIs.
Error rates over 33 M×Nd×ES combinations of equal-dose designs in the Design-Power simulation study; combinations had 200 planned data sets.
4.2. Power for Detecting Effect Size > 0
Here, we report on power of .05 level, one-tailed tests for detecting a specified effect size (ES > 0) versus the null hypothesis that ES0 = 0. For all ES > 0 in the Coverage-Probability simulation, estimated power exceeded 0.90 when the total sample size (2×M×Nd) was 90 or more. Similarly, in the Design-Power simulation, power differences among the methods and dose designs were negligible when total sample size was 90 or more. We thus report on the smaller sample sizes where important differences among the methods can be easily seen. Figure 1 displays estimated power by total sample size from equal-response designs. Figure 2 displays estimated power by effect size for small (2×M×Nd = 24), equal-response designed experiments in which the slope is shallow (β = 2.91).
Figure 1.
From equal-response designs in both the Coverage-Probability and Design-Power studies, power is plotted by total sample size. Power was estimated for detecting effect sizes of 1.0 (top panels) and 1.5 (bottom panels) with one-sided, .05 level tests conducted with the lower bound of 90% CIs. Left panels are from β = 23.25, right panels from β = 2.91. The horizontal black line indicates power of 0.80.
Figure 2.
From equal-response designs in the Design-Power study, power is plotted by effect size. Power assumed one-sided, .05 level tests conducted with the lower bound of 90% CIs produced with the indicated method, and based on (2×M×Nd = 2×3×4=) 24 hypothetical animals. The horizontal black line indicates power of 0.80.
From the Coverage-Probability simulation, Wald’s CIs provided significantly more power than the other CI methods for total sample sizes under 90; Fieller’s CIs provided the least. This pattern was true for both steep and shallow slopes, and for smaller and larger effect sizes (Fig. 1a–d). Considering only equal-response designs in the Design-Power simulation (Fig. 2), Wald’s CIs continued to provide more power than the other methods, and Fieller’s method remained the least powerful.
4.3. Dose Design Effect on Power
Dose design, namely equal-dose vs. equal-response, mattered when using Wald intervals for hypothesis testing: as the true effect size increased, power increased at a slower rate in equal-dose designs than in equal-response designs. Though not as drastic, power from using bootstrap intervals exhibited a similar pattern. Design type made little difference on power in the remaining methods – Fieller’s, iLRT and Bayes’. See Fig. 3a–e.
Figure 3.
From the Design-Power study, power is plotted by effect size. Power assumed one-sided, .05 level tests conducted with the lower bound of 90% CIs produced with the indicated method, and based on (2×M×Nd = 2×3×4=) 24 hypothetical animals. Solid and dotted lines represent equal-response and equal-dose designs, respectively. The horizontal line indicates power of 0.80.
5. DISCUSSION
In a parallel-line assay context, we have examined five methods for producing confidence (or credible) intervals (CIs) for relative potency (RP). We compared coverage probabilities of the CIs among the methods, and also power when using these CIs for hypothesis tests. Because experiment designs of parallel-line assays tend to be either equal-response or equal-dose, we also examined how the CI methods perform under these two experiment designs. We used simulation studies to compare the methods. The simulation studies covered a realistic range of regression parameter settings and sample sizes used in radiation countermeasure research. For total sample sizes of 90 or more, the methods differed little; hence we focused our results on smaller sample sizes (≤ 56).
In small, equal-response experiment designs, Fieller’s and Wald’s CIs tended to be closest to nominal level; iLRT was also reasonably close. The other two methods – bootstrap and Bayes – were consistently below nominal levels. Though Fieller’s CIs were most true to nominal levels, using them for hypothesis testing provided the least power of all the methods. On the other hand, Wald’s CIs clearly had the best power in small experiments, exceeding Fieller’s power by 5 to 15 percentage points in settings tending to have less than 0.80 power. As sample sizes increased (≥ 90), coverage probabilities from all methods approached nominal levels. Additionally, power from CI-based hypothesis testing exceeded 0.80 for all methods when sample sizes were ≥ 90.
Among equal-dose experiments, the only sample size less than 90 we considered was 24, in a shallow dose-response scenario. The Wald and iLRT methods held most closely to nominal levels. An interesting result was how power differed between equal-dose and equal-response experiments with all other parameters equal (method, sample size, size of RP, dose-response slope). Power for Wald’s method increased much more slowly in equal-dose experiments than in equal-response experiments. Also, Wald’s method provided the least power among the methods in equal-dose designs – a complete reversal from the statistically more efficient equal-response designs. The changes seen in Wald’s power between the two experiment designs were similar for the bootstrap method, though the difference was not as great. Finally, power differences between the two experiment designs were negligible when using Bayes’ method.
For the statistically more efficient equal-response designs, we recommend using Wald CIs over the others considered here. Though Wald’s method is considered a large-sample method, its statistical properties were very good for the small sample sizes we examined. Not only do Wald CIs tend to hold nominal confidence levels and provide good power in statistical testing, this formula-based method is easy for researchers to implement using almost any statistical software. Landes et al. (2013) explains how to construct Wald CIs for non-statistician researchers and supplies worksheets in the supplementary material that allow the researchers to input their results to obtain Wald CIs and to help in planning appropriately powered experiments.
Though equal-dose designs are not as statistically efficient as equal-response designs, researchers will continue to use them, whether out of tradition or a potentially valid reason. For these experiment designs, we recommend Bayes’ method. Its CIs held reasonably close to nominal levels and power was also good compared to the other methods. Bayes’ CIs require more statistical sophistication than Wald’s CIs, possibly requiring help from a statistical analyst. And hopefully, the statistical analyst will recommend equal-response designs to the researcher for any future experiments.
Using the most powerful statistical designs and analyses is always preferred, but it is an ethical requirement when experiments involve animal sacrifice. No matter how difficult a method might be to implement, researchers must strive to do the right thing. We learned that when a statistically inefficient equal-dose design is used, a sophisticated statistical method – Bayes’ method – is preferred. Fortunately, though, data from a statistically efficient equal-response design yield more power when using a simple method – Wald’s method – to draw statistical inference.
Supplementary Material
ACKNOWLEDGEMENTS
We are very grateful to Ralph L. Kodell for his insight into and feedback on this work. This work was partially supported by the following grants: (i) the Translational Research Institute (TRI), grant UL1TR000039 through the NIH National Center for Research Resources and National Center for Advancing Translational Sciences, (ii) grant 1P20GM109005 through the NIH Center of Biological Research Excellence, and (iii) grant R21CA184756 through the NIH National Cancer Institute. MHJ received support from U19 AI67798 (NIAID) and the Veterans Administration. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Footnotes
We abbreviate both credible intervals and confidence intervals with CI. Though these differ in interpretation of θ, analysts use them in the same way to make statistical inference on θ.
We accidentally coded the last increment of 3.5 as 3.55. Hence, we report on 3.55 instead of 3.5.
This number (200) of simulations keeps the Monte Carlo error for power estimates at less than √(0.5×0.5/200) ≈ 0.035.
REFERENCES
- Davison AC, Hinkley DV. Bootstrap methods and their application. Cambridge University Press: Cambridge, 1997; 27–31. [Google Scholar]
- Faraggi D, Izikson P, Reiser B. Confidence intervals for the 50 per cent response dose. Statistics in Medicine 2003; 22:1977–1988. [DOI] [PubMed] [Google Scholar]
- FDA (U. S. Food and Drug Administration). New drug and biological drug products: Evidence needed to demonstrate effectiveness of new drugs when human efficacy studies are not ethical or feasible. Federal Register 2002; 67:37988–37998. [PubMed] [Google Scholar]
- Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. Chapman & Hall: London, 1995; 82–86. [Google Scholar]
- Iturria SJ. Statistical inference for relative potency in bivariate dose-response assays with correlated responses. Journal of Biopharmaceutical Statistics 2005; 15:343–351. [DOI] [PubMed] [Google Scholar]
- Kelly GE. The median lethal dose—design and estimation. The Statistician 2001a; 50:41–50. [Google Scholar]
- Kelly GE. Corrigendum: The median lethal dose—design and estimation. The Statistician 2001b; 50:364–366. [Google Scholar]
- Kodell RL, Lensing SY, Landes RD, Kumar KS, Hauer-Jensen M. Determination of sample sizes for demonstrating efficacy in radiation countermeasures. Biometrics 2010; 66:238–248, DOI: 10.1111/j.1541-0420.2009.01236.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Landes RD, Lensing SY, Kodell RL, Hauer-Jensen M. Practical advice on calculating confidence intervals for radioprotection effects and reducing animal numbers in radiation countermeasures experiments. Radiation Research 2013; 180:567–574, DOI: 10.1667/RR13429.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Le C, Grambsch P, Liu A. Estimation of cancer drug potencies and relative potencies from in vitro data. Journal of Biopharmaceutical Statistics 2005; 15:903–912. [DOI] [PubMed] [Google Scholar]
- Sitter RR, Wu CFJ. On the accuracy of Fieller intervals for binary response data. Journal of the American Statistical Association 1993; 88:1021–1025. [Google Scholar]
- USDA (United States Department of Agriculture). Supplemental assay method for potency testing of inactivated rabies vaccine in mice using the National Institutes of Health test. SAM 308.04 2011; 12 pages. Online at http://www.aphis.usda.gov/animal_health/vet_biologics/publications/308.pdf (accessed 5 November 2014).
- Wilbur LA, Aubert MFA. The NIH test for potency In: Meslin FX, Kaplan MM, Koprowski H, editors. Laboratory Techniques in Rabies, 4th ed Geneva, World Health Organization, 1996 [Google Scholar]
- Williams DA. An exact confidence interval for the relative potency estimated from a multivariate bioassay. Biometrics 1988; 44:861–867. [Google Scholar]
- Williams DA. Interval estimation of the median lethal dose. Biometrics 1986; 42:641–645. [PubMed] [Google Scholar]