Abstract
The relative potency of one agent to another is commonly represented by the ratio of two quantal response parameters; for example, the LD50 of animals receiving a treatment to the LD50 of control animals, where LD50 is the dose of toxin that is lethal to 50% of animals. Though others have considered interval estimators of LD50, here we extend Bayesian, bootstrap, likelihood ratio, Fieller’s, and Wald’s methods to estimate intervals for relative potency in a parallel-line assay context. In addition to comparing their coverage probabilities, we also consider their power in two types of dose designs: one assigning treatment and control the same doses versus one choosing doses for treatment and control to achieve the same lethality targets. We explore these methods in realistic contexts of relative potency of radiation countermeasures. For larger experiments (e.g., ≥ 100 animals), the methods return similar results regardless of the interval estimation method or experiment design. For smaller experiments (e.g., < 60 animals), Wald’s method stands out among the others, producing intervals that hold closely to nominal levels and providing more power than the other methods in statistically efficient designs. Using this simple statistical method within a statistically efficient design, researchers can reduce animal numbers.
Keywords: power; sample size; parallel-line assay; dose reduction factor; 3Rs
1. INTRODUCTION
Regulatory agencies like the U.S. Food and Drug Administration (2002), the U.S. Department of Agriculture (2011), and the World Health Organization (Wilbur & Aubert, 1996) require measures of relative potency (RP) for certain pharmaceuticals like radiation countermeasures and vaccines. However, these agencies provide little guidance on statistical methods for producing the required measures, especially interval estimates of these measures. Consequently, research reports often fail to include measures of uncertainty, or even a statistical test, when reporting RP in subject-specific journals (see Landes, Lensing, Kodell, & Hauer-Jensen, 2013, for a review).
In radiation countermeasure research and in vaccine research and production, estimating RP often requires administering toxin (e.g., radiation or viral load) in increasing amounts to groups of control and treated animals, and then recording how many animals survive to the end of the experiment. RPs are typically ratios of treatment and control potency values. For binary outcomes analyzed with probit or logistic regression given log-dose, estimated RPs implicitly assume dose-responses for control and treatment groups are parallel. Hence, the experiment conducted to estimate RP is a parallel-line assay. For these types of experiments, we compare statistical properties of interval estimators of RP, and make recommendations on which estimators to use.
Though several researchers have developed methods to make statistical inference on RP, most have concentrated on special cases. A few, though, have considered more than one method. Williams (1988) developed an exact confidence interval method appropriate for multivariate normal response vectors coming from either parallel-line or slope-ratio models, and compared it to likelihood ratio intervals and a modified version of the likelihood ratio interval on a single data set; no comparison of statistical properties was provided. Iturria (2005) considered three different tests (but not interval estimates) of RP when the responses are correlated between the control and treatment group, and are modeled with nonlinear regression. Le, Grambsch, and Liu (2005) constructed a maximum likelihood estimator for a parallel-line assay in which there are multiple sources of variability and compared its bias and mean square error with the standard maximum likelihood estimator, which ignores the extra variability. Within the present context (quantal-response, parallel-line assays), we found no study that compared interval estimation of RP among methods.
Regarding experiment designs within the parallel-line assay context, often, control and treatment groups receive the same dose levels (equal-dose design). However, assigning doses that target the same response level between groups (equal-response design) provides more statistical power (Landes et al., 2013). We could not find literature investigating whether RP estimation methods depend on experiment design.
We note that RP estimation often depends on an estimated quantal response parameter, e.g., LD50, from reference animals and from test animals. Others, such as Faraggi et al. (2003), Kelly (2001a, 2001b), Sitter & Wu (1993), and Williams (1986), have studied interval estimation of LD50 extensively. Indeed, methods for producing interval estimates of LD50 can be adapted for RP, and we do that here. But the performance of these methods for RP remained unstudied until now. Because we consider equal-response designs, which imply different doses for the two groups, the performance of intervals for LD50 does not necessarily carry over to RP intervals.
In this work, we compare coverage probabilities and power among five interval estimators of RP. We also consider these properties under equal-response experiment designs (preferred) and under equal-dose experiment designs (commonly used). Finally, we make recommendations on which methods are preferred and under what conditions.
2. MODELS, METHODS, AND DESIGNS
We frame our model in terms of animal deaths, y, among a group of n animals receiving dose X* of a toxin. Defining LDP to be the dose lethal to P% of a population, we hypothesize that LDP for animals treated with an antitoxin is higher than the LDP for controls. In radiation countermeasure research, the preferred measure of RP is called a dose reduction factor and is the ratio of the LD50 of treated animals to the LD50 of control animals.
2.1. Model
Under treatment j with j = 0 or 1, njk animals receive dose X*jk with k = 1, 2, …, mj; of these, Yjk animals die. We model
Yjk ~ Binomial(njk, pjk), with pjk = Φ(α0 + α1Tj + βXjk),   (Display 1)
where Φ is the standard normal distribution function, Tj indicates receipt of treatment (1) or control (0), and Xjk is log10 of X*jk. This model is a probit regression.
Typically, researchers are interested in three parameters available from this model: the doses that cause death in 50% of control animals, LD50(0) and in 50% of treated animals, LD50(1), and the potency of dose in control animals, relative to treated animals, a.k.a. relative potency, RP = LD50(1) / LD50(0). Since researchers most often aim to estimate and test RP, and estimation and testing methods for LD50 are well studied, we focus on RP. From Display 1, the base-10 logarithm of RP is θ = −α1/β. We describe confidence interval estimation methods below in terms of θ.
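To make estimation of θ concrete, the probit model in Display 1 can be fit by maximum likelihood and θ̂ computed from the fitted coefficients. The sketch below uses NumPy/SciPy rather than the paper's supplementary R/WinBUGS code, and the doses, group sizes, and death counts are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical quantal-response data (not from the paper):
T = np.array([0, 0, 0, 1, 1, 1])                    # treatment indicator Tj
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])  # log10 doses Xjk
n = np.array([10, 10, 10, 10, 10, 10])              # animals per dose group
y = np.array([2, 5, 9, 1, 6, 8])                    # observed deaths Yjk

def negloglik(par):
    # Binomial negative log-likelihood under the probit model of Display 1
    a0, a1, b = par
    p = np.clip(norm.cdf(a0 + a1 * T + b * X), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

fit = minimize(negloglik, x0=[0.0, 0.0, 1.0], method="BFGS")
a0_hat, a1_hat, b_hat = fit.x
theta_hat = -a1_hat / b_hat    # log10 relative potency
rp_hat = 10 ** theta_hat       # estimated RP
```

Because the probit log-likelihood is concave in the parameters, a generic quasi-Newton optimizer suffices here.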
2.2. Methods
We consider five methods for constructing interval estimates: two use computer intensive methods (Bayes’ and bootstrap), one requires numerical maximization (inverted likelihood ratio test), and two use a closed-form formula (Fieller’s theorem and Wald’s method). Supplementary material contains R and WinBUGS code that implements each of the methods below on an example data set. For some of the methods below, power must be evaluated with simulation; hence, we use simulation to evaluate power for all of the methods.
2.2.1. Bayes’ Method
We adapt the method for estimating LD50 of a population presented in Gelman, Carlin, Stern & Rubin (1995). From Display 1, assign priors αj ~ N(Aj, V(α)j) for j = 0, 1, and β ~ N(B, V(β)), choosing the V(·) to be sufficiently diffuse. With these priors and the data model described by Display 1, obtain an empirical posterior distribution of θ|y via Markov chain Monte Carlo, where θ = −α1/β and y is the vector of observed deaths. The q/2 and 1 − q/2 quantiles form a (1 − q)100% credible interval (CI¹) for θ. The maximum likelihood estimates (MLEs) of (α0, α1, β), and values near them, provide useful starting values for the Markov chain Monte Carlo.
Regarding values for prior variances, V(·): Smaller, presumably more informative, values for V(α)j and V(β) shorten CIs for αj and β, as expected. However, for the model in Display 1, CIs for θ, a ratio of α1 and β, increase in width; this is especially true when the prior means, Aj and B, are relatively far from the MLEs of αj and β. Pragmatically, the priors will be sufficiently diffuse when V(α)j > |Aj − MLE(αj)|² and V(β) > |B − MLE(β)|².
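A minimal random-walk Metropolis sampler illustrates the Bayesian method under diffuse N(0, 1000) priors; the data set, starting values, and proposal scales below are hypothetical tuning choices, not values from this study:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

V_prior = 1000.0                          # diffuse N(0, 1000) priors

def log_post(par):
    # log posterior = binomial log-likelihood + normal log-priors (up to a constant)
    a0, a1, b = par
    p = np.clip(norm.cdf(a0 + a1 * T + b * X), 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (n - y) * np.log(1 - p))
    logprior = -0.5 * np.sum(par**2) / V_prior
    return loglik + logprior

# Random-walk Metropolis, started near plausible values (a tuning choice)
cur = np.array([-5.0, -1.0, 6.0])
cur_lp = log_post(cur)
draws = []
for _ in range(20000):
    prop = cur + rng.normal(scale=[0.8, 0.5, 0.9], size=3)
    prop_lp = log_post(prop)
    if np.log(rng.uniform()) < prop_lp - cur_lp:
        cur, cur_lp = prop, prop_lp
    draws.append(cur)

theta = np.array([-a1 / b for _, a1, b in draws[5000:]])  # discard burn-in
ci95 = np.quantile(theta, [0.025, 0.975])                  # 95% credible interval
```

In practice one would use dedicated MCMC software (such as the WinBUGS code in the supplementary material) with convergence diagnostics; this sketch only shows the mechanics.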
2.2.2. Bootstrap
We adapt the method for estimating LD50 of a population presented in Kelly (2001a). Obtain the MLEs of (α0, α1, β) from the original data set. Keeping the original (T, X, n) data, substitute the MLEs into the model described by Display 1, and generate a large number of bootstrap data sets of Y. Since the originally estimated parameters are used to generate the response data Y, this is a parametric bootstrap. We generated 1000 bootstrap data sets (Davison & Hinkley, 1997). From each bootstrap data set, obtain the MLE of (α0, α1, β), and compute the estimate of θ. The q/2 and 1 − q/2 quantiles of the bootstrap distribution form the (1 − q)100% confidence interval (CI) for θ.
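A sketch of the parametric bootstrap, again on hypothetical data; fit_probit is an illustrative helper, and bootstrap fits with a numerically zero slope are simply skipped:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

def fit_probit(deaths):
    """MLE of (alpha0, alpha1, beta) under the probit model of Display 1."""
    def nll(par):
        p = np.clip(norm.cdf(par[0] + par[1] * T + par[2] * X), 1e-12, 1 - 1e-12)
        return -np.sum(deaths * np.log(p) + (n - deaths) * np.log(1 - p))
    return minimize(nll, x0=[0.0, 0.0, 1.0], method="BFGS").x

a0, a1, b = fit_probit(y)
p_hat = norm.cdf(a0 + a1 * T + b * X)    # fitted death probabilities

thetas = []
for _ in range(1000):                    # 1000 parametric bootstrap data sets
    y_boot = rng.binomial(n, p_hat)      # resample deaths from the fitted model
    _, b1, bb = fit_probit(y_boot)
    if abs(bb) > 1e-8:                   # skip degenerate fits
        thetas.append(-b1 / bb)

ci95 = np.quantile(thetas, [0.025, 0.975])   # percentile bootstrap CI for theta
```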
2.2.3. Inverted Likelihood Ratio Test
We adapt the method for estimating LD50 of a population presented in Williams (1986). Parameterizing α0 + α1Tj + βXjk from Display 1 in terms of θ gives α0 + β(Xjk − θTj). The likelihood ratio test of θ = θ0 yields the χ²(1) test statistic L(θ0) − L(θ̂), where L(·) is −2×log-likelihood and θ̂ is the MLE. The values θ∗ that satisfy L(θ∗) − L(θ̂) = χ²(1),1−q form the endpoints of an approximate (1 − q)100% CI for θ. However, when a likelihood ratio test of β = 0 cannot be rejected at the q significance level, the (1 − q)100% CI for θ has infinite endpoints (Kelly, 2001a).
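The profile-likelihood computation can be sketched as follows; the data and the search brackets are hypothetical choices, and the interval endpoints are found by root-finding on the profile deviance:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar, brentq
from scipy.stats import norm, chi2

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

def deviance(theta):
    # -2 x log-likelihood, profiled over (alpha0, beta) at fixed theta,
    # using the reparameterized linear predictor alpha0 + beta*(X - theta*T)
    def nll(par):
        a0, b = par
        p = np.clip(norm.cdf(a0 + b * (X - theta * T)), 1e-12, 1 - 1e-12)
        return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))
    return 2 * minimize(nll, x0=[-5.0, 6.0], method="Nelder-Mead").fun

# MLE of theta minimizes the profile deviance (bounds are illustrative)
opt = minimize_scalar(deviance, bounds=(-1.0, 1.0), method="bounded")
theta_hat, dev_min = opt.x, opt.fun
crit = chi2.ppf(0.95, df=1)            # for a 95% CI

g = lambda t: deviance(t) - dev_min - crit   # roots are the CI endpoints
lower = brentq(g, theta_hat - 2.0, theta_hat)
upper = brentq(g, theta_hat, theta_hat + 2.0)
```

The brackets (θ̂ ± 2) assume the deviance has climbed past the χ² critical value within them; with a flat likelihood in β, no such roots exist and the CI is infinite, as noted above.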
2.2.4. Fieller’s Method
The upper and lower bounds of Fieller’s (1 − q)100% CI for θ are given by

[ θ̂ + g(v̂12/v̂22) ± (z1−q/2/β̂)√( v̂11 + 2θ̂v̂12 + θ̂²v̂22 − g(v̂11 − v̂12²/v̂22) ) ] / (1 − g),

where θ̂ = −α̂1/β̂; α̂1 and β̂ are the MLEs; z1−q/2 is the 1 − q/2 quantile from the standard normal distribution; g = z²1−q/2·v̂22/β̂²; and v̂11, v̂12, and v̂22 are, respectively, the estimated variance of α̂1, covariance of α̂1 and β̂, and variance of β̂ from the estimated variance-covariance matrix of (α̂1, β̂). When the denominator, 1 − g, is negative, the (1 − q)100% CI for θ will either be of infinite length or will be all points outside of a finite interval. This occurs when a size q Wald test of β = 0 is not significant. We note that, though β may statistically differ from 0, this does not guarantee that θ statistically differs from 0.
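As a numeric sketch (not this paper's supplementary code), Fieller bounds for θ = −α1/β can be computed from the MLEs and an estimated covariance matrix; here the covariance comes from a finite-difference observed-information matrix, and the data are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

def negloglik(par):
    p = np.clip(norm.cdf(par[0] + par[1] * T + par[2] * X), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

mle = minimize(negloglik, x0=[0.0, 0.0, 1.0], method="BFGS").x
a1_hat, b_hat = mle[1], mle[2]
theta_hat = -a1_hat / b_hat

def hessian(f, x, h=1e-4):
    # observed information via central finite differences
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = h
            ej = np.zeros(k); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

V = np.linalg.inv(hessian(negloglik, mle))     # asymptotic covariance matrix
v11, v12, v22 = V[1, 1], V[1, 2], V[2, 2]      # terms for (alpha1, beta)

z = norm.ppf(0.975)                            # 95% CI
g = z**2 * v22 / b_hat**2
disc = v11 + 2 * theta_hat * v12 + theta_hat**2 * v22 - g * (v11 - v12**2 / v22)
half = (z / abs(b_hat)) * np.sqrt(disc)
center = theta_hat + g * v12 / v22
lower, upper = sorted([(center - half) / (1 - g), (center + half) / (1 - g)])
```

When g ≥ 1 (a non-significant slope), the division by 1 − g produces the degenerate intervals described above; this sketch assumes g < 1.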
2.2.5. Wald’s Method
The upper and lower bounds of Wald’s (1 − q)100% CI for θ are given by

θ̂ ± (z1−q/2/|β̂|)√( v̂11 + 2θ̂v̂12 + θ̂²v̂22 ),

where θ̂, β̂, z1−q/2, v̂11, v̂12, and v̂22 are as in Fieller’s method.
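A corresponding sketch of Wald's method: the standard error of θ̂ follows from the delta method applied to θ = −α1/β. The data are hypothetical, and the covariance matrix again comes from a finite-difference observed-information matrix:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical quantal-response data (not from the paper)
T = np.array([0, 0, 0, 1, 1, 1])
X = np.array([0.70, 0.85, 1.00, 0.85, 1.00, 1.15])
n = np.array([10, 10, 10, 10, 10, 10])
y = np.array([2, 5, 9, 1, 6, 8])

def negloglik(par):
    p = np.clip(norm.cdf(par[0] + par[1] * T + par[2] * X), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

mle = minimize(negloglik, x0=[0.0, 0.0, 1.0], method="BFGS").x
a1_hat, b_hat = mle[1], mle[2]
theta_hat = -a1_hat / b_hat

def hessian(f, x, h=1e-4):
    # observed information via central finite differences
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = h
            ej = np.zeros(k); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

V = np.linalg.inv(hessian(negloglik, mle))    # asymptotic covariance matrix
v11, v12, v22 = V[1, 1], V[1, 2], V[2, 2]     # terms for (alpha1, beta)

# Delta method: Var(theta_hat) ~ (v11 + 2*theta*v12 + theta^2*v22) / beta^2
z = norm.ppf(0.975)                           # 95% CI
se = np.sqrt(v11 + 2 * theta_hat * v12 + theta_hat**2 * v22) / abs(b_hat)
lower, upper = theta_hat - z * se, theta_hat + z * se
```

Note that as g → 0 (a precisely estimated slope), Fieller's bounds converge to these Wald bounds.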
2.3. Designs
Two types of dose designs appear in relative potency literature: equal-dose and equal-response designs. In equal-dose designs, treatment and control dose groups are the same; that is, from Display 1, X*0k = X*1k for all k. Consequently, when RP > 1, the expected response probabilities for treated animals are smaller than those for control animals at any given X*. In equal-response designs, treatment and control dose groups are not equal; that is, X*0k ≠ X*1k for all k. Rather, treatment’s X*1k are chosen to achieve the same expected response probabilities as control’s X*0k. For example, a researcher plans control doses, X*0k, to achieve LD13.6, LD50, and LD86.4, and, based on a hypothesized RP > 1, chooses treatment doses, X*1k, that will achieve LD13.6, LD50, and LD86.4. The treatment doses will necessarily be higher than the corresponding control doses.
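The equal-response dose computation can be sketched by inverting the probit model at the target lethality levels. The planning values mirror the shallow-slope scenario used later in the simulations (α0 = 1.5714, β = 2.91); the hypothesized RP of 2.0 is arbitrary:

```python
import numpy as np
from scipy.stats import norm

beta, alpha0 = 2.91, 1.5714         # planning values (shallow-slope scenario)
rp_hyp = 2.0                        # hypothesized RP (illustrative)
theta = np.log10(rp_hyp)            # RP on the log10 scale
alpha1 = -theta * beta              # implied treatment intercept shift

P = np.array([0.136, 0.50, 0.864])  # target lethality levels (M = 3 design)
X0 = (norm.ppf(P) - alpha0) / beta            # control log10 doses
X1 = (norm.ppf(P) - alpha0 - alpha1) / beta   # treatment log10 doses
doses0, doses1 = 10 ** X0, 10 ** X1
# Each treatment dose equals the corresponding control dose times the
# hypothesized RP, since X1 - X0 = theta at every target level.
```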
3. SIMULATIONS FOR STUDYING COVERAGE PROBABILITIES, POWER, AND DESIGN ISSUES
We conducted two simulation studies. Common to both was the objective of comparing a treatment to a control. The first study was primarily to evaluate coverage probabilities of interval estimators (expressed throughout as percentages); we refer to this study as “Coverage-Probability.” Studying design issues and power was the main objective of a smaller study, referred to as “Design-Power.”
3.1. Coverage-Probability Study
We considered the number of doses (M), the number of animals per treatment×dose (Nd), slope (β) on log10 dose, and effect size (ES, defined below). Experimenters control M and Nd. We chose values for M and Nd that represent small, medium, and large numbers typical of those found in radiation countermeasure experiments (Table S1 in Landes et al., 2013). The M doses for control and treatment targeted the same expected response probabilities; i.e., equal-response design.
In practice, experimental data provide estimates of β and ES. We consider a steep and a shallow value of β, the same as used in Kodell et al. (2010). The steeper β is similar to those found in radiation countermeasure experiments (Table S1 in Landes et al., 2013). The shallow β comes from an insulin example in Finney (1978, Chapter 18). The values for ES also come from Kodell et al. We note that in the probit model (Display 1), the tolerance distribution of log10 dose is normal with standard deviation 1/β. Then ES is the relative potency on the log10 scale, divided by this standard deviation, or θ×β = −α1 (Display 1). For steep β, the relative potencies corresponding to the ES we consider are within the range of those typically found in radiation countermeasure experiments.
The Coverage-Probability study covers all 54 combinations of these factors {levels}: M {3, 5, 7} × Nd {4, 9, 16} × β {2.91, 23.25} × ES {0, 1, 1.5}. For each M, the targeted expected response probabilities, P, of LDP were {0.136, 0.50, 0.864} for M = 3; {0.05, 0.275, 0.50, 0.725, 0.95} for M = 5; and {0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95} for M = 7. Given β and P, αj is needed to compute the log10 doses, Xjk = ( Φ−1(Pk) − αj ) / β, where Φ−1 is the inverse c.d.f. of the standard normal distribution. We note that α0 is a location parameter and has no bearing on coverage probabilities or power; α1 = −θ×β is the negative of the effect size (ES). Our simulation used these (α0, β) pairs: (1.5714, 2.91) and (−23.25, 23.25). For each of the 54 combinations, we generated 1000 random data sets under the model in Display 1.
Regarding the Bayesian method, we assigned these priors: αj ~ N(0, 1000) for j = 0, 1, and β ~ N(0, 1000). And for the bootstrap method, we obtained 1000 bootstrap datasets for each simulated dataset.
3.2. Design-Power Study
We considered experiment size (M×Nd), dose design, and effect size (ES). Experiment sizes were small (3×4), medium (5×9), or large (7×16); total sample size is 2×M×Nd. Dose designs were equal-response or equal-dose. ES ranged from 0 to 1.79 as relative potencies ranged from 1 to 3.50 in 0.25 increments². Altogether there were 66 settings (3 experiment sizes × 2 dose designs × 11 ES). We used the shallow β = 2.91, with α0 = 1.5714, for all settings. Because of the large number of settings, we reduced the number of simulated data sets per setting. For small M and small Nd, sometimes there was no variation in the responses within each of the M dose groups; i.e., either all animals survived or all died within each dose group. In those instances, the Newton–Raphson maximization of the likelihood function failed to converge. Knowing this failure would likely occur for some M×Nd settings, we generated enough data sets (between 250 and 350) to obtain at least 200 simulated data sets for which the likelihood could be maximized³. Data generation was as in the Coverage-Probability study. We used the same priors for the Bayesian method, and the same number of bootstrap samples for the bootstrap method, as in the first simulation study.
3.3. Analyses of Simulated Data
From each simulated data set, we obtained interval estimates with each of the methods described in Section 2. Practitioners have control over M and Nd, but must estimate regression parameters, represented by β and ES in our simulation studies. For this reason, we report on statistical properties of estimators for each combination M×Nd, averaging over β×ES levels.
The Coverage-Probability study included ES>0, and the Design-Power study included equal-response designs. Hence, when estimating coverage probabilities and power of equal-response designs, we used relevant settings from both simulation studies. Statistical properties of the CI methods in equal-dose designs come only from the Design-Power study.
4. RESULTS OF SIMULATION STUDIES
For all methods, if the likelihood could not be maximized on a data set, then that data set was discarded. The M×Nd = 3×4 settings suffered the most, with 11.3% failed likelihood maximizations among 8750 equal-response data sets and 22.0% among 3850 equal-dose data sets. Likelihood maximization success rates for all other M×Nd settings exceeded 99.4%.
Both the Fieller and inverted likelihood ratio test (iLRT) methods can result in infinite confidence intervals. These infinite CIs were an issue for the M×Nd = 3×4 settings. For equal-response data sets, 6% returned infinite 95% CIs for each of the iLRT and Fieller methods. And for equal-dose data sets, infinite 95% CIs resulted for 11% and 16% of the data sets for iLRT and Fieller, respectively. The bootstrap method failed to provide at least 200 bootstrap estimates from which to compute a CI for 3 of the 60,000+ data sets across all sample sizes; and 1 data set resulted in an empirical posterior distribution with fewer than 1000 iterates for the Bayes CIs. Hence, when estimating coverage probabilities and power, we report on those data sets for which all methods produced a viable CI.
4.1. Coverage Probabilities of Interval Estimators of RP
Table 1 contains coverage error rates of the 5 methods when applied in equal-response design settings. Fieller’s CIs held the nominal 95% level most closely among all the methods, not deviating by more than 0.5 percentage points when sample sizes were ≥ 40; coverage was 97.1% for the smallest sample size of 24. Wald and iLRT CIs were next best overall, holding nominal levels for most sample sizes examined. At worst, confidence coefficients for Wald and iLRT CIs were estimated at 93.6% and 93.3%, respectively. CIs from the bootstrap and Bayes methods were too narrow (optimistic) except for the 2 highest sample sizes examined, Ns = 160 and 224. This pattern of results persisted for 90% CIs.
Table 1.
Percent error rates for nominally 95% and 90% confidence intervalsa when using equal-response designs; computed on data sets for which all methods returned a viable CI.
| Nominal CI Level | Total N | 24 | 40 | 54 | 56 | 90 | 96 | 126 | 160 | 224 | Range of | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M×Nd | 3×4 | 5×4 | 3×9 | 7×4 | 5×9 | 3×16 | 7×9 | 5×16 | 7×16 | Overall | Error ratesc | |
| 95% | Fieller | 2.9 | 5.1 | 4.9 | 5.0 | 5.4 | 5.1 | 5.0 | 4.5 | 5.1 | 4.8 | 1.0 – 8.4 |
| Wald | 6.3 | 6.6 | 5.1 | 5.4 | 5.7 | 5.3 | 5.1 | 4.7 | 5.1 | 5.5 | 2.4 – 8.8 | |
| iLRT | 5.4 | 6.7 | 5.4 | 5.9 | 5.9 | 5.5 | 5.4 | 4.8 | 5.3 | 5.6 | 2.4 – 8.8 | |
| Bootstrap | 7.6 | 7.8 | 6.2 | 6.8 | 6.1 | 5.9 | 5.7 | 5.0 | 5.5 | 6.3 | 3.2 – 9.9 | |
| Bayes | 8.4 | 8.2 | 6.3 | 6.7 | 6.3 | 5.7 | 5.7 | 5.3 | 5.5 | 6.4 | 2.8 – 10.9 | |
| Contributing data setsb | 7304 | 5977 | 5957 | 5999 | 8750 | 5998 | 6000 | 6000 | 8750 | 60735 | ||
| 90% | Fieller | 9.9 | 11.2 | 10.1 | 11.2 | 10.7 | 10.6 | 10.3 | 9.4 | 10.2 | 10.4 | 6.8 – 15.6 |
| Wald | 11.9 | 12.4 | 10.5 | 11.3 | 11.1 | 10.6 | 10.3 | 9.6 | 10.2 | 10.9 | 6.8 – 15.6 | |
| iLRT | 12.4 | 12.7 | 11.0 | 11.9 | 11.2 | 10.8 | 10.5 | 9.5 | 10.3 | 11.1 | 6.8 – 15.7 | |
| Bootstrap | 13.4 | 13.4 | 11.7 | 12.3 | 11.7 | 11.0 | 11.0 | 10.0 | 10.5 | 11.7 | 6.8 – 16.8 | |
| Bayes | 14.4 | 14.6 | 12.3 | 12.8 | 12.2 | 11.6 | 11.2 | 10.1 | 10.6 | 12.2 | 7.6 – 17.6 | |
| Contributing data setsb | 7352 | 5976 | 5956 | 5992 | 8747 | 5992 | 5999 | 5991 | 8745 | 60750 | ||
a. Credible intervals for the Bayes method.
b. Monte Carlo error was no greater than [√(0.918×0.082/5977)]×100% = 0.35 percentage points for 95% CIs and [√(0.854×0.146/5976)]×100% = 0.46 percentage points for 90% CIs.
c. Error rates over 84 M×Nd×β×ES combinations of equal-response designs in both the Coverage-Probability and Design-Power simulations. Combinations in the Coverage-Probability study had 1000 planned data sets, and those in Design-Power had 200 planned data sets.
Though we focus on equal-response designs because they are more efficient than equal-dose designs, we also estimated coverage probabilities from data simulated under equal-dose designs. Equal-dose designs were considered in the Design-Power study, for sample sizes of 3×4, 5×9, and 7×16 in a shallow-slope scenario; see Table 2. Overall, Wald CIs held most closely to nominal level. The iLRT method was second closest, but was too wide (conservative) for the smallest sample size. Fieller CIs tended to be too wide and Bayes CIs too narrow (optimistic).
Table 2.
Percent error rates for nominally 95% and 90% confidence intervalsa when using equal-dose designs; computed on data sets for which all methods returned a viable CI.
| Nominal CI Level | Total N | 24 | 90 | 224 | Range of | |
|---|---|---|---|---|---|---|
| M×Nd | 3×4 | 5×9 | 7×16 | Overall | Error rates | |
| 95% | Fieller | 2.0 | 4.7 | 5.3 | 4.0 | 0.6 – 7.6 |
| Wald | 5.4 | 5.0 | 4.8 | 5.1 | 2.0 – 12.4 | |
| iLRT | 4.1 | 5.2 | 5.4 | 4.9 | 1.2 – 9.2 | |
| Bootstrap | 5.4 | 5.8 | 5.7 | 5.6 | 1.8 – 10.0 | |
| Bayes | 5.9 | 5.9 | 5.7 | 5.9 | 1.2 – 9.6 | |
| Contributing data setsb | 2538 | 2749 | 2750 | 8037 | ||
| 90% | Fieller | 7.1 | 9.8 | 9.9 | 8.9 | 1.9 – 14.1 |
| Wald | 9.4 | 10.5 | 10.0 | 10.0 | 6.4 – 13.3 | |
| iLRT | 9.4 | 10.4 | 10.0 | 9.9 | 3.4 – 14.1 | |
| Bootstrap | 10.0 | 11.4 | 10.5 | 10.6 | 5.8 – 15.3 | |
| Bayes | 11.8 | 11.5 | 10.2 | 11.1 | 6.4 – 15.2 | |
| Contributing data setsb | 2798 | 2749 | 2749 | 8296 | ||
a. Credible intervals for the Bayes method.
b. Monte Carlo error was no greater than [√(0.941×0.059/2538)]×100% = 0.47 percentage points for 95% CIs and [√(0.885×0.115/2749)]×100% = 0.61 percentage points for 90% CIs.
Error rates over 33 M×Nd×ES combinations of equal-dose designs in the Design-Power simulation study; combinations had 200 planned data sets.
4.2. Power for Detecting Effect Size > 0
Here, we report on power of .05 level, one-tailed tests for detecting a specified effect size (ES > 0) versus the null hypothesis that ES0 = 0. For all ES > 0 in the Coverage-Probability simulation, estimated power exceeded 0.90 when the total sample size (2×M×Nd) was 90 or more. Similarly, in the Design-Power simulation, power differences among the methods and dose designs were negligible when total sample size was 90 or more. We thus report on the smaller sample sizes where important differences among the methods can be easily seen. Figure 1 displays estimated power by total sample size from equal-response designs. Figure 2 displays estimated power by effect size for small (2×M×Nd = 24), equal-response designed experiments in which the slope is shallow (β = 2.91).
Figure 1.
From equal-response designs in both the Coverage-Probability and Design-Power studies, power is plotted by total sample size. Power was estimated for detecting effect sizes of 1.0 (top panels) and 1.5 (bottom panels) with one-sided, .05 level tests conducted with the lower bound of 90% CIs. Left panels are from β = 23.25, right panels from β = 2.91. The horizontal black line indicates power of 0.80.
Figure 2.
From equal-response designs in the Design-Power study, power is plotted by effect size. Power assumed one-sided, .05 level tests conducted with the lower bound of 90% CIs produced with the indicated method, and based on (2×M×Nd = 2×3×4=) 24 hypothetical animals. The horizontal black line indicates power of 0.80.
From the Coverage-Probability simulation, Wald’s CIs provided significantly more power than the other CI methods for total sample sizes under 90; Fieller’s CIs provided the least. This pattern was true for both steep and shallow slopes, and for smaller and larger effect sizes (Fig. 1a–d). Considering only equal-response designs in the Design-Power simulation (Fig. 2), Wald’s CIs continued to provide more power than the other methods, and Fieller’s method remained the least powerful.
4.3. Dose Design Effect on Power
Dose design, namely equal-dose vs. equal-response, mattered when using Wald intervals for hypothesis testing: as the true effect size increased, power increased at a slower rate in equal-dose designs than in equal-response designs. Though not as drastic, power from using bootstrap intervals exhibited a similar pattern. Design type made little difference on power in the remaining methods – Fieller’s, iLRT and Bayes’. See Fig. 3a–e.
Figure 3.
From the Design-Power study, power is plotted by effect size. Power assumed one-sided, .05 level tests conducted with the lower bound of 90% CIs produced with the indicated method, and based on (2×M×Nd = 2×3×4=) 24 hypothetical animals. Solid and dotted lines represent equal-response and equal-dose designs, respectively. The horizontal line indicates power of 0.80.
5. DISCUSSION
In a parallel-line assay context, we have examined five methods for producing confidence (or credible) intervals (CIs) for relative potency (RP). We compared coverage probabilities of the CIs among the methods, and also power when using these CIs for hypothesis tests. Because experiment designs of parallel-line assays tend to be either equal-response or equal-dose, we also examined how the CI methods perform under these two experiment designs. We used simulation studies to compare the methods. The simulation studies covered a realistic range of regression parameter settings and sample sizes used in radiation countermeasure research. For total sample sizes of 90 or more, the methods differed little; hence we focused our results on smaller sample sizes (≤ 56).
In small, equal-response experiment designs, Fieller’s and Wald’s CIs tended to be closest to nominal level; iLRT was also reasonably close. The other two methods – bootstrap and Bayes – were consistently below nominal levels. Though Fieller’s CIs were most true to nominal levels, using them for hypothesis testing provided the least power of all the methods. On the other hand, Wald’s CIs clearly had the best power in small experiments, exceeding Fieller’s power by 5 to 15 percentage points in settings tending to have less than 0.80 power. As sample sizes increased (≥ 90), coverage probabilities from all methods approached nominal levels. Additionally, power from CI-based hypothesis testing exceeded 0.80 for all methods when sample sizes were ≥ 90.
Among equal-dose experiments, the only sample size less than 90 we considered was 24, in a shallow dose-response scenario. The Wald and iLRT methods held most closely to nominal levels. An interesting result was how power differed between equal-dose and equal-response experiments with all other parameters equal (method, sample size, size of RP, dose-response slope). Power for Wald’s method increased much more slowly in equal-dose experiments than in equal-response experiments. Also, Wald’s method provided the least power among the methods in equal-dose designs – a complete reversal from the statistically more efficient equal-response designs. The changes seen in Wald’s power between the two experiment designs were similar for the bootstrap method, though the difference was not as great. Finally, power differences between the two experiment designs were negligible when using Bayes’ method.
For the statistically more efficient equal-response designs, we recommend using Wald CIs over the others considered here. Though Wald’s method is considered a large-sample method, its statistical properties were very good for the small sample sizes we examined. Not only do Wald CIs tend to hold nominal confidence levels and provide good power in statistical testing, this formula-based method is easy for researchers to implement using almost any statistical software. Landes et al. (2013) explains how to construct Wald CIs for non-statistician researchers and supplies worksheets in the supplementary material that allow the researchers to input their results to obtain Wald CIs and to help in planning appropriately powered experiments.
Though equal-dose designs are not as statistically efficient as equal-response designs, researchers will continue to use them, whether out of tradition or a potentially valid reason. For these experiment designs, we recommend Bayes’ method. Its CIs held reasonably close to nominal levels and power was also good compared to the other methods. Bayes’ CIs require more statistical sophistication than Wald’s CIs, possibly requiring help from a statistical analyst. And hopefully, the statistical analyst will recommend equal-response designs to the researcher for any future experiments.
Using the most powerful statistical designs and analyses is always preferred, but it is an ethical requirement when experiments involve animal sacrifice. No matter how difficult a method might be to implement, researchers must strive to do the right thing. We learned that when a statistically inefficient equal-dose design is used, a sophisticated statistical method – Bayes’ method – is preferred. Fortunately, though, data from a statistically efficient equal-response design yield more power when using a simple method – Wald’s method – to draw statistical inference.
Supplementary Material
ACKNOWLEDGEMENTS
We are very grateful to Ralph L. Kodell for his insight into and feedback on this work. This work was partially supported by the following grants: (i) the Translational Research Institute (TRI), grant UL1TR000039 through the NIH National Center for Research Resources and National Center for Advancing Translational Sciences, (ii) grant 1P20GM109005 through the NIH Center of Biological Research Excellence, and (iii) grant R21CA184756 through the NIH National Cancer Institute. MHJ received support from U19 AI67798 (NIAID) and the Veterans Administration. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Footnotes
We abbreviate both credible intervals and confidence intervals with CI. Though these differ in interpretation of θ, analysts use them in the same way to make statistical inference on θ.
We accidentally coded the last increment of 3.5 as 3.55. Hence, we report on 3.55 instead of 3.5.
This number (200) of simulations keeps the Monte Carlo error for power estimates at less than √(0.5×0.5/200) ≈ 0.035.
REFERENCES
- Davison AC, Hinkley DV. Bootstrap methods and their application. Cambridge University Press: Cambridge, 1997; 27–31. [Google Scholar]
- Faraggi D, Izikson P, Reiser B. Confidence intervals for the 50 per cent response dose. Statistics in Medicine 2003; 22:1977–1988. [DOI] [PubMed] [Google Scholar]
- FDA (U. S. Food and Drug Administration). New drug and biological drug products: Evidence needed to demonstrate effectiveness of new drugs when human efficacy studies are not ethical or feasible. Federal Register 2002; 67:37988–37998. [PubMed] [Google Scholar]
- Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. Chapman & Hall: London, 1995; 82–86. [Google Scholar]
- Iturria SJ. Statistical inference for relative potency in bivariate dose-response assays with correlated responses. Journal of Biopharmaceutical Statistics 2005; 15:343–351. [DOI] [PubMed] [Google Scholar]
- Kelly GE. The median lethal dose—design and estimation. The Statistician 2001a; 50:41–50. [Google Scholar]
- Kelly GE. Corrigendum: The median lethal dose—design and estimation. The Statistician 2001b; 50:364–366. [Google Scholar]
- Kodell RL, Lensing SY, Landes RD, Kumar KS, Hauer-Jensen M. Determination of sample sizes for demonstrating efficacy in radiation countermeasures. Biometrics 2010; 66:238–248, DOI: 10.1111/j.1541-0420.2009.01236.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Landes RD, Lensing SY, Kodell RL, Hauer-Jensen M. Practical advice on calculating confidence intervals for radioprotection effects and reducing animal numbers in radiation countermeasures experiments. Radiation Research 2013; 180:567–574, DOI: 10.1667/RR13429.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Le C, Grambsch P, Liu A. Estimation of cancer drug potencies and relative potencies from in vitro data. Journal of Biopharmaceutical Statistics 2005; 15:903–912. [DOI] [PubMed] [Google Scholar]
- Sitter RR, Wu CFJ. On the accuracy of Fieller intervals for binary response data. Journal of the American Statistical Association 1993; 88:1021–1025. [Google Scholar]
- USDA (United States Department of Agriculture). Supplemental assay method for potency testing of inactivated rabies vaccine in mice using the National Institutes of Health test. SAM 308.04 2011; 12 pages. Online at http://www.aphis.usda.gov/animal_health/vet_biologics/publications/308.pdf (accessed 5 November 2014).
- Wilbur LA, Aubert MFA. The NIH test for potency In: Meslin FX, Kaplan MM, Koprowski H, editors. Laboratory Techniques in Rabies, 4th ed Geneva, World Health Organization, 1996 [Google Scholar]
- Williams DA. An exact confidence interval for the relative potency estimated from a multivariate bioassay. Biometrics 1988; 44:861–867. [Google Scholar]
- Williams DA. Interval estimation of the median lethal dose. Biometrics 1986; 42:641–645. [PubMed] [Google Scholar]