ABSTRACT
In one-way analysis of variance models, performing simultaneous multiple comparisons of treatment groups with a control group may be of interest. Dunnett's test is used to test such differences and assumes equal variances of the response variable for each group. This assumption is not always met even after transformation. A parametric bootstrap (PB) method is developed here for comparing multiple treatment group means against the control group with unequal variances and unbalanced data. In simulation studies, the proposed method outperformed Dunnett's test in controlling the type I error under various settings, particularly when data have heteroscedastic variance and unbalanced design. Simulations show that power is often lower for the PB method than for Dunnett's test under equal variance, balanced data, or smaller sample size, but similar to or higher than for Dunnett's test with unequal variance, unbalanced data and larger sample size. The method is applied to a dataset concerning isotope levels found in elephant tusks from various geographical areas. These data have very unbalanced group sizes and unequal variances. This example illustrates that the PB method is easy to implement and avoids the need for transforming data to meet the equal variance assumption, simplifying interpretation of results.
KEYWORDS: Dunnett's test, simulations, ANOVA, HeteANOVA, unbalanced data
MATHEMATICAL SUBJECT CLASSIFICATION: ANOVA
1. Introduction
Consider a one-way analysis of variance (ANOVA) problem with $a$ treatment groups, where the first group is a control group. Let $Y_{ij}$ be the value of the response variable in the $j$th trial for the $i$th factor level, $\mu_i$ the mean for the $i$th factor level, $j = 1, \ldots, n_i$. The one-way ANOVA model is as follows:

$$Y_{ij} = \mu_i + \varepsilon_{ij} \tag{1}$$

where $\varepsilon_{ij} \sim N(0, \sigma_i^2)$ independently, $i = 1, \ldots, a$, and $j = 1, \ldots, n_i$.
One may wish to perform multiple comparisons of the treatment groups with the control group, rather than performing all pairwise comparisons. In this case, procedures such as the Tukey-Kramer test [16], which examine all pairwise comparisons among the treatment groups, can result in confidence intervals that are wider than necessary [7].
1.1. Dunnett's test
Under the equal variance assumption, Dunnett's test [5,7,19], which uses a procedure similar to Tukey's, can be used for comparing treatment groups only against the control. It is frequently used in clinical or pharmacological studies [4,18,24,25]. Dunnett's test compares $a-1$ pairs (each group with the control group), instead of the $a(a-1)/2$ pairs involved in all pairwise comparisons. Dunnett's test uses the statistic
$$T_i = \frac{\bar{Y}_i - \bar{Y}_1}{S_p\sqrt{1/n_i + 1/n_1}}, \qquad i = 2, \ldots, a, \tag{2}$$

where $\bar{Y}_i$ is the sample mean for group $i$, $i = 1, \ldots, a$, with $\mu_1$ the parameter associated with the control group, and $S_p^2$ is the pooled variance estimate $S_p^2 = \sum_{i=1}^{a}\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2/(N - a)$, which can be written as $S_p^2 = \sum_{i=1}^{a}(n_i - 1)S_i^2/(N - a)$ with $N = \sum_{i=1}^{a} n_i$.
When each pair is considered alone, the test statistic in (2) has a $t$ distribution. However, when comparing all treatment groups against the control, the $a-1$ pairs form a family, and the critical value for constructing confidence intervals and determining statistical significance must be higher in order to keep the type I error rate for the entire family less than $\alpha$ [7,19]. Dunnett [7] used the multivariate $t$ distribution to obtain critical values $d_i$ that satisfy $\Pr(|T_i| < d_i,\ i = 2, \ldots, a) = 1 - \alpha$, where the $T_i$ are the test statistics in (2) for each treatment group. Dunnett [7] points out that there are infinitely many solutions for the $d_i$, but set $d_2 = \cdots = d_a = d$. More details of this derivation can be found in Dunnett [7].
As described in Christensen [5] and Miller Jr. [19], Dunnett's test is based on knowing the distribution of the maximum over $i$ of the test statistic in Equation (2) when the null hypothesis of equality of all means is true. That is, (2) will be less than a critical value $d$ for every pair exactly when the maximum of (2) over all $a-1$ pairs is less than $d$. When variances are equal, so the pooled variance estimate is appropriate for all treatment groups, and data are balanced, (2) may be compared to the critical values found in the tables by Dunnett [7,8]. These critical values for use in various practical scenarios are also available in software, e.g. the DunnettTest function in the R package DescTools [22]. When the assumption of equal variance is violated, we can modify the test statistic to include the separate variance estimates as shown in the next section. However, the modified test statistic can no longer be compared to, or used to construct confidence intervals with, critical values obtained from a known distribution.
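For concreteness, the statistic in (2) is simple to compute once the pooled variance is formed. The following Python sketch is illustrative only (the paper's own computations use R, and the helper name `dunnett_statistics` is ours); it returns one statistic per treatment-versus-control pair, while the multivariate-$t$ critical value must still come from tables or software such as DescTools.

```python
import numpy as np

def dunnett_statistics(groups):
    """Dunnett-type statistics T_i = (Ybar_i - Ybar_1) / (S_p * sqrt(1/n_i + 1/n_1)).

    groups[0] is the control; S_p^2 = sum_i (n_i - 1) S_i^2 / (N - a)."""
    means = [np.mean(g) for g in groups]
    ns = [len(g) for g in groups]
    N, a = sum(ns), len(groups)
    # pooled variance estimate, weighting each group's variance by n_i - 1
    sp2 = sum((n - 1) * np.var(g, ddof=1) for g, n in zip(groups, ns)) / (N - a)
    return [(means[i] - means[0]) / np.sqrt(sp2 * (1.0 / ns[i] + 1.0 / ns[0]))
            for i in range(1, a)]
```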
When the assumption of equal variance is violated and data are unbalanced (hereafter called the HeteANOVA problem), the results of Dunnett's test are questionable. Many alternative methods have been developed for the classical F-test and multiple comparisons for HeteANOVA problems [17,26,29]. Among them, the parametric bootstrap (PB) test [17,26] has been shown to be one of the best for testing equality of factor level means [27]. Recently, [29,30] proposed PB multiple comparison tests for one-way and two-way ANOVA, which are shown to be competitive.
1.2. Parametric bootstrap
The following is largely paraphrased from Efron and Tibshirani [11]; additional details can be found there and in many other references such as Efron [9], Shao and Tu [21], Bickel and Freedman [2], and Hall [15]. The bootstrap is a method for assigning measures of accuracy to statistical estimates. Consider a random sample of size $n$, $\mathbf{x} = (x_1, \ldots, x_n)$, from a probability distribution $F$. We would like to estimate some quantity of interest of $F$, for example, the mean or median, from this random sample. The empirical distribution function $\hat{F}$ is the discrete distribution that puts probability $1/n$ on each value $x_i$, $i = 1, \ldots, n$. One way to estimate the desired quantity from $F$, say $\theta = t(F)$, is to estimate the corresponding quantity from $\hat{F}$ (the plug-in principle). Thus, the plug-in estimate of $\theta$ is $\hat{\theta} = t(\hat{F})$. The bootstrap method is an application of the plug-in principle. The accuracy of $\hat{\theta}$ can be estimated by the bootstrap using the plug-in principle to estimate the standard error (SE) of a statistic. For example, the SE of the mean $\bar{x}$ is $\sigma/\sqrt{n}$, where $\sigma$ is the true standard deviation and is typically unknown. The plug-in estimate of $\sigma$ is $\hat{\sigma} = [\sum_{i}(x_i - \bar{x})^2/n]^{1/2}$. Then the estimated SE is $\hat{\sigma}/\sqrt{n}$. The bootstrap estimate of $\mathrm{SE}(\hat{\theta})$ is a plug-in estimate that uses the empirical distribution function $\hat{F}$ in place of the unknown distribution $F$. Bootstrap methods employ a bootstrap sample: a random sample of size $n$ drawn from the empirical distribution $\hat{F}$, where the probability $1/n$ is put on each of the observed values as described earlier. The bootstrap sample can be denoted $\mathbf{x}^* = (x_1^*, \ldots, x_n^*)$, where $\mathbf{x}^*$ is a resampled version of $\mathbf{x}$. The bootstrap datapoints $x_1^*, \ldots, x_n^*$ are a random sample of size $n$ drawn with replacement from $x_1, \ldots, x_n$. From a bootstrap dataset $\mathbf{x}^*$, a ‘bootstrap replication’ of $\hat{\theta}$ can be calculated: $\hat{\theta}^* = s(\mathbf{x}^*)$ – applying the same function $s(\cdot)$ to $\mathbf{x}^*$ that was applied to $\mathbf{x}$. By drawing many independent bootstrap samples and evaluating the bootstrap replications of $\hat{\theta}$, the standard error of $\hat{\theta}$ can be estimated by the empirical standard deviation of the replications, $\widehat{\mathrm{SE}}_B = \{\sum_{b=1}^{B}[\hat{\theta}^*(b) - \bar{\theta}^*]^2/(B-1)\}^{1/2}$, where $B$ is the number of replications.
The limit of $\widehat{\mathrm{SE}}_B$ as $B$ approaches infinity is $\mathrm{SE}_{\hat{F}}(\hat{\theta}^*)$, the bootstrap estimate of $\mathrm{SE}(\hat{\theta})$. Additionally, the quantity $\widehat{\mathrm{SE}}_B$ has a variance itself which approaches 0 as both $n$ and $B$ approach infinity. By the Glivenko-Cantelli theorem, the empirical distribution function $\hat{F}$ converges uniformly almost surely to the true distribution function $F$ [12,23].
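The resampling scheme just described takes only a few lines of code. Below is a minimal Python sketch (illustrative data; variable names are ours) that estimates the standard error of the sample median, a statistic whose SE has no simple closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=50)   # observed sample of size n = 50

B = 2000  # number of bootstrap replications
# each replication: resample n points with replacement, recompute the statistic
reps = [np.median(rng.choice(x, size=x.size, replace=True)) for _ in range(B)]
se_boot = np.std(reps, ddof=1)  # empirical SD of the replications estimates SE(median)
```

For a standard normal sample of this size the true SE of the median is roughly $1.25/\sqrt{50} \approx 0.18$, and the bootstrap estimate should land near that value.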
Approximations obtained by random sampling or simulation are called Monte Carlo estimates. The bootstrap distribution can be calculated theoretically or using Monte Carlo approximation [9]. A Monte Carlo algorithm is commonly used to numerically evaluate the bootstrap estimate of the standard error [10]. An advantage of the bootstrap is that it can be applied to statistics other than the mean, for which the SE may not have a known closed form like $\sigma/\sqrt{n}$ for the mean.
The bootstrap estimate and its approximation $\widehat{\mathrm{SE}}_B$ are sometimes referred to as nonparametric bootstrap estimates because they are based on $\hat{F}$, the nonparametric estimate of $F$. Bootstrap sampling can also be performed parametrically. The difference between the nonparametric bootstrap and the parametric bootstrap is that for the parametric bootstrap, the samples are drawn from a parametric estimate of the population, $\hat{F}_{\mathrm{par}}$, rather than the nonparametric estimate $\hat{F}$. For the nonparametric bootstrap setup, $F$ is completely unknown; that is, only the random sample is available. $\hat{F}_{\mathrm{par}}$ is an estimate of $F$ derived from a parametric model; that is, instead of estimating $F$ by the empirical distribution $\hat{F}$, one could assume some distribution, such as normal, for the population. Instead of sampling with replacement from the data, we draw $B$ samples of size $n$ from the parametric estimate of the population $\hat{F}_{\mathrm{par}}$. The parametric bootstrap can be used when some knowledge is available about the form of the underlying distribution (though the parameters of that distribution would be unknown [15]). The parametric bootstrap depends on the parametric model assumption. The nonparametric bootstrap is “model assumption free”, but it is not as efficient as the parametric bootstrap when the parametric model is correct. In general, the performance of the bootstrap relies on how well we can identify and estimate the model [21]. For example, in estimating the unknown parameters, the sample mean and variance are consistent estimators of the population mean and variance [3]. Thus, both the parameter estimates and the adequacy of the parametric model would improve with increasing sample size $n$.
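The contrast with the nonparametric version can be made concrete. In the Python sketch below (illustrative data and names; a normal population model is assumed), each bootstrap sample is drawn from the fitted normal distribution rather than by resampling the data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=40)   # sample from an assumed-normal population

# plug in the sample mean and SD to form the parametric estimate of F
mu_hat, sd_hat = x.mean(), x.std(ddof=1)

B = 2000
# draw each bootstrap sample of size n from N(mu_hat, sd_hat), not from the data
reps = [rng.normal(mu_hat, sd_hat, size=x.size).mean() for _ in range(B)]
se_param = np.std(reps, ddof=1)   # close to the textbook SE, sd_hat / sqrt(n)
```

Because the statistic here is the mean, the parametric bootstrap SE should agree closely with the plug-in formula $\hat{\sigma}/\sqrt{n}$, illustrating that the bootstrap reproduces known answers in cases where they exist.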
Parametric bootstrap has been used for one-way and two-way ANOVA [17,26] as well as multiple comparison procedures [29,30]. While the test statistics in these procedures do not follow known distributions, they are functions of parameter estimates that do follow known distributions (normal and chi-squared as discussed in the next section). Thus, a distribution for the test statistic can be obtained by computer simulation.
Inspired by Dunnett's test and PB tests, in this research, we develop a PB test that is a special case of the PB multiple comparison procedures described in [29,30]. The PB test presented here is analogous to Dunnett's test, which performs simultaneous multiple comparisons of the treatment groups with the control for the HeteANOVA problem. This research is organized as follows: Section 2 proposes the methodology and presents the algorithm; Section 3 performs simulation studies to evaluate type I error and power; Section 4 gives a real example; and Section 5 gives conclusions and discussion of the research.
2. Proposed PB test and algorithm
In this section, we develop a PB method for multiple comparisons of treatment groups with the control group for a HeteANOVA problem and present an algorithm to implement the test.
2.1. Proposed PB test
Following the procedure from previous papers [17,29], consider the test statistic in Equation (2). We modify this to include the different group variances:
$$T = \max_{2 \le i \le a} \frac{|\bar{Y}_i - \bar{Y}_1|}{\sqrt{S_i^2/n_i + S_1^2/n_1}} \tag{3}$$
where $\bar{Y}_i = \sum_{j=1}^{n_i} Y_{ij}/n_i$ and $S_i^2 = \sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2/(n_i - 1)$. As noted previously, this test statistic no longer follows a known distribution for comparison; the aim of the PB method is to simulate this distribution. The test we consider is location invariant, so we assume without loss of generality that the mean of $Y_{ij}$ is zero for all $i$. Then $\bar{Y}_i \sim N(0, \sigma_i^2/n_i)$ and the sample variance satisfies $(n_i - 1)S_i^2/\sigma_i^2 \sim \chi^2_{n_i-1}$ [3]. These can be approximately simulated by pivot variables $\bar{Y}_{Bi} \sim N(0, S_i^2/n_i)$, or equivalently, $\bar{Y}_{Bi} = Z S_i/\sqrt{n_i}$, where $Z$ is a $N(0, 1)$ random variable, and $S_{Bi}^2 \sim S_i^2\,\chi^2_{n_i-1}/(n_i - 1)$.
We can replace $\bar{Y}_i$ and $S_i^2$ in Equation (3) with $\bar{Y}_{Bi}$ and $S_{Bi}^2$ to obtain a PB pivot variable:
$$T_B = \max_{2 \le i \le a} \frac{|\bar{Y}_{Bi} - \bar{Y}_{B1}|}{\sqrt{S_{Bi}^2/n_i + S_{B1}^2/n_1}} \tag{4}$$
As discussed in Section 1.1, Dunnett's test uses critical values from the distribution of the maximum over i of the test statistic in Equation (2) when the null hypothesis of equality of all means is true and treatment groups have equal variances. For the PB method, we simulate a distribution for the test statistic (3), using (4). With this simulated distribution, we can estimate the p-value or obtain a critical value which can be used to construct confidence intervals. The procedure is shown in the following algorithm, and example code for this algorithm is shown in the appendix.
2.2. Parametric bootstrap algorithm for comparing multiple treatment groups with control
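The paper's own example code is in R and appears in its appendix. As an illustrative sketch only (the function name `pb_dunnett` and its interface are ours), the procedure described in Section 2.1 might look as follows in Python: compute the observed maximum statistic (3), draw L bootstrap mean and variance vectors from the pivot distributions, evaluate (4) for each draw, and compare.

```python
import numpy as np

def pb_dunnett(groups, L=5000, alpha=0.05, seed=0):
    """Parametric bootstrap analogue of Dunnett's test under unequal variances.

    groups[0] is the control.  Returns the observed max statistic, the
    simulated critical value, and a bootstrap p-value."""
    rng = np.random.default_rng(seed)
    ns = np.array([len(g) for g in groups])
    means = np.array([np.mean(g) for g in groups])
    vars_ = np.array([np.var(g, ddof=1) for g in groups])
    a = len(groups)

    # observed statistic (3): max_i |Ybar_i - Ybar_1| / sqrt(S_i^2/n_i + S_1^2/n_1)
    t_obs = np.max(np.abs(means[1:] - means[0]) /
                   np.sqrt(vars_[1:] / ns[1:] + vars_[0] / ns[0]))

    # L draws of the pivot variables:
    #   Ybar_Bi ~ N(0, S_i^2 / n_i),  S_Bi^2 ~ S_i^2 * chisq_{n_i - 1} / (n_i - 1)
    yb = rng.standard_normal((L, a)) * np.sqrt(vars_ / ns)
    s2 = vars_ * rng.chisquare(ns - 1, size=(L, a)) / (ns - 1)

    # pivot statistic (4) for each bootstrap draw
    d = np.max(np.abs(yb[:, 1:] - yb[:, [0]]) /
               np.sqrt(s2[:, 1:] / ns[1:] + s2[:, [0]] / ns[0]), axis=1)

    crit = np.quantile(d, 1 - alpha)   # simulated critical value
    pval = np.mean(d >= t_obs)         # bootstrap p-value
    return t_obs, crit, pval
```

The simulated critical value plays the role of Dunnett's tabled constant and can be used to form simultaneous confidence intervals for the treatment-minus-control differences.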
3. Simulations
3.1. Type I error
To evaluate the performance of the algorithm, we simulated 2500 datasets with equal group means, so that the null hypothesis $H_0\colon \mu_i = \mu_1$, $i = 2, \ldots, a$, is true, and compared the rejection rates of Dunnett's test, using the DunnettTest function in the R package DescTools [22], and the PB method (Algorithm 1) with L = 5000 bootstrap sample mean and variance vectors. We used $a = 6$ treatment groups including the control, with six variance vectors and five sample size vectors. The simulation settings follow from [30], and were chosen to include one variance vector where variances were equal and three sample size vectors that reflect smaller, moderate, and relatively large sizes of balanced data. The other vectors include unequal variance and unbalanced data. In this way, we could compare performance of the traditional Dunnett's test with the proposed PB method under conditions where the equal variance assumption is met against cases where it is not. Previous work has also shown traditional multiple comparison procedures to be robust to this violation when data are balanced but to perform more poorly with unbalanced data [30]. All calculations, simulations and data analysis were performed using R [20].
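Empirical rejection rates can be judged against the Monte Carlo error of the simulation itself. Treating the number of rejections over the 2500 datasets as binomial, an approximate 95% Monte Carlo interval around a nominal rate $\alpha$ is $\alpha \pm 1.96\sqrt{\alpha(1-\alpha)/2500}$; the small helper below (ours, for illustration) computes it:

```python
import math

def mc_error_interval(alpha, n_sim=2500, z=1.96):
    """Approximate 95% Monte Carlo interval for an empirical rejection
    rate based on n_sim simulated datasets when the true rate is alpha."""
    half = z * math.sqrt(alpha * (1 - alpha) / n_sim)
    return alpha - half, alpha + half

lo, hi = mc_error_interval(0.05)   # approximately (0.0415, 0.0585)
```

Empirical rates inside this interval are consistent with the nominal level; rates outside it indicate a real departure rather than simulation noise.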
Results are shown in Table 1. With the equal variance assumption met, both Dunnett's test and the PB test give acceptable results. Under equal variance, the type I error rates for the PB test are lower than for Dunnett's test in the smaller sample size settings, falling outside the 95% Monte Carlo error interval, as they do in the unequal variance cases with small samples, likely because the simulated PB distribution can have a larger variance with smaller sample sizes. As noted in Section 1.2, the accuracy of the parametric model would be expected to be worse with smaller sample sizes. This is not noted for the settings with larger sample sizes. Additionally, Dunnett's test performs satisfactorily in most heteroscedastic cases where data are balanced (though not with unbalanced data). The exception is the variance vector that includes the very small value 0.01. In this case, the type I error rate for Dunnett's test is higher than the nominal level even with balanced data. Such a small variance likely leads to an artificially small pooled variance estimate and thus an artificially large test statistic, so the test rejects more often than the nominal level.
Table 1.
Simulation results for multiple comparisons of treatment group means vs. control.
| Dunnett | PB | $\bar{T}$ | V(T) | Dunnett | PB | $\bar{T}$ | V(T) | |
|---|---|---|---|---|---|---|---|---|
| 0.0516 | 0.0420 | 1.6926 | 0.7251 | 0.1024 | 0.0832 | 1.6792 | 0.6980 | |
| 0.0464 | 0.0360 | 1.5634 | 0.4469 | 0.1012 | 0.0928 | 1.5524 | 0.4632 | |
| 0.0448 | 0.0384 | 1.8204 | 1.5017 | 0.0976 | 0.0752 | 1.7969 | 1.4694 | |
| 0.0584 | 0.0592 | 1.5562 | 0.8392 | 0.1028 | 0.0988 | 1.5284 | 0.8229 | |
| 0.0484 | 0.0492 | 1.4720 | 0.3471 | 0.1000 | 0.0956 | 1.4653 | 0.3562 | |
| Dunnett | PB | $\bar{T}$ | V(T) | Dunnett | PB | $\bar{T}$ | V(T) | |
| 0.0464 | 0.0444 | 1.8568 | 0.8214 | 0.0888 | 0.0884 | 1.8485 | 0.8098 | |
| 0.0420 | 0.0460 | 1.6676 | 0.4658 | 0.0848 | 0.0996 | 1.6641 | 0.4863 | |
| 0.0100 | 0.0364 | 1.9029 | 1.2555 | 0.0324 | 0.0736 | 1.8837 | 1.2162 | |
| 0.0016 | 0.0540 | 1.6289 | 0.6557 | 0.0076 | 0.0900 | 1.6047 | 0.6398 | |
| 0.0404 | 0.0520 | 1.5575 | 0.3568 | 0.0748 | 0.0964 | 1.5474 | 0.3545 | |
| Dunnett | PB | $\bar{T}$ | V(T) | Dunnett | PB | $\bar{T}$ | V(T) | |
| 0.0704 | 0.0404 | 1.6311 | 0.7306 | 0.1260 | 0.0792 | 1.6167 | 0.7047 | |
| 0.0752 | 0.0364 | 1.5121 | 0.4572 | 0.1348 | 0.0916 | 1.4960 | 0.4689 | |
| 0.1000 | 0.0412 | 1.8092 | 1.7132 | 0.1720 | 0.0796 | 1.7813 | 1.7004 | |
| 0.1408 | 0.0592 | 1.5376 | 0.9661 | 0.2080 | 0.1008 | 1.5086 | 0.9416 | |
| 0.0712 | 0.0432 | 1.4190 | 0.3524 | 0.1276 | 0.0944 | 1.4127 | 0.3631 | |
| Dunnett | PB | $\bar{T}$ | V(T) | Dunnett | PB | $\bar{T}$ | V(T) | |
| 0.0456 | 0.0472 | 1.9028 | 0.8569 | 0.0768 | 0.0948 | 1.8983 | 0.8479 | |
| 0.0424 | 0.0424 | 1.6984 | 0.4741 | 0.0716 | 0.0980 | 1.6974 | 0.4939 | |
| 0.0104 | 0.0376 | 1.9589 | 1.3092 | 0.0268 | 0.0764 | 1.9551 | 1.3084 | |
| 0.0004 | 0.0496 | 1.6583 | 0.5980 | 0.0024 | 0.0904 | 1.6361 | 0.5890 | |
| 0.0368 | 0.0516 | 1.5824 | 0.3614 | 0.0640 | 0.0948 | 1.5742 | 0.3553 | |
| Dunnett | PB | $\bar{T}$ | V(T) | Dunnett | PB | $\bar{T}$ | V(T) | |
| 0.0348 | 0.0436 | 1.8183 | 0.7893 | 0.0764 | 0.0904 | 1.8100 | 0.7661 | |
| 0.0320 | 0.0412 | 1.6475 | 0.4501 | 0.0632 | 0.0936 | 1.6414 | 0.4753 | |
| 0.0184 | 0.0372 | 1.9176 | 1.4411 | 0.0504 | 0.0756 | 1.9142 | 1.4254 | |
| 0.0096 | 0.0556 | 1.6325 | 0.6940 | 0.0324 | 0.0944 | 1.6106 | 0.6925 | |
| 0.0316 | 0.0512 | 1.5477 | 0.3497 | 0.0668 | 0.0960 | 1.5402 | 0.3507 | |
| Dunnett | PB | $\bar{T}$ | V(T) | Dunnett | PB | $\bar{T}$ | V(T) | |
| 0.0996 | 0.0540 | 2.0542 | 1.1788 | 0.1412 | 0.1068 | 2.0529 | 1.1882 | |
| 0.1000 | 0.0416 | 1.7491 | 0.5207 | 0.1320 | 0.1000 | 1.7509 | 0.5460 | |
| 0.0308 | 0.0412 | 2.1341 | 1.7580 | 0.0544 | 0.0984 | 2.1664 | 1.9283 | |
| 0.0016 | 0.0460 | 1.7351 | 0.5764 | 0.0032 | 0.0880 | 1.7170 | 0.5785 | |
| 0.0964 | 0.0492 | 1.6120 | 0.3796 | 0.1272 | 0.0892 | 1.6049 | 0.3629 |
Notes: Numbers in the table are empirical type I error rates. We consider five different sample sizes and six different variance vectors as shown in Section 3, at the two α levels 0.05 (left columns) and 0.10 (right columns). $\bar{T}$ refers to the mean of test statistic (3) over all simulated datasets and V(T) to its variance.
The PB test outperforms Dunnett's test, with type I error rates close to the nominal level for all simulation settings, including those with unequal variance and unbalanced data. In all heteroscedastic cases except one, Dunnett's test is too conservative (rejecting less than the nominal level) when the data are unbalanced. In these cases, the smaller variances in the simulations belong to groups with smaller sample sizes, and the larger variances to groups with larger sample sizes. For these settings, the pooled variance is artificially large, leading to a test statistic that is artificially small. The opposite is true for the exceptional setting, which assigns smaller variances to larger group sizes, so the pooled variance estimate is too small and the test statistic too large.
In Dunnett's test statistic, the pooled variance estimate appears in the denominator. Each group's sample variance is weighted by its sample size (minus 1) in the pooled variance estimate. Thus, if a group with a relatively large sample size has a relatively large variance (relative to the other groups), the pooled variance estimate will tend to be large and Dunnett's test statistic will tend to be small, making Dunnett's test too conservative. This is observed in the type I error simulation results for the unbalanced data settings under all unequal variance settings except one, with type I error rates well below the nominal level for Dunnett's test. For the exceptional variance vector, where the group with the relatively large sample size has a relatively small variance, the pooled variance estimate will tend to be small and the test statistic large, leading to inflated type I error rates for Dunnett's test. This is observed in the simulations for the unbalanced data settings with that variance vector.
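A small numeric illustration (hypothetical group sizes and variances, not the simulation settings themselves) shows how this weighting works:

```python
import numpy as np

def pooled_var(ns, s2s):
    """Pooled variance estimate: sum_i (n_i - 1) S_i^2 / (N - a)."""
    ns, s2s = np.asarray(ns), np.asarray(s2s)
    return np.sum((ns - 1) * s2s) / (ns.sum() - len(ns))

# large group carries the large variance: the pooled estimate is pulled far
# above the small group's variance of 1.0, shrinking that group's statistic
big_group_big_var = pooled_var([5, 50], [1.0, 9.0])    # (4*1 + 49*9)/53 ~ 8.40

# large group carries the small variance: the pooled estimate is pulled far
# below the small group's variance of 9.0, inflating that group's statistic
big_group_small_var = pooled_var([50, 5], [1.0, 9.0])  # (49*1 + 4*9)/53 ~ 1.60
```

With balanced sizes the two variances would be weighted equally, which is why the distortion is most severe for unbalanced data.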
When data are balanced, Dunnett's test is fairly robust in the simulations to this issue of the pooled variance estimate being too large or too small. The simulations illustrate this in all settings with balanced data except those paired with the variance vector that has a much larger difference in group variances than the other unequal variance settings. In that case, the type I error rate is far from the nominal level even with balanced data. With unbalanced data and that variance vector, the small variance belongs to the smallest sample size groups and is ‘outweighed’ in the pooled variance estimate by the largest variance belonging to the largest sample size groups, which again shrinks the test statistic, making Dunnett's test too conservative.
3.2. Evaluation of power
To evaluate the performance of the algorithm in terms of power, we simulated 2500 datasets for each combination of settings with unequal group means (two mean vectors), so that the null hypothesis is not true, and compared the rejection rates of Dunnett's test, using the DunnettTest function in the R package DescTools [22], and the PB method (Algorithm 1) with L = 5000 bootstrap sample mean and variance vectors. We used $a = 6$ treatment groups including the control, with the same sample size and variance vectors as in the simulations in Section 3.1. The simulation settings follow from [26,30].
Results are shown in Table 2. With equal variance and balanced data, power was similar between the two methods or somewhat lower for the PB version. With unequal variance and unbalanced data, power was often lower for the PB version with the smaller sample size vector but somewhat higher for the PB version with the larger sample size vector. As expected, the power for both tests is generally higher with the mean vector that has a larger difference between groups, and with the sample size vectors that give larger sample sizes to most groups. Dunnett's test generally had higher power with the variance vector containing the very small variance; however, it also had an inflated type I error rate for this setting, particularly with unbalanced data, so the power cannot be directly compared. As noted with the results from the type I error simulations, this variance vector assigns larger group variances to smaller sample sizes for these simulation settings, so the pooled variance estimate tends to be smaller and the test statistic larger, leading to Dunnett's test rejecting more often.
Table 2.
Simulation results: mult. comparisons of treatment group means vs. control – power.
| Dunnett | PB | Dunnett | PB | ||
|---|---|---|---|---|---|
| 0.1460 | 0.0948 | 0.3012 | 0.1904 | ||
| 0.2660 | 0.2140 | 0.5720 | 0.4804 | ||
| 0.1192 | 0.0688 | 0.2288 | 0.1144 | ||
| 0.1944 | 0.1624 | 0.4304 | 0.3076 | ||
| 0.8256 | 0.8140 | 0.9976 | 0.9980 | ||
| Dunnett | PB | Dunnett | PB | ||
| 0.4288 | 0.2980 | 0.7912 | 0.6236 | ||
| 0.7676 | 0.6936 | 0.9888 | 0.9800 | ||
| 0.2240 | 0.1968 | 0.5928 | 0.4600 | ||
| 0.4208 | 0.7008 | 0.9236 | 0.9656 | ||
| 1.0000 | 1.0000 | 1.0000 | 1.0000 | ||
| Dunnett | PB | Dunnett | PB | ||
| 0.2016 | 0.1192 | 0.4108 | 0.2604 | ||
| 0.3760 | 0.2872 | 0.7396 | 0.6312 | ||
| 0.2144 | 0.0820 | 0.3868 | 0.1572 | ||
| 0.3964 | 0.1740 | 0.6940 | 0.3292 | ||
| 0.9384 | 0.9232 | 1.0000 | 1.0000 | ||
| Dunnett | PB | Dunnett | PB | ||
| 0.3304 | 0.1928 | 0.6208 | 0.3992 | ||
| 0.5772 | 0.4308 | 0.9264 | 0.8160 | ||
| 0.1612 | 0.1264 | 0.4164 | 0.2744 | ||
| 0.2148 | 0.5168 | 0.7296 | 0.8940 | ||
| 0.9932 | 0.9916 | 1.0000 | 1.0000 | ||
| Dunnett | PB | Dunnett | PB | ||
| 0.2000 | 0.1516 | 0.4348 | 0.3136 | ||
| 0.3844 | 0.3492 | 0.7908 | 0.7188 | ||
| 0.1196 | 0.0880 | 0.3084 | 0.1852 | ||
| 0.1876 | 0.2980 | 0.5716 | 0.6296 | ||
| 0.9572 | 0.9640 | 1.0000 | 1.0000 | ||
| Dunnett | PB | Dunnett | PB | ||
| 0.5204 | 0.4060 | 0.8132 | 0.7604 | ||
| 0.7756 | 0.8620 | 0.9812 | 0.9972 | ||
| 0.3340 | 0.2944 | 0.6468 | 0.6624 | ||
| 0.4564 | 0.9500 | 0.9232 | 1.0000 | ||
| 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Power was often lower for the PB test, particularly in the settings with small sample sizes, even when comparison is restricted to settings where the type I error rates are similar between the two tests. However, with the mean vector that includes larger differences between means and the larger sample size vectors, both tests demonstrated acceptable power in most of the variance settings. For the smaller mean vector, in one setting Dunnett's test had higher power than the PB test, but it also had a type I error rate well above the nominal level for that setting. Similarly, in another setting the PB test had higher power while keeping its type I error rate near the nominal level, whereas the type I error rate for Dunnett's test was well below the nominal level for that setting.
4. Application
An example of the method is shown by applying it to the data found in the supplementary material of Ziegler et al. [31]. The data concern isotope levels in elephant tusks from different geographical areas. Summary statistics are shown in Table 3. The sample sizes and variances are somewhat similar to one of the simulation settings for evaluating type I error, although in that setting the largest group had the largest variance, which is not true for this example dataset. The groups are reordered in Table 3 to ease this comparison. We considered Asia to be the ‘control’ group as the other regions were in Africa. While Ziegler et al. [31] examined all pairwise comparisons of the different regions using the Games-Howell post-hoc test [14], another possible question of interest could be whether any of the African regions differ from Asia (rather than additionally comparing all of the African regions with each other). The data are very unbalanced, and we can see from Table 3 and Figure 1 that the variances appear unequal for the N isotope ratio (nitrogen stable isotope ratios expressed in δ units). Ziegler et al. [31] looked at several other isotopes and performed additional classification procedures, but we limited the analysis in this study to one isotope simply to illustrate the method.
Table 3.
Summary statistics, N, elephant tusk data.
| Region | $n$ | Mean | Variance |
|---|---|---|---|
| Asia | 8 | 8.49 | 1.62 |
| East Africa | 37 | 9.78 | 6.15 |
| West Africa | 69 | 5.85 | 1.40 |
| Central Africa | 120 | 9.37 | 3.71 |
| Southern Africa | 261 | 8.93 | 2.78 |
Figure 1.
δ15N by region, elephant tusk data.
We fit the one-way ANOVA model and then checked the assumptions of normality and constant variance. By the Shapiro-Wilk test for normality using the shapiro.test function in R (W = 0.985, p-value near 0), and examination of a normal plot of the residuals, the normality assumption was violated. The fitted-residual plot from the ANOVA model indicated violation of the equal variance assumption. We also performed the Breusch-Pagan (BP) test using the function bptest from the R package lmtest [28] and Levene's test using the leveneTest function from the R package car [13]; the p-values from both formal tests for equal variance were near 0.
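For readers working outside R, the same style of checks can be sketched with SciPy on synthetic stand-in data (illustrative only; SciPy provides Shapiro-Wilk and Levene/Brown-Forsythe tests, while a Breusch-Pagan test would additionally need statsmodels):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical groups with unequal spreads and unbalanced sizes,
# standing in for the tusk data (not the actual dataset)
sds = [1.3, 2.5, 1.2, 1.9, 1.7]
ns = [8, 37, 69, 120, 261]
groups = [rng.normal(8.5, s, n) for s, n in zip(sds, ns)]

# one-way ANOVA residuals: observations centered by their group means
resid = np.concatenate([g - g.mean() for g in groups])

w_stat, p_norm = stats.shapiro(resid)                     # Shapiro-Wilk normality test
lev_stat, p_lev = stats.levene(*groups, center="median")  # Brown-Forsythe variant
```

Small p-values from the variance test would indicate the HeteANOVA situation for which the PB method is designed.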
A log transformation was attempted to satisfy assumptions. The normality assumption was then satisfied by appearance of the normal plot and formally by the Shapiro-Wilk test, with W = 0.996 and p-value = 0.333. The fitted-residual plot was somewhat improved after transformation but still appeared to violate the equal variance assumption. The p-values for the BP test and Levene's test were 0.031 and 0.013, respectively, for the transformed data. The fitted-residual plots before and after transformation are shown in the appendix.
We performed Dunnett's test, using the previously mentioned function in R, on the untransformed and the log transformed data; both found a significant difference in N isotope levels (nitrogen stable isotope ratios expressed in δ units) between Asia and West Africa, but not the other African regions studied. We then performed the analogous PB test, which came to the same conclusion. The differences between means, confidence intervals and p-values are shown in Tables 4 and 5 for Dunnett's test and Table 6 for the PB test. An illustration of the method is depicted in Figure 2, as a histogram of the PB simulated null distribution with its critical value and the test statistic shown for comparing Asia to West Africa. We note that Ziegler et al. [31] also found a significant difference between Asia and East and Central Africa for this isotope (see Table A4 in their supplementary material). The data they report in their results (see Table 1 of [31]) contained 507 observations including 20 from Asia, while the data we used from their supplementary material contained 495 observations, with only 8 observations from Asia. Additionally, in the supplementary material data, Rwanda appears to be classified as part of Central Africa, but in their Table 1, it is classified as East Africa. So, it is possible that the difference between our findings and those of Ziegler et al. [31] could be due to these differences in sample sizes, with the missing observations coming from Asia. Finally, Ziegler et al. [31] state that, with some exceptions, single isotope markers alone are of little usefulness for forensic purposes, so we emphasize that findings from this study for one particular isotope would be best combined with other results for practical application. They also state several biological and environmental factors to consider when interpreting findings. However, in the interest of brevity, we chose to illustrate the proposed method with only one of the isotope ratios.
Table 4.
Results from Dunnett's test, elephant data.
| Diff | Lower CI | Upper CI | p-value | |
|---|---|---|---|---|
| Central Africa-Asia | 0.89 | −0.53 | 2.31 | 0.27 |
| East Africa-Asia | 1.29 | −0.23 | 2.80 | 0.11 |
| Southern Africa-Asia | 0.44 | −0.96 | 1.83 | 0.69 |
| West Africa-Asia | −2.64 | −4.09 | −1.19 | 0.00 |
Table 5.
Results from Dunnett's test, log elephant data.
| Diff | Lower CI | Upper CI | p-value | |
|---|---|---|---|---|
| Central Africa-Asia | 0.09 | −0.07 | 0.25 | 0.35 |
| East Africa-Asia | 0.12 | −0.05 | 0.29 | 0.21 |
| Southern Africa-Asia | 0.04 | −0.12 | 0.20 | 0.77 |
| West Africa-Asia | −0.38 | −0.54 | −0.22 | 0.00 |
Table 6.
Results from PB test, elephant data.
| Diff | Lower CI | Upper CI | p-value | |
|---|---|---|---|---|
| Central Africa-Asia | 0.89 | −0.42 | 2.19 | 0.19 |
| East Africa-Asia | 1.29 | −0.35 | 2.93 | 0.12 |
| Southern Africa-Asia | 0.44 | −0.81 | 1.69 | 0.61 |
| West Africa-Asia | −2.64 | −3.91 | −1.36 | 0.00 |
Figure 2.
PB distribution, elephant tusk data.
In this case, while the log transformation corrected the violation of the normality assumption, it could not completely correct the violation of the equal variance assumption, though the fitted-residual plot was somewhat improved. Both methods came to the same conclusion for these data. However, Dunnett's test requires the equal variance assumption, so conclusions based on it can be questionable when this assumption is violated. Dunnett's test appears from the simulations to be robust to this violation in most cases when data are balanced, but not when data are unbalanced. In this example dataset, the groups with the largest and smallest variances have relatively moderate sample sizes, and the group with the largest sample size has a relatively moderate variance, so the test statistic for Dunnett's test was not made excessively large or small, and it came to the same conclusion as the PB test. However, the PB test does not require the equal variance assumption, so violating this assumption does not call its results into question. The PB test also avoids the need for transformation, which can make interpretation of results more difficult.
Additionally, for the difference between Asia and West Africa, the PB method produced narrower confidence intervals. In this case, the mean squared error from the ANOVA model, which would be used as the pooled variance estimate in the traditional Dunnett's test, was 3.046, which is much larger than the variances for Asia and West Africa in particular, so it produces a test statistic that is smaller than necessary (and thus less likely to reject the null hypothesis of no difference between these two groups). This is somewhat similar to one of the simulation settings for evaluating type I error, although in that setting the largest group had the largest variance; in the elephant dataset, the group with the largest sample size had the third largest variance. Still, in that simulation setting, Dunnett's test was too conservative because the pooled variance estimate was too large for some groups, which is consistent with the confidence intervals for Dunnett's test being wider than those of the PB test for the Asia and West Africa comparison.
While the PB method uses the normality assumption, it uses group means in its calculations, which should be approximately normal regardless of the distribution of the individual observations, at least for large samples, by the Central Limit Theorem [3], so it is plausible that the PB test could also be robust to violations of the normality assumption.
5. Conclusions and discussion
In this research, we examined Dunnett's test from a parametric bootstrap perspective and proposed a PB test for comparing treatment groups with the control. Simulation results show that both Dunnett's test and the PB test give acceptable results under the equal variance assumption. Additionally, when data are balanced, Dunnett's test performs satisfactorily in most heteroscedastic cases. However, for heteANOVA problems where the equal variance assumption is violated and data are unbalanced, Dunnett's test tends to have type I error rates that differ from nominal levels, while the proposed PB method's type I error rates were near nominal levels. From the example, we see that the classical approach of transforming the data to address unequal variance is not guaranteed to succeed, and interpreting the results after transformation can be difficult. The proposed PB test is robust to violations of the equal variance and balanced design assumptions, and it is easy to implement.
While Dunnett's test performed satisfactorily in most balanced data cases in simulations, its rejection rate can be much higher or lower than the nominal level for the heteANOVA problem. One reason is that, depending on the sample sizes, if one group's variance is much smaller than the others', the pooled variance estimate will be too large for that group, leading to an artificially small test statistic. Similarly, if one group's variance is much larger than the others', the pooled variance estimate will be too small for that group, leading to an artificially large test statistic. This issue is amplified when a group with a relatively large sample size also has a variance that is markedly larger or smaller than those of the other groups.
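The distortion just described can be made concrete with a small numerical sketch (the numbers below are hypothetical, not taken from the simulations):

```python
# Illustrative sketch with made-up numbers: how a pooled variance estimate
# distorts per-group standard errors under heteroscedasticity.
import math

# Three groups as (sample size, variance); group 3's variance is much larger.
groups = [(10, 1.0), (10, 1.0), (10, 9.0)]

# The ANOVA mean squared error estimates a weighted average of group variances.
pooled = sum((n - 1) * v for n, v in groups) / sum(n - 1 for n, _ in groups)

for i, (n, v) in enumerate(groups, start=1):
    se_own = math.sqrt(v / n)          # SE from the group's own variance
    se_pooled = math.sqrt(pooled / n)  # SE implied by the pooled estimate
    # ratio > 1: pooled SE too large, test statistic too small (conservative);
    # ratio < 1: pooled SE too small, test statistic too large (anti-conservative)
    print(f"group {i}: pooled/own SE ratio = {se_pooled / se_own:.2f}")
```

Here the pooled estimate (11/3 ≈ 3.67) inflates the standard errors of the two small-variance groups by a factor of about 1.9 while understating that of the large-variance group, exactly the mechanism that produces too-small and too-large Dunnett statistics, respectively.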
One limitation of the proposed PB method is that it requires the normality assumption, so if a particular dataset violates both assumptions, a transformation may still be needed. However, as the example here suggests, the PB method may be robust to violation of the normality assumption as well, particularly for large sample sizes.
Another limitation of the PB test is that, in the simulation results, it tended to have low power for sample sizes and . Because the power of the PB test tended to be low for these smaller sample sizes, a researcher might gain insight by applying both the PB test and Dunnett's test when sample sizes are small. If the PB test does not reject but Dunnett's test does, and the groups with large sample sizes (relative to the other groups) also have relatively small variances, one might suspect that Dunnett's test is making a type I error, as shown in the simulations with , particularly in the unbalanced data cases. If the larger groups have larger variances, as in several of the simulation settings, Dunnett's test is less likely to make a type I error. However, a limitation of this study is that we do not determine precisely how large the differences in variances must be, or how unbalanced the data must be, for Dunnett's test to lose its robustness to the violation. One of the assumptions of Dunnett's test is equal variance, and when this is violated, a researcher should not place high confidence in the conclusions, as with any statistical method whose assumptions are violated.
The simulation results do suggest that Dunnett's test is often robust to violation of the equal variance assumption when data are balanced (but not always, as in the case of , which had larger differences in variances and for which Dunnett's test showed inflated type I error even with balanced data). Therefore, we would not recommend using Dunnett's test alone in the setting of unequal variance and unbalanced data. However, based on these simulation results, sample sizes less than 20 (the largest sample size in ) warrant caution about the power of the PB test. This did not appear to be an issue for sample sizes of 20 or larger: for (the larger of the two unbalanced data vectors), the PB test had higher power in all unequal variance settings except , where the small variance paired with the largest sample size leads to an inflated test statistic, and therefore an inflated type I error rate along with higher power, for Dunnett's test. It is difficult to compare power between two tests whose type I error rates differ, but in this case, when the test statistic is inflated by a larger group having a smaller variance, Dunnett's test tends to reject more often under both the null and the alternative hypothesis, so both its type I error rate and its power tend to be increased.
If a researcher encounters a situation with unequal variances where the PB test does not reject but Dunnett's test does, where (unbalanced) sample sizes are less than 20 and the larger groups do not have relatively smaller variances, they may suspect that the PB test is making a type II error. In this case a reasonable approach would be to transform the data to meet the equal variance assumption and then perform Dunnett's test on the transformed data. However, with sample sizes of 20 or larger, unequal variance, and unbalanced data, the PB method is a viable option and avoids the need for transformation.
Additionally, as described in Section 4.3 of [6], caution may be needed when making practical decisions based on differences in means between groups with unequal variances. For example, if a lower value of a response is desired, such as blood pressure, a treatment group with a smaller mean and smaller variance may have a smaller probability of achieving the desired outcome than a treatment group with a larger mean but also a larger variance. Thus, additional consideration of the practical implications of the problem being studied is warranted.

With the smallest sample size vector, which includes group sizes as small as 3, the PB test did have somewhat lower power than Dunnett's test, and type I error rates were often below the nominal level. This is likely due to larger variance in the bootstrap distribution of the test statistic; as shown in Table 1, the variances of the test statistics for were large. While these were computed from the test statistics of each simulated dataset, the premise of this application of the PB method is that the distribution of the test statistic matches the simulated one. Large variances of the test statistic for settings with small sample size vectors were also noted in most of the simulations under alternative hypotheses for evaluating power (these variances are not shown).

Despite these limitations, the proposed PB test is a viable method for performing multiple comparisons of treatment groups versus a control for the heteANOVA problem, and in the simulation settings of this study it controlled the type I error rate well. Areas for future study could include more complex study designs; a simple step up in complexity would be a two-factor ANOVA model in which comparing a group with both factors at baseline against each combination of factor levels is desirable (rather than all pairwise comparisons of the factor combinations). Recently, an R package was developed that includes the algorithm presented here [1].
Appendix.
R Code: The following code is one way to program the PB test (Algorithm ) to simulate a distribution for the PB test statistic. The output here is the test statistic and the p-value, but could be modified to return other values, for example, or confidence intervals. Processing time was checked for for select scenarios ( and with and with and from the simulation studies). For one simulated dataset with these scenarios, the maximum processing time was 2.559 seconds. With L = 5000, the maximum processing time for one of these datasets was 0.164 seconds.
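The appendix R code and the pbANOVA package [1] give the reference implementation. As a language-neutral illustration of the same parametric bootstrap idea, the sketch below in Python simulates the null distribution of a Dunnett-type max statistic with unpooled variances. The function name `pb_dunnett` and the exact form of the statistic are our reconstruction and may differ in detail from the published algorithm.

```python
# Illustrative Python sketch (ours) of a PB test for treatment-vs-control
# comparisons under unequal variances. Group 1 (index 0) is the control.
import numpy as np

def pb_dunnett(samples, L=10_000, seed=0):
    """Parametric bootstrap test of H0: mu_i = mu_1 for all treatment groups i.

    samples: list of 1-D arrays, samples[0] being the control group.
    Returns the observed max-type statistic and its bootstrap p-value.
    """
    rng = np.random.default_rng(seed)
    n = np.array([len(x) for x in samples])
    xbar = np.array([x.mean() for x in samples])
    s2 = np.array([x.var(ddof=1) for x in samples])

    # Observed statistic: largest standardized treatment-vs-control difference,
    # each contrast using its own (unpooled) variance estimates.
    se = np.sqrt(s2[1:] / n[1:] + s2[0] / n[0])
    t_obs = np.max(np.abs(xbar[1:] - xbar[0]) / se)

    # Bootstrap the null distribution: group means ~ N(0, s_i^2 / n_i) and
    # variances ~ s_i^2 * chi2_{n_i - 1} / (n_i - 1), independently per group.
    xbar_b = rng.normal(0.0, np.sqrt(s2 / n), size=(L, len(n)))
    s2_b = s2 * rng.chisquare(n - 1, size=(L, len(n))) / (n - 1)
    se_b = np.sqrt(s2_b[:, 1:] / n[1:] + s2_b[:, :1] / n[0])
    t_b = np.max(np.abs(xbar_b[:, 1:] - xbar_b[:, :1]) / se_b, axis=1)

    # p-value: proportion of bootstrap statistics at least as extreme.
    return t_obs, float(np.mean(t_b >= t_obs))
```

As a quick check, a treatment group shifted well away from the control yields a p-value near zero, while groups with a common mean yield a large p-value. `L` is the number of bootstrap replicates, as in the timing note above.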
Figure A1.
Fitted-residual plots before and after transformation, elephant data.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Alver S. and Zhang G., pbANOVA: Parametric Bootstrap for ANOVA Models, 2022. Available at https://CRAN.R-project.org/package=pbANOVA. R package version 0.1.0.
- 2. Bickel P.J. and Freedman D.A., Some asymptotic theory for the bootstrap, Ann. Stat. 9 (1981), pp. 1196–1217.
- 3. Casella G. and Berger R.L., Statistical Inference, 2nd ed., Duxbury, Pacific Grove, CA, 2002.
- 4. Cheng W.S.C., Murphy T.L., Smith M.T., Cooksley W.G.E., Halliday J.W., and Powell L.W., Dose-dependent pharmacokinetics of caffeine in humans: Relevance as a test of quantitative liver function, Clin. Pharmacol. Ther. 47 (1990), pp. 516–524. 10.1038/clpt.1990.66.
- 5. Christensen R., Analysis of Variance, Design, and Regression: Applied Statistical Methods, Chapman and Hall/CRC, Boca Raton, FL, 1996.
- 6. Christensen R., Analysis of Variance, Design, and Regression: Linear Modeling for Unbalanced Data, 2nd ed., CRC Press, Boca Raton, FL, 2016.
- 7. Dunnett C.W., A multiple comparison procedure for comparing several treatments with a control, J. Am. Stat. Assoc. 50 (1955), pp. 1096–1121. Available at http://www.jstor.org/stable/2281208.
- 8. Dunnett C.W., New tables for multiple comparisons with a control, Biometrics 20 (1964), pp. 482–491.
- 9. Efron B., Bootstrap methods: Another look at the jackknife, Ann. Stat. 7 (1979), pp. 1–26.
- 10. Efron B. and Tibshirani R., Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy, Stat. Sci. 1 (1986), pp. 54–75.
- 11. Efron B. and Tibshirani R.J., An Introduction to the Bootstrap, Chapman & Hall, New York, NY, 1993.
- 12. Ferguson T.S., A Course in Large Sample Theory, Chapman & Hall, London, UK, 1996.
- 13. Fox J. and Weisberg S., An R Companion to Applied Regression, 3rd ed., Sage, Thousand Oaks, CA, 2019. Available at https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
- 14. Games P.A. and Howell J.F., Pairwise multiple comparison procedures with unequal N's and/or variances: A Monte Carlo study, J. Educ. Stat. 1 (1976), pp. 113–125.
- 15. Hall P., The Bootstrap and Edgeworth Expansion, Springer-Verlag, New York, NY, 1992.
- 16. Kramer C.Y., Extension of multiple range tests to group means with unequal numbers of replications, Biometrics 12 (1956), pp. 307–310.
- 17. Krishnamoorthy K. and Lu F., A parametric bootstrap approach for ANOVA with unequal variances: Fixed and random models, Comput. Stat. Data Anal. 51 (2007), pp. 5731–5742. 10.1016/j.csda.2006.09.039.
- 18. Kutuk Z.B., Ergin E., Cakir F.Y., and Gurgan S., Effects of in-office bleaching agent combined with different desensitizing agents on enamel, J. Appl. Oral. Sci. 27 (2019). Available at http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1678-77572019000100406&nrm=iso.
- 19. Miller Jr R.G., Simultaneous Statistical Inference, 2nd ed., Springer-Verlag, New York; Heidelberg; Berlin, 1981.
- 20. R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2020. Available at https://www.R-project.org/.
- 21. Shao J. and Tu D., The Jackknife and Bootstrap, Springer, New York, NY, 1995.
- 22. Signorell A., Aho K., Alfons A., Anderegg N., Aragon T., Arachchige C., Arppe A., Baddeley A., Barton K., Bolker B., Borchers H.W., Caeiro F., Champely S., Chessel D., Chhay L., Cooper N., Cummins C., Dewey M., Doran H.C., Dray S., Dupont C., Eddelbuettel D., Ekstrom C., Elff M., Enos J., Farebrother R.W., Fox J., Francois R., Friendly M., Galili T., Gamer M., Gastwirth J.L., Gegzna V., Gel Y.R., Graber S., Gross J., Grothendieck G., Harrell Jr F.E., Heiberger R., Hoehle M., Hoffmann C.W., Hojsgaard S., Hothorn T., Huerzeler M., Hui W.W., Hurd P., Hyndman R.J., Jackson C., Kohl M., Korpela M., Kuhn M., Labes D., Leisch F., Lemon J., Li D., Maechler M., Magnusson A., Mainwaring B., Malter D., Marsaglia G., Marsaglia J., Matei A., Meyer D., Miao W., Millo G., Min Y., Mitchell D., Mueller F., Naepflin M., Navarro D., Nilsson H., Nordhausen K., Ogle D., Ooi H., Parsons N., Pavoine S., Plate T., Prendergast L., Rapold R., Revelle W., Rinker T., Ripley B.D., Rodriguez C., Russell N., Sabbe N., Scherer R., Seshan V.E., Smithson M., Snow G., Soetaert K., Stahel W.A., Stephenson A., Stevenson M., Stubner R., Templ M., Lang D.T., Therneau T., Tille Y., Torgo L., Trapletti A., Ulrich J., Ushey K., VanDerWal J., Venables B., Verzani J., Iglesias P.J.V., Warnes G.R., Wellek S., Wickham H., Wilcox R.R., Wolf P., Wollschlaeger D., Wood J., Wu Y., Yee T., and Zeileis A., DescTools: Tools for Descriptive Statistics, 2020. Available at https://cran.r-project.org/package=DescTools. R package version 0.99.38.
- 23. Sprenger J., Science without (parametric) models: The case of bootstrap resampling, Synthese 180 (2011), pp. 65–76.
- 24. Strojek K., Yoon K.H., Hruba V., Elze M., Langkilde A.M., and Parikh S., Effect of dapagliflozin in patients with type 2 diabetes who have inadequate glycaemic control with glimepiride: A randomized, 24-week, double-blind, placebo-controlled trial, Diabetes Obes Metab 13 (2011), pp. 928–938. 10.1111/j.1463-1326.2011.01434.x.
- 25. Tallarida R. and Murray R., Dunnett's test (comparison with a control), in Manual of Pharmacologic Calculations, Springer, New York, NY, 1987.
- 26. Xu L.-W., Yang F.-Q., Abula A., and Qin S., A parametric bootstrap approach for two-way ANOVA in presence of possible interactions with unequal variances, J. Multivar. Anal. 115 (2013), pp. 172–180.
- 27. Yigit E. and Gökpınar F., A simulation study on tests for one-way ANOVA under the unequal variance assumption, Commun. Fac. Sci. Univ. Ank. Ser. 59 (2010), pp. 15–34.
- 28. Zeileis A. and Hothorn T., Diagnostic checking in regression relationships, R News 2 (2002), pp. 7–10. Available at https://CRAN.R-project.org/doc/Rnews/.
- 29. Zhang G., A parametric bootstrap approach for one-way ANOVA under unequal variances with unbalanced data, Commun. Stat. Simul. Comput. 44 (2015), pp. 827–832. 10.1080/03610918.2013.794288.
- 30. Zhang G., Simultaneous confidence intervals for pairwise multiple comparisons in a two-way unbalanced design with unequal variances, J. Stat. Comput. Simul. 85 (2015), pp. 2727–2735. 10.1080/00949655.2014.935735.
- 31. Ziegler S., Merker S., Streit B., Boner M., and Jacob D.E., Towards understanding isotope variability in elephant ivory to establish isotopic profiling and source-area determination, Biol. Conserv. 197 (2016), pp. 154–163.



