Journal of Applied Statistics. 2023 Aug 9;51(10):1861–1877. doi: 10.1080/02664763.2023.2245179

Multiple comparisons of treatment against control under unequal variances using parametric bootstrap

Sarah Alver and Guoyi Zhang

ABSTRACT

In one-way analysis of variance models, performing simultaneous multiple comparisons of treatment groups with a control group may be of interest. Dunnett's test is used to test such differences and assumes equal variances of the response variable across groups. This assumption is not always met, even after transformation. A parametric bootstrap (PB) method is developed here for comparing multiple treatment group means against the control group with unequal variances and unbalanced data. In simulation studies, the proposed method outperformed Dunnett's test in controlling the type I error under various settings, particularly when the data have heteroscedastic variances and an unbalanced design. Simulations show that power is often lower for the PB method than for Dunnett's test under equal variances, balanced data, or smaller sample sizes, but similar to or higher than that of Dunnett's test with unequal variances, unbalanced data, and larger sample sizes. The method is applied to a dataset concerning isotope levels found in elephant tusks from various geographical areas; these data have very unbalanced group sizes and unequal variances. This example illustrates that the PB method is easy to implement and avoids the need to transform data to meet the equal variance assumption, simplifying interpretation of results.

KEYWORDS: Dunnett's test, simulations, ANOVA, HeteANOVA, unbalanced data

MATHEMATICAL SUBJECT CLASSIFICATION: ANOVA

1. Introduction

Consider a one-way analysis of variance (ANOVA) problem with $a$ treatment groups, where the first group is a control group. Let $Y_{ij}$ be the value of the response variable in the $j$th trial for the $i$th factor level and $\mu + \alpha_i$ the mean for the $i$th factor level, $i = 1, 2, \ldots, a$, $j = 1, 2, \ldots, n_i$. The one-way ANOVA model is as follows:

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad (1)$$

where $\varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma_i^2)$ and $\sum_i \alpha_i = 0$.
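To make the model concrete, the following R sketch (ours, not from the paper) generates one dataset from model (1) with unequal group variances, using the unbalanced sample sizes, group effects, and variances that appear later in the simulation settings of Section 3.

```r
## Simulate one dataset from model (1) with heteroscedastic errors.
## Sizes, effects, and variances follow the n4, alpha_1, and sigma_5^2
## settings of Section 3; all object names are our own.
set.seed(1)
n     <- c(4, 6, 8, 12, 16, 20)         # group sizes; group 1 = control
alpha <- c(0, 0, 0.2, 0.2, 0.4, 0.8)    # group effects alpha_i
sig2  <- c(0.3, 0.9, 0.4, 0.7, 0.5, 1)  # group error variances sigma_i^2
mu    <- 0

g_idx <- rep(seq_along(n), times = n)   # integer group label per trial
y     <- mu + alpha[g_idx] + rnorm(sum(n), sd = sqrt(sig2[g_idx]))
dat   <- data.frame(y = y, group = factor(g_idx))
```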

One may wish to perform multiple comparisons of the treatment groups with the control group, rather than performing all pairwise comparisons. In this case, procedures such as the Tukey-Kramer test [16], which examine all pairwise comparisons among the treatment groups, can result in confidence intervals that are wider than necessary [7].

1.1. Dunnett's test

Under the equal variance assumption, Dunnett's test [5,7,19], which uses a procedure similar to Tukey's, can be used for comparing treatment groups only against the control. It is frequently used in clinical or pharmacological studies [4,18,24,25]. Dunnett's test compares $a-1$ pairs (each group with the control group), instead of the $\binom{a}{2}$ pairs involved in all pairwise comparisons. Dunnett's test uses the statistic

$$\frac{|\bar{Y}_1 - \bar{Y}_i|}{\sqrt{\hat{\sigma}^2\,(1/n_1 + 1/n_i)}}, \qquad (2)$$

where $\bar{Y}_i$ is the sample mean for group $i$, $i \neq 1$, with $\alpha_1$ the parameter associated with the control group, and $\hat{\sigma}^2$ is the pooled variance estimate $\sum_{i=1}^{a}\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2 \big/ \left(\sum_{i=1}^{a} n_i - a\right)$, which can be written as $\sum_{i=1}^{a}(n_i - 1)s_i^2 \big/ \left(\sum_{i=1}^{a} n_i - a\right)$ with $s_i^2 = \sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2/(n_i - 1)$.

When each pair is considered alone, the test statistic in (2) has a t distribution. However, when comparing all treatment groups against the control, the $a-1$ pairs form a family, and the critical value for constructing confidence intervals and determining statistical significance must be higher in order to keep the type I error rate for the entire family less than $\alpha$ [7,19]. Dunnett [7] used the multivariate t distribution to obtain critical values $d_i$, $i = 2, \ldots, a$, that satisfy $\Pr(|t_2| < d_2, \ldots, |t_a| < d_a) = 1 - \alpha$, where the $t_i$ are the test statistics in (2) for each treatment group. Dunnett [7] points out that there are infinitely many solutions for the $d_i$, but set $d_2 = d_3 = \cdots = d_a = d$. More details of this derivation can be found in Dunnett [7].

As described in Christensen [5] and Miller Jr. [19], Dunnett's test is based on knowing the distribution of the maximum over $i$ of the test statistic in Equation (2) when the null hypothesis of equality of all means is true: every statistic in (2) is less than a critical value $d$ exactly when the maximum of (2) over all $a-1$ pairs is less than $d$. When variances are equal, so that the pooled variance estimate is appropriate for all treatment groups, and data are balanced, (2) may be compared to the critical values tabulated by Dunnett [7,8]. Critical values for various practical scenarios are also available in software, e.g. the DunnettTest function in the R package DescTools [22] (see the sketch below). When the assumption of equal variance is violated, we can modify the test statistic to include separate variance estimates, as shown in the next section. However, the modified test statistic can no longer be compared to, or used to construct confidence intervals with, critical values obtained from a known distribution.
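For reference, a minimal call to that implementation might look as follows (a sketch assuming the simulated data frame dat from the Section 1 snippet, with factor level "1" as the control):

```r
## Dunnett's test via DescTools; `dat` is the illustrative data frame
## simulated earlier, with factor level "1" as the control group.
library(DescTools)
DunnettTest(y ~ group, data = dat, control = "1")
```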

When the assumption of equal variance is violated and the data are unbalanced (hereafter called the HeteANOVA problem), the results of Dunnett's test are questionable. Many alternative methods have been developed for the classical F-test and multiple comparisons in HeteANOVA problems [17,26,29]. Among them, the parametric bootstrap (PB) test [17,26] has been shown to be one of the best for testing equality of factor level means [27]. Recently, PB multiple comparison tests were proposed for one-way and two-way ANOVA [29,30] and shown to be competitive.

1.2. Parametric bootstrap

The following is largely paraphrased from Efron and Tibshirani [11]; additional details can be found there and in many other references, such as Efron [9], Shao and Tu [21], Bickel and Freedman [2], and Hall [15]. The bootstrap is a method for assigning measures of accuracy to statistical estimates. Consider a random sample of size $n$, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, from a probability distribution $F$. We would like to estimate some quantity of interest of $F$, for example the mean or median, from this random sample. The empirical distribution function $\hat{F}$ is the discrete distribution that puts probability $1/n$ on each value $x_i$, $i = 1, 2, \ldots, n$. One way to estimate the desired quantity of $F$, say $\theta = t(F)$, is to estimate the corresponding quantity of $\hat{F}$ (the plug-in principle). Thus, the plug-in estimate of $\theta$ is $\hat{\theta} = t(\hat{F})$. The bootstrap method is an application of the plug-in principle.

The accuracy of $\hat{\theta}$ can be assessed by using the plug-in principle to estimate the standard error (SE) of a statistic. For example, the SE of the mean $\bar{x}$ is $\sigma_F/\sqrt{n}$, where $\sigma_F$ is the true standard deviation and is typically unknown. The plug-in estimate of $\sigma_F$ is $\hat{\sigma}_F = \sigma_{\hat{F}} = \{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\}^{1/2}$, so the estimated SE is $\widehat{\text{se}}(\bar{x}) = \sigma_{\hat{F}}/\sqrt{n} = \{\sum_{i=1}^{n}(x_i - \bar{x})^2/n^2\}^{1/2}$. The bootstrap estimate of $\text{se}_F(\hat{\theta})$ is a plug-in estimate that uses the empirical distribution function $\hat{F}$ in place of the unknown distribution $F$.

Bootstrap methods employ a bootstrap sample: a random sample of size $n$ drawn from the empirical distribution $\hat{F}$, which puts probability $1/n$ on each of the observed values $x_i$ as described earlier. The bootstrap sample can be denoted $\mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_n^*)$, where $\mathbf{x}^*$ is a resampled version of $\mathbf{x}$: the bootstrap data points $(x_1^*, x_2^*, \ldots, x_n^*)$ are a random sample of size $n$ drawn with replacement from $(x_1, x_2, \ldots, x_n)$. From a bootstrap dataset $\mathbf{x}^*$, a 'bootstrap replication' of $\hat{\theta}$ can be calculated as $\hat{\theta}^* = s(\mathbf{x}^*)$, applying to $\mathbf{x}^*$ the same function $s(\cdot)$ that was applied to $\mathbf{x}$. By drawing many independent bootstrap samples and evaluating the bootstrap replications of $\hat{\theta}$, the standard error of $\hat{\theta}$ can be estimated by the empirical standard deviation $\widehat{\text{se}}_B$ of the replications, where $B$ is the number of replications. The limit of $\widehat{\text{se}}_B$ as $B$ approaches infinity is $\text{se}_{\hat{F}}(\hat{\theta})$, the bootstrap estimate of $\text{se}_F(\hat{\theta})$. The quantity $\widehat{\text{se}}_B$ itself has a variance, which approaches 0 as both $n$ and $B$ approach infinity. By the Glivenko-Cantelli theorem, the empirical distribution function converges uniformly almost surely to the true distribution function [12,23].
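As a concrete illustration (ours, not from [11]), the following R sketch computes the Monte Carlo approximation of the bootstrap SE for the sample median, a statistic whose SE has no simple closed form:

```r
## Nonparametric bootstrap estimate of the SE of a statistic: resample
## with replacement from the data (i.e. from F-hat) B times and take
## the empirical SD of the bootstrap replications.
boot_se <- function(x, stat = median, B = 2000) {
  reps <- replicate(B, stat(sample(x, length(x), replace = TRUE)))
  sd(reps)  # se-hat_B, which approaches se_{F-hat}(theta-hat) as B grows
}

set.seed(1)
x <- rexp(30)  # a random sample from some unknown F (illustrative)
boot_se(x)     # bootstrap SE of the sample median
```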

Approximations obtained by random sampling or simulation are called Monte Carlo estimates. The bootstrap distribution can be calculated theoretically or by Monte Carlo approximation [9], and a Monte Carlo algorithm is commonly used to numerically evaluate the bootstrap estimate of the standard error [10]. An advantage of the bootstrap is that it can be applied to statistics other than the mean $\bar{x}$, for which the SE may not have a known closed form.

The estimate $\text{se}_{\hat{F}}(\hat{\theta})$ and its approximation $\widehat{\text{se}}_B$ are sometimes referred to as nonparametric bootstrap estimates because they are based on $\hat{F}$, the nonparametric estimate of $F$. In the nonparametric setup, $F$ is completely unknown: only the random sample $\mathbf{x}$ is available, and $F$ is estimated by the empirical distribution $\hat{F}$. Bootstrap sampling can also be performed parametrically. The difference is that for the parametric bootstrap, the samples are drawn from a parametric estimate of the population, $\hat{F}_{\text{par}}$, rather than from $\hat{F}$: instead of estimating $F$ by the empirical distribution, one assumes some distributional form, such as normal, for the population, and instead of sampling with replacement from the data, one draws $B$ samples of size $n$ from $\hat{F}_{\text{par}}$. The parametric bootstrap can be used when some knowledge is available about the form of the underlying distribution, though the parameters of that distribution would be unknown [15]. It depends on the parametric model assumption; the nonparametric bootstrap is "model assumption free" but is not as efficient as the parametric bootstrap when the parametric model is correct. In general, the performance of the bootstrap relies on how well we can identify and estimate the model [21]. For example, in estimating the unknown parameters, the sample mean and variance are consistent estimators of the population mean and variance [3], so these estimates, and with them the adequacy of the parametric model, improve with increasing sample size $n$.
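A matching parametric version (again an illustrative sketch, not code from the paper) replaces resampling from $\hat{F}$ with draws from a fitted normal model $\hat{F}_{\text{par}}$:

```r
## Parametric bootstrap estimate of the SE of a statistic: estimate the
## parameters of an assumed normal model, then draw the B samples of
## size n from the fitted model instead of resampling the data.
par_boot_se <- function(x, stat = median, B = 2000) {
  mu_hat <- mean(x)  # consistent estimates of the
  sd_hat <- sd(x)    # normal model's parameters
  reps <- replicate(B, stat(rnorm(length(x), mu_hat, sd_hat)))
  sd(reps)
}

set.seed(1)
x <- rnorm(30, mean = 5, sd = 2)  # a sample from an assumed-normal F
par_boot_se(x)
```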

Parametric bootstrap has been used for one-way and two-way ANOVA [17,26] as well as multiple comparison procedures [29,30]. While the test statistics in these procedures do not follow known distributions, they are functions of parameter estimates that do follow known distributions (normal and chi-squared as discussed in the next section). Thus, a distribution for the test statistic can be obtained by computer simulation.

Inspired by Dunnett's test and PB tests, in this research we develop a PB test that is a special case of the PB multiple comparison procedures described in [29,30]. The PB test presented here is analogous to Dunnett's test and performs simultaneous multiple comparisons of the treatment groups with the control for the HeteANOVA problem. The paper is organized as follows: Section 2 proposes the methodology and presents the algorithm; Section 3 reports simulation studies evaluating type I error and power; Section 4 gives a real example; and Section 5 gives conclusions and discussion.

2. Proposed PB test and algorithm

In this section, we develop a PB method for multiple comparisons of treatment groups with the control group for a HeteANOVA problem and present an algorithm to implement the test.

2.1. Proposed PB test

Following the procedure from previous papers [17,29], consider the test statistic in Equation (2). We modify this to include the different group variances:

$$T_i = \frac{|\bar{Y}_1 - \bar{Y}_i|}{\sqrt{s_1^2/n_1 + s_i^2/n_i}}, \qquad (3)$$

where $s_i^2 = \sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2/(n_i - 1)$ and $s_1^2 = \sum_{j=1}^{n_1}(Y_{1j} - \bar{Y}_1)^2/(n_1 - 1)$. As noted previously, this test statistic no longer follows a known distribution; the aim of the PB method is to simulate its distribution. The test we consider is location invariant, so we assume without loss of generality that the mean of $\bar{Y}_i$ is zero for all $i$. Then $\bar{Y}_i \sim N(0, \sigma_i^2/n_i)$ and the sample variance $S_i^2 \sim \frac{\sigma_i^2}{n_i - 1}\chi^2_{(n_i-1)}$ [3]. These can be approximately simulated by the pivot variables $\bar{Y}_{Bi} \sim N(0, s_i^2/n_i)$, or equivalently $\bar{Y}_{Bi} = Z\sqrt{s_i^2/n_i}$ with $Z$ a $N(0,1)$ random variable, and $S_{Bi}^2 \sim \frac{s_i^2}{n_i - 1}\chi^2_{(n_i-1)}$.

We can replace $\bar{Y}_i$ and $s_i^2$ in Equation (3) with $\bar{Y}_{Bi}$ and $S_{Bi}^2$ to obtain a PB pivot variable:

$$T_{PBi} = \frac{|\bar{Y}_{B1} - \bar{Y}_{Bi}|}{\sqrt{S_{B1}^2/n_1 + S_{Bi}^2/n_i}}. \qquad (4)$$

As discussed in Section 1.1, Dunnett's test uses critical values from the distribution of the maximum over i of the test statistic in Equation (2) when the null hypothesis of equality of all means is true and treatment groups have equal variances. For the PB method, we simulate a distribution for the test statistic (3), using (4). With this simulated distribution, we can estimate the p-value or obtain a critical value which can be used to construct confidence intervals. The procedure is shown in the following algorithm, and example code for this algorithm is shown in the appendix.

2.2. Parametric bootstrap algorithm for comparing multiple treatment groups with control

Algorithm 1.

Step 1. From the observed data, compute the group sizes $n_i$, sample means $\bar{Y}_i$, and sample variances $s_i^2$, and the observed statistics $T_i$ in (3) for $i = 2, \ldots, a$.

Step 2. For $b = 1, \ldots, L$: generate $\bar{Y}_{Bi} \sim N(0, s_i^2/n_i)$ and $S_{Bi}^2 \sim s_i^2\,\chi^2_{(n_i-1)}/(n_i - 1)$ for $i = 1, \ldots, a$, and compute $T_{PB}^{(b)} = \max_{i \neq 1} T_{PBi}$ using (4).

Step 3. Estimate the simultaneous p-value for the $i$th comparison as the proportion of the $T_{PB}^{(b)}$ that exceed $T_i$; equivalently, the $(1-\alpha)$ quantile of the $T_{PB}^{(b)}$ gives a critical value $D_{crit}$ for declaring significance and constructing simultaneous confidence intervals.

3. Simulations

3.1. Type I error

To evaluate the performance of the algorithm, we simulated 2500 datasets with μ = 0 and αi = 0 for all i, such that H0 is true, and compared the rejection rates of Dunnett's test (using the DunnettTest function in the R package DescTools [22]) and the PB method (Algorithm 1) with L = 5000 bootstrap sample mean and variance vectors. We used a = 6 treatment groups including the control, with variance vectors σ1² = (1, 1, 1, 1, 1, 1), σ2² = (0.1, 0.1, 0.1, 0.5, 0.5, 0.5), σ3² = (1, 1, 1, 0.5, 0.5, 0.5), σ4² = (0.1, 0.2, 0.3, 0.4, 0.5, 1), σ5² = (0.3, 0.9, 0.4, 0.7, 0.5, 1), and σ6² = (0.01, 0.1, 0.1, 0.1, 0.1, 1). The sample size vectors used in the simulations were n1 = (5, 5, 5, 5, 5, 5), n2 = (10, 10, 10, 10, 10, 10), n3 = (3, 3, 4, 5, 6, 6), n4 = (4, 6, 8, 12, 16, 20), and n5 = (35, 35, 35, 35, 35, 35). The simulation settings follow from [30] and were chosen to include one variance vector with equal variances and three sample size vectors reflecting small, moderate, and relatively large balanced designs; the other vectors involve unequal variances and unbalanced data. In this way, we could compare the performance of the traditional Dunnett's test with the proposed PB method under conditions where the equal variance assumption is met against cases where it is not. Previous work has also shown traditional multiple comparison procedures to be robust to this violation when data are balanced but to perform more poorly with unbalanced data [30]. All calculations, simulations, and data analysis were performed in R [20].

Results are shown in Table 1. With the equal variance assumption met (σ1²), both Dunnett's test and the PB test give acceptable results. Under equal variance, the type I error rates are lower for the PB test than for Dunnett's test for n1, n2, and n3; for n3 at α = 0.05, the PB rates fall below the 95% Monte Carlo error interval $0.05 \pm 1.96\sqrt{0.05(0.95)/2500} = (0.0415, 0.0585)$, even in the unequal variance cases, likely because the simulated PB distribution can have a larger variance with smaller sample sizes. As noted in Section 1.2, the accuracy of the parametric model would be expected to be worse with smaller sample sizes. This is not observed for n4, which has larger sample sizes. Additionally, Dunnett's test performs satisfactorily in most heteroscedastic cases where data are balanced (though not with unbalanced data). The exception is σ6²: in this case, the type I error rate for Dunnett's test exceeds the nominal level even with balanced data. This variance vector includes the very small variance 0.01, likely leading to an artificially small pooled variance estimate and thus an artificially large test statistic, so the test rejects more often than the nominal level.

Table 1.

Simulation results for multiple comparisons of treatment group means vs. control.

                α = 0.05                                    α = 0.1
σ1²   Dunnett   PB   mean(max Ti)   Var(max Ti)   Dunnett   PB   mean(max Ti)   Var(max Ti)
n1 0.0516 0.0420 1.6926 0.7251 0.1024 0.0832 1.6792 0.6980
n2 0.0464 0.0360 1.5634 0.4469 0.1012 0.0928 1.5524 0.4632
n3 0.0448 0.0384 1.8204 1.5017 0.0976 0.0752 1.7969 1.4694
n4 0.0584 0.0592 1.5562 0.8392 0.1028 0.0988 1.5284 0.8229
n5 0.0484 0.0492 1.4720 0.3471 0.1000 0.0956 1.4653 0.3562
σ2²   Dunnett   PB   mean(max Ti)   Var(max Ti)   Dunnett   PB   mean(max Ti)   Var(max Ti)
n1 0.0464 0.0444 1.8568 0.8214 0.0888 0.0884 1.8485 0.8098
n2 0.0420 0.0460 1.6676 0.4658 0.0848 0.0996 1.6641 0.4863
n3 0.0100 0.0364 1.9029 1.2555 0.0324 0.0736 1.8837 1.2162
n4 0.0016 0.0540 1.6289 0.6557 0.0076 0.0900 1.6047 0.6398
n5 0.0404 0.0520 1.5575 0.3568 0.0748 0.0964 1.5474 0.3545
σ3²   Dunnett   PB   mean(max Ti)   Var(max Ti)   Dunnett   PB   mean(max Ti)   Var(max Ti)
n1 0.0704 0.0404 1.6311 0.7306 0.1260 0.0792 1.6167 0.7047
n2 0.0752 0.0364 1.5121 0.4572 0.1348 0.0916 1.4960 0.4689
n3 0.1000 0.0412 1.8092 1.7132 0.1720 0.0796 1.7813 1.7004
n4 0.1408 0.0592 1.5376 0.9661 0.2080 0.1008 1.5086 0.9416
n5 0.0712 0.0432 1.4190 0.3524 0.1276 0.0944 1.4127 0.3631
σ4²   Dunnett   PB   mean(max Ti)   Var(max Ti)   Dunnett   PB   mean(max Ti)   Var(max Ti)
n1 0.0456 0.0472 1.9028 0.8569 0.0768 0.0948 1.8983 0.8479
n2 0.0424 0.0424 1.6984 0.4741 0.0716 0.0980 1.6974 0.4939
n3 0.0104 0.0376 1.9589 1.3092 0.0268 0.0764 1.9551 1.3084
n4 0.0004 0.0496 1.6583 0.5980 0.0024 0.0904 1.6361 0.5890
n5 0.0368 0.0516 1.5824 0.3614 0.0640 0.0948 1.5742 0.3553
σ5²   Dunnett   PB   mean(max Ti)   Var(max Ti)   Dunnett   PB   mean(max Ti)   Var(max Ti)
n1 0.0348 0.0436 1.8183 0.7893 0.0764 0.0904 1.8100 0.7661
n2 0.0320 0.0412 1.6475 0.4501 0.0632 0.0936 1.6414 0.4753
n3 0.0184 0.0372 1.9176 1.4411 0.0504 0.0756 1.9142 1.4254
n4 0.0096 0.0556 1.6325 0.6940 0.0324 0.0944 1.6106 0.6925
n5 0.0316 0.0512 1.5477 0.3497 0.0668 0.0960 1.5402 0.3507
σ6²   Dunnett   PB   mean(max Ti)   Var(max Ti)   Dunnett   PB   mean(max Ti)   Var(max Ti)
n1 0.0996 0.0540 2.0542 1.1788 0.1412 0.1068 2.0529 1.1882
n2 0.1000 0.0416 1.7491 0.5207 0.1320 0.1000 1.7509 0.5460
n3 0.0308 0.0412 2.1341 1.7580 0.0544 0.0984 2.1664 1.9283
n4 0.0016 0.0460 1.7351 0.5764 0.0032 0.0880 1.7170 0.5785
n5 0.0964 0.0492 1.6120 0.3796 0.1272 0.0892 1.6049 0.3629

Notes: Numbers in the table are empirical type I error rates. We consider five different sample size vectors and six different variance vectors, as described in Section 3, at the two α levels shown. mean(max Ti) is the mean of the maximum of the test statistic (3) over all simulated datasets, and Var(max Ti) is its variance.

The PB test outperforms Dunnett's test, with type I error rates close to the nominal level for all simulation settings, including those with unequal variances and unbalanced data. In all heteroscedastic cases except σ3², Dunnett's test is too conservative (rejection rate below the nominal level) when the data are unbalanced. In these cases, the smaller variances in the simulations belong to groups with smaller sample sizes and the larger variances to groups with larger sample sizes, so the pooled variance estimate is artificially large and the test statistic artificially small. The opposite is true for σ3², which assigns the smaller variances to the larger groups, so the pooled variance estimate is too small and the test statistic too large.

In Dunnett's test statistic, the pooled variance estimate appears in the denominator, and each group's sample variance is weighted by its sample size minus one. Thus, if a group with a relatively large sample size also has a relatively large variance (relative to the other groups), the pooled variance estimate will tend to be large and Dunnett's test statistic small, making Dunnett's test too conservative. This is observed in the type I error results for the unbalanced data settings (n3 and n4) for all unequal variance settings except σ3², with type I error rates well below the nominal level for Dunnett's test. For σ3², where the group with a relatively large sample size has a relatively small variance, the pooled variance estimate will tend to be small and the test statistic large, leading to inflated type I error rates for Dunnett's test, as observed in the simulations for the unbalanced data settings with σ3².
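As a quick numerical check (our own arithmetic, using the n4 and σ5² settings of Section 3.1), the pooled estimate illustrates this weighting:

```r
## Pooled variance under the n4 and sigma_5^2 settings, taking each
## sample variance equal to its population value for illustration.
n  <- c(4, 6, 8, 12, 16, 20)
s2 <- c(0.3, 0.9, 0.4, 0.7, 0.5, 1)
sum((n - 1) * s2) / (sum(n) - length(n))
## about 0.71: more than double the smallest group variances, so the
## denominators of Dunnett's statistics for those groups are inflated.
```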

When data are balanced, Dunnett's test is fairly robust in the simulations to the pooled variance estimate being too large or too small. The simulations illustrate this in all balanced settings except those paired with σ6², which has a much larger spread of group variances than the other unequal variance settings; in that case, the type I error rate is far from the nominal level even with balanced data. With unbalanced data and σ6², the small variance belongs to the smallest groups and is 'outweighed' in the pooled variance estimate by the largest variance, which belongs to the largest groups; this again shrinks the test statistic, making Dunnett's test too conservative.

3.2. Evaluation of power

To evaluate the performance of the algorithm in terms of power, we simulated 2500 datasets for each combination of settings with μ = 0 and α1 = (0, 0, 0.2, 0.2, 0.4, 0.8) or α2 = (0, 0, 0.3, 0.3, 0.6, 1.2), such that H0 is not true, and compared the rejection rates of Dunnett's test (using the DunnettTest function in the R package DescTools [22]) and the PB method (Algorithm 1) with L = 5000 bootstrap sample mean and variance vectors. We used a = 6 treatment groups including the control, with the same sample size and variance vectors as in Section 3.1. The simulation settings follow from [26,30].

Results are shown in Table 2. With equal variances and balanced data (σ1² with n1 or n2), power was similar between the two methods or somewhat lower for the PB version. With unequal variances and unbalanced data, power was often lower for the PB version with the smaller sample size vector (n3) but somewhat higher for the PB version with the larger sample size vector (n4). As expected, the power of both tests is generally higher with the mean vector α2, which has larger differences between groups, and with the sample size vectors n2 and n4, which give larger samples for most groups. Dunnett's test generally had higher power with σ3²; however, it also had an inflated type I error rate for this setting, particularly with unbalanced data, so the power cannot be directly compared. As noted in the type I error results, this variance vector assigns larger group variances to smaller sample sizes in these simulation settings, so the pooled variance estimate tends to be smaller and the test statistic larger, leading Dunnett's test to reject more often.

Table 2.

Simulation results: multiple comparisons of treatment group means vs. control – power.

            α1                  α2
σ1²   Dunnett   PB   Dunnett   PB
  n1 0.1460 0.0948 0.3012 0.1904
  n2 0.2660 0.2140 0.5720 0.4804
  n3 0.1192 0.0688 0.2288 0.1144
  n4 0.1944 0.1624 0.4304 0.3076
  n5 0.8256 0.8140 0.9976 0.9980
σ2²   Dunnett   PB   Dunnett   PB
  n1 0.4288 0.2980 0.7912 0.6236
  n2 0.7676 0.6936 0.9888 0.9800
  n3 0.2240 0.1968 0.5928 0.4600
  n4 0.4208 0.7008 0.9236 0.9656
  n5 1.0000 1.0000 1.0000 1.0000
σ3²   Dunnett   PB   Dunnett   PB
  n1 0.2016 0.1192 0.4108 0.2604
  n2 0.3760 0.2872 0.7396 0.6312
  n3 0.2144 0.0820 0.3868 0.1572
  n4 0.3964 0.1740 0.6940 0.3292
  n5 0.9384 0.9232 1.0000 1.0000
σ4²   Dunnett   PB   Dunnett   PB
  n1 0.3304 0.1928 0.6208 0.3992
  n2 0.5772 0.4308 0.9264 0.8160
  n3 0.1612 0.1264 0.4164 0.2744
  n4 0.2148 0.5168 0.7296 0.8940
  n5 0.9932 0.9916 1.0000 1.0000
σ5²   Dunnett   PB   Dunnett   PB
  n1 0.2000 0.1516 0.4348 0.3136
  n2 0.3844 0.3492 0.7908 0.7188
  n3 0.1196 0.0880 0.3084 0.1852
  n4 0.1876 0.2980 0.5716 0.6296
  n5 0.9572 0.9640 1.0000 1.0000
σ6²   Dunnett   PB   Dunnett   PB
  n1 0.5204 0.4060 0.8132 0.7604
  n2 0.7756 0.8620 0.9812 0.9972
  n3 0.3340 0.2944 0.6468 0.6624
  n4 0.4564 0.9500 0.9232 1.0000
  n5 1.0000 1.0000 1.0000 1.0000

Notes: Numbers in the table are simulated power. We consider five different sample size vectors, six different variance vectors, and two different mean vectors, as described in Sections 3.1 and 3.2.

Power was often lower for the PB test, particularly in the settings with small sample sizes, even when comparison is restricted to settings where the type I error rates of the two tests are similar. However, with the mean vector that includes larger differences between means (α2) and larger sample sizes (n4 and n5), both tests demonstrated acceptable power in most of the variance settings. For the smaller mean vector and the setting σ3² with n4, Dunnett's test had higher power than the PB test, but it also had a type I error rate well above the nominal level for that setting. Similarly, for σ6² with n4, the PB test had higher power while keeping its type I error rate near the nominal level, whereas the type I error rate for Dunnett's test was well below the nominal level for that setting.

4. Application

An example of the method is shown by applying it to the data found in the supplementary material of Ziegler et al. [31]. The data concern isotope levels in elephant tusks from different geographical areas. Summary statistics are shown in Table 3. The sample sizes and variances are somewhat similar to the type I error simulation setting n4 with σ5², although in that setting the largest group had the largest variance, which is not true for this example dataset; the groups are reordered in Table 3 to ease this comparison. We considered Asia to be the 'control' group, as the other regions are in Africa. While Ziegler et al. [31] examined all pairwise comparisons of the different regions using the Games-Howell post-hoc test [14], another possible question of interest is whether any of the African regions differ from Asia (rather than additionally comparing all of the African regions with each other). The data are very unbalanced, and we can see from Table 3 and Figure 1 that the variances appear unequal for the δ15N isotope ratio (nitrogen stable isotope ratios expressed in δ units). Ziegler et al. [31] examined several other isotopes and performed additional classification procedures, but we limited the analysis in this study to one isotope simply to illustrate the method.

Table 3.

Summary statistics, δ15N, elephant tusk data.

Region   ni   Ȳi   si²
Asia (i=1) 8 8.49 1.62
East Africa (i=3) 37 9.78 6.15
West Africa (i=5) 69 5.85 1.40
Central Africa (i=2) 120 9.37 3.71
Southern Africa (i=4) 261 8.93 2.78

Figure 1. δ15N by region, elephant tusk data.

We fit the one-way ANOVA model and then checked the assumptions of normality and constant variance. By the Shapiro-Wilk test for normality, using the shapiro.test function in R (W = 0.985, p-value near 0), and by examination of a normal plot of the residuals, the normality assumption was violated. We also performed the Breusch-Pagan (BP) test, using the function bptest from the R package lmtest [28], and Levene's test, using the leveneTest function from the R package car [13]. The p-values from both formal tests for equal variance were near 0, and the fitted-residual plot from the ANOVA model indicated unequal group variances.
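These checks can be reproduced along the following lines (a sketch only: dat, d15N, and region are illustrative names for the supplementary-material data, not objects from the paper):

```r
## Assumption checks for the one-way ANOVA model (illustrative names;
## `region` is assumed to be a factor).
library(lmtest)  # bptest()
library(car)     # leveneTest()

fit <- aov(d15N ~ region, data = dat)   # one-way ANOVA fit
shapiro.test(residuals(fit))            # Shapiro-Wilk normality test
plot(fitted(fit), residuals(fit))       # fitted-residual plot
bptest(d15N ~ region, data = dat)       # Breusch-Pagan test
leveneTest(d15N ~ region, data = dat)   # Levene's test
```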

A log transformation was attempted to satisfy assumptions. The normality assumption was then satisfied by appearance of the normal plot and formally by the Shapiro-Wilk test, with W = 0.996 and p-value = 0.333. The fitted-residual plot was somewhat improved after transformation but still appeared to violate the equal variance assumption. The p-values for the BP test and Levene's test were 0.031 and 0.013, respectively, for the transformed data. The fitted-residual plots before and after transformation are shown in the appendix.

We performed Dunnett's test, using the previously mentioned R function, on both the untransformed and the log-transformed data; both analyses found a significant difference in δ15N isotope levels between Asia and West Africa, but not between Asia and the other African regions studied. We then performed the analogous PB test, which came to the same conclusion. The differences between means, confidence intervals, and p-values are shown in Tables 4 and 5 for Dunnett's test and Table 6 for the PB test. Figure 2 illustrates the method with a histogram of the PB simulated null distribution, its critical value, and the test statistic for comparing Asia to West Africa.

We note that Ziegler et al. [31] also found a significant difference between Asia and East and Central Africa for this isotope (see Table A4 in their supplementary material). The data they report in their results (see Table 1 of [31]) contained 507 observations, including 20 from Asia, while the data we used from their supplementary material contained 495 observations, with only 8 from Asia. Additionally, in the supplementary material data, Rwanda appears to be classified as part of Central Africa, but in their Table 1 it is classified as East Africa. So it is possible that the difference between our findings and those of Ziegler et al. [31] is due to these differences in sample sizes, with the missing observations coming from Asia.

Finally, Ziegler et al. [31] state that, with some exceptions, single isotope markers alone are of little use for forensic purposes, so we emphasize that findings from this study for one particular isotope would best be combined with other results for practical application. They also describe several biological and environmental factors to consider when interpreting findings. In the interest of brevity, however, we chose to illustrate the proposed method with only one of the isotope ratios.

Table 4.

Results from Dunnett's test, elephant data.

  Diff Lower CI Upper CI p-value
Central Africa-Asia 0.89 −0.53 2.31 0.27
East Africa-Asia 1.29 −0.23 2.80 0.11
Southern Africa-Asia 0.44 −0.96 1.83 0.69
West Africa-Asia −2.64 −4.09 −1.19 0.00

Table 5.

Results from Dunnett's test, log elephant data.

  Diff Lower CI Upper CI p-value
Central Africa-Asia 0.09 −0.07 0.25 0.35
East Africa-Asia 0.12 −0.05 0.29 0.21
Southern Africa-Asia 0.04 −0.12 0.20 0.77
West Africa-Asia −0.38 −0.54 −0.22 0.00

Table 6.

Results from PB test, elephant data.

  Diff Lower CI Upper CI p-value
Central Africa-Asia 0.89 −0.42 2.19 0.19
East Africa-Asia 1.29 −0.35 2.93 0.12
Southern Africa-Asia 0.44 −0.81 1.69 0.61
West Africa-Asia −2.64 −3.91 −1.36 0.00

Figure 2. PB distribution, elephant tusk data.

In this case, while the log transformation corrected the violation of the normality assumption, it could not completely correct the violation of the equal variance assumption, though the fitted-residual plot was somewhat improved. Both methods came to the same conclusion for these data. However, Dunnett's test requires the equal variance assumption, so conclusions based on it can be questionable when this assumption is violated. Dunnett's test appears from the simulations to be robust to this violation in most cases when data are balanced, but not when data are unbalanced. In this example dataset, the groups with the largest and smallest variances have relatively moderate sample sizes, and the group with the largest sample size has a relatively moderate variance, so the test statistic for Dunnett's test was not made excessively large or small, and the test came to the same conclusion as the PB test. However, the PB test does not require the equal variance assumption, so violating this assumption does not call its results into question. The PB test also avoids the need for transformation, which can make interpretation of results more difficult.

Additionally, for the difference between Asia and West Africa, the PB method produced narrower confidence intervals. The mean squared error from the ANOVA model, which would be used as the pooled variance estimate in the traditional Dunnett's test, was 3.046, much larger than the variances for Asia and West Africa in particular, so it produces a test statistic that is smaller than necessary (and thus less likely to reject the null hypothesis of no difference between these two groups). This is somewhat similar to the type I error simulation setting n4 with σ5², although in that setting the largest group had the largest variance, whereas in the elephant dataset the largest group has the third largest variance. Still, in that simulation setting, Dunnett's test was too conservative because the pooled variance estimate was too large for some groups, which parallels the finding here that the confidence intervals from Dunnett's test are wider than those of the PB test for the Asia and West Africa comparison.

While the PB method relies on the normality assumption, its calculations use group means, which by the Central Limit Theorem [3] should be approximately normal regardless of the distribution of the individual observations, at least for large samples; it is therefore plausible that the PB test is also robust to violations of the normality assumption.

5. Conclusions and discussion

In this research, we looked at Dunnett's test from a parametric bootstrap point of view and proposed a PB test for comparing treatment groups with the control. Simulation results show that both Dunnett's test and the PB test give acceptable results under the equal variance assumption. Additionally, when data are balanced, Dunnett's test performs satisfactorily in most heteroscedastic cases. However, for HeteANOVA problems, when the equal variance assumption is violated and data are unbalanced, Dunnett's test tends to have type I error rates different from the nominal levels, while the proposed PB method's type I error rates were near the nominal levels. The example shows that the classical approach of transforming the data to deal with unequal variance is not guaranteed to succeed, and interpretation of the results after transformation can be difficult. The proposed PB test is robust to violation of equal variance and balanced design, and it is easy to implement.

While Dunnett's test performed satisfactorily in most balanced data cases in the simulations, its rejection rate can be much higher or lower than the nominal level for the HeteANOVA problem. One reason is that, depending on the sample sizes, if one group's variance is much smaller than the others', the pooled variance estimate will be too large for that group, leading to an artificially small test statistic; similarly, if one group's variance is much larger than the others', the pooled estimate will be too small for that group, leading to an artificially large test statistic. This issue can be amplified when a group with a relatively large sample size also has a variance that is markedly larger or smaller than those of the other groups.

One limitation of the proposed PB method is that it requires the normality assumption, so if a particular dataset violates both assumptions, a transformation may still be needed; however, as in the example here, the PB method may be robust to violation of the normality assumption as well, particularly for large sample sizes.

Another limitation of the PB test is that, in the simulation results, it tended to have low power for sample sizes n1, n2, and n3. Because the power of the PB test tended to be low for these smaller sample sizes, a researcher might gain insight by looking at both the PB test and Dunnett's test when sample sizes are small. If the PB test does not reject but Dunnett's test does, and the groups with large sample sizes (relative to the other groups) also have relatively small variances, one might suspect that Dunnett's test is making a type I error, as shown in the simulations with σ3², particularly in the unbalanced data cases. If the larger sample size groups have larger variances, as in several of the simulation settings, Dunnett's test is less likely to make a type I error. However, a limitation of this study is that we do not determine precisely how large the differences in variance must be, or how unbalanced the data must be, for Dunnett's test to no longer be robust to the violation. One of the assumptions of Dunnett's test is equal variance, and when this is violated, a researcher should not have high confidence in the conclusions, as with any statistical method whose assumptions are violated.

The simulation results do suggest that Dunnett's test is often robust to violation of the equal variance assumption when data are balanced (but not always: for σ6², which had larger differences in variances, Dunnett's test showed inflated type I error even with balanced data). We therefore would not recommend using Dunnett's test alone in the setting of unequal variance and unbalanced data. However, based on these simulation results, sample sizes less than 20 (the largest sample size in n4) warrant caution about the power of the PB test. This did not appear to be an issue for sample sizes of 20 or larger: for n4 (the larger of the two unbalanced data vectors), the PB test had higher power in all unequal variance settings except σ3², where the small variance paired with the largest sample size leads to an inflated test statistic, and therefore an inflated type I error rate along with higher power, for Dunnett's test. It is difficult to compare power between two tests whose type I error rates differ, but in this case, when the test statistic is inflated because a larger group has a smaller variance, Dunnett's test tends to reject more often under both the null and the alternative hypothesis, so both its type I error rate and its power tend to be increased.

If a researcher encounters a situation with unequal variances where the PB test does not reject but Dunnett's test does, where (unbalanced) sample sizes are less than 20 and the larger groups do not have relatively smaller variances, they may suspect that the PB test is making a type II error. In this case a reasonable approach would be to transform the data to meet the equal variance assumption and then perform Dunnett's test on the transformed data. However, with sample sizes of 20 or larger, unequal variance, and unbalanced data, the PB method is a viable option and avoids the need for transformation.

Additionally, as described in Section 4.3 of [6], we may need to exercise caution when making practical decisions based on differences in means between groups with unequal variances. For example, if a lower value of a response is desired, such as blood pressure, a treatment group with a smaller mean and smaller variance may have a smaller probability of achieving the desired outcome than a treatment group with a larger mean but also larger variance. Thus, additional consideration of the implications for the practical issue being studied is warranted.

With the sample size vector that includes group sizes as small as 3 (n3), the PB test did have somewhat lower power than Dunnett's test, and its type I error rates were often below the nominal level. This is likely due to larger variance in the bootstrap distribution of the test statistic; as shown in Table 1, the variances of the test statistics for n3 were large. While these were computed from the test statistics of each simulated dataset, the idea of this application of PB is that the distribution of the test statistic matches the simulated one. The large variance of the test statistic in settings with small sample size vectors was also noted in most of the simulations under alternative hypotheses for the evaluation of power (these variances not shown). Despite these limitations, the proposed PB test is a viable method for performing multiple comparisons of treatment groups versus the control for the HeteANOVA problem and, in the simulation settings of this study, controlled the type I error rate well.

Areas for future study could include more complex study designs. A simple step up in complexity would be a two-factor ANOVA model in which comparing a group with both factors at baseline against each combination of factor levels is desirable (rather than all pairwise comparisons of the factor combinations). Recently, an R package was developed that includes the algorithm presented here [1].

Appendix.

R Code: The following code is one way to program the PB test (Algorithm 1) to simulate a distribution for the PB test statistic. The output here is the test statistic and the p-value, but the code could be modified to return other values, for example Dcrit or confidence intervals. Processing time was checked for L = 100,000 for select scenarios from the simulation studies (n1 and n2 with σ1², and n4 with σ1² and σ5²). For one simulated dataset with these scenarios, the maximum processing time was 2.559 seconds. With L = 5000, the maximum processing time for one of these datasets was 0.164 seconds.
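The published listing is not reproduced here; the following is a minimal sketch consistent with Algorithm 1 (the function name pb_dunnett and its arguments are our own, not the published code):

```r
## PB test for comparing treatment groups with the control (group 1 of
## factor g). Returns each statistic T_i from Equation (3) with a
## simultaneous p-value; quantile(Tmax, 1 - alpha) would give D_crit.
pb_dunnett <- function(y, g, L = 5000) {
  g    <- factor(g)
  ni   <- tapply(y, g, length)  # group sizes n_i
  ybar <- tapply(y, g, mean)    # group means
  s2   <- tapply(y, g, var)     # group sample variances s_i^2
  a    <- length(ni)

  ## Observed statistics T_i, Equation (3): groups 2..a vs. the control
  Ti <- abs(ybar[1] - ybar[-1]) / sqrt(s2[1] / ni[1] + s2[-1] / ni[-1])

  ## L replicates of max_i T_PBi via the pivot variables in Equation (4)
  Tmax <- replicate(L, {
    ybarB <- rnorm(a, mean = 0, sd = sqrt(s2 / ni))
    s2B   <- s2 * rchisq(a, df = ni - 1) / (ni - 1)
    max(abs(ybarB[1] - ybarB[-1]) /
          sqrt(s2B[1] / ni[1] + s2B[-1] / ni[-1]))
  })

  list(statistic = Ti,
       p.value   = sapply(Ti, function(t) mean(Tmax >= t)))
}

## Example use with the data frame simulated in Section 1:
## pb_dunnett(dat$y, dat$group)
```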


Figure A1. Fitted-residual plots before and after transformation, elephant data.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Alver S. and Zhang G., pbANOVA: Parametric Bootstrap for ANOVA Models, 2022. https://CRAN.R-project.org/package=pbANOVA. R package version 0.1.0.
  • 2. Bickel P.J. and Freedman D.A., Some asymptotic theory for the bootstrap, Ann. Stat. 9 (1981), pp. 1196–1217.
  • 3. Casella G. and Berger R.L., Statistical Inference, 2nd ed., Duxbury, Pacific Grove, CA, 2002.
  • 4. Cheng W.S.C., Murphy T.L., Smith M.T., Cooksley W.G.E., Halliday J.W., and Powell L.W., Dose-dependent pharmacokinetics of caffeine in humans: Relevance as a test of quantitative liver function, Clin. Pharmacol. Ther. 47 (1990), pp. 516–524. doi: 10.1038/clpt.1990.66.
  • 5. Christensen R., Analysis of Variance, Design, and Regression: Applied Statistical Methods, Chapman and Hall/CRC, Boca Raton, FL, 1996.
  • 6. Christensen R., Analysis of Variance, Design, and Regression: Linear Modeling for Unbalanced Data, 2nd ed., CRC Press, Boca Raton, FL, 2016.
  • 7. Dunnett C.W., A multiple comparison procedure for comparing several treatments with a control, J. Am. Stat. Assoc. 50 (1955), pp. 1096–1121. http://www.jstor.org/stable/2281208.
  • 8. Dunnett C.W., New tables for multiple comparisons with a control, Biometrics 20 (1964), pp. 482–491.
  • 9. Efron B., Bootstrap methods: Another look at the jackknife, Ann. Stat. 7 (1979), pp. 1–26.
  • 10. Efron B. and Tibshirani R., Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy, Stat. Sci. 1 (1986), pp. 54–75.
  • 11. Efron B. and Tibshirani R.J., An Introduction to the Bootstrap, Chapman & Hall, New York, NY, 1993.
  • 12. Ferguson T.S., A Course in Large Sample Theory, Chapman & Hall, London, UK, 1996.
  • 13. Fox J. and Weisberg S., An R Companion to Applied Regression, 3rd ed., Sage, Thousand Oaks, CA, 2019. Available at https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
  • 14. Games P.A. and Howell J.F., Pairwise multiple comparison procedures with unequal N's and/or variances: A Monte Carlo study, J. Educ. Stat. 1 (1976), pp. 113–125.
  • 15. Hall P., The Bootstrap and Edgeworth Expansion, Springer-Verlag, New York, NY, 1992.
  • 16. Kramer C.Y., Extension of multiple range tests to group means with unequal numbers of replications, Biometrics 12 (1956), pp. 307–310.
  • 17. Krishnamoorthy K. and Lu F., A parametric bootstrap approach for ANOVA with unequal variances: Fixed and random models, Comput. Stat. Data Anal. 51 (2007), pp. 5731–5742. doi: 10.1016/j.csda.2006.09.039.
  • 18. Kutuk Z.B., Ergin E., Cakir F.Y., and Gurgan S., Effects of in-office bleaching agent combined with different desensitizing agents on enamel, J. Appl. Oral Sci. 27 (2019). Available at http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1678-77572019000100406&nrm=iso.
  • 19. Miller R.G. Jr., Simultaneous Statistical Inference, 2nd ed., Springer-Verlag, New York; Heidelberg; Berlin, 1981.
  • 20. R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2020. Available at https://www.R-project.org/.
  • 21. Shao J. and Tu D., The Jackknife and Bootstrap, Springer, New York, NY, 1995.
  • 22. Signorell A. et al., DescTools: Tools for Descriptive Statistics, 2020. https://cran.r-project.org/package=DescTools. R package version 0.99.38.
  • 23. Sprenger J., Science without (parametric) models: The case of bootstrap resampling, Synthese 180 (2011), pp. 65–76.
  • 24. Strojek K., Yoon K.H., Hruba V., Elze M., Langkilde A.M., and Parikh S., Effect of dapagliflozin in patients with type 2 diabetes who have inadequate glycaemic control with glimepiride: A randomized, 24-week, double-blind, placebo-controlled trial, Diabetes Obes. Metab. 13 (2011), pp. 928–938. doi: 10.1111/j.1463-1326.2011.01434.x.
  • 25. Tallarida R. and Murray R., Dunnett's test (comparison with a control), in Manual of Pharmacologic Calculations, Springer, New York, NY, 1987.
  • 26. Xu L.-W., Yang F.-Q., Abula A., and Qin S., A parametric bootstrap approach for two-way ANOVA in presence of possible interactions with unequal variances, J. Multivar. Anal. 115 (2013), pp. 172–180.
  • 27. Yigit E. and Gökpınar F., A simulation study on tests for one-way ANOVA under the unequal variance assumption, Commun. Fac. Sci. Univ. Ank. Ser. 59 (2010), pp. 15–34.
  • 28. Zeileis A. and Hothorn T., Diagnostic checking in regression relationships, R News 2 (2002), pp. 7–10. Available at https://CRAN.R-project.org/doc/Rnews/.
  • 29. Zhang G., A parametric bootstrap approach for one-way ANOVA under unequal variances with unbalanced data, Commun. Stat. Simul. Comput. 44 (2015), pp. 827–832. doi: 10.1080/03610918.2013.794288.
  • 30. Zhang G., Simultaneous confidence intervals for pairwise multiple comparisons in a two-way unbalanced design with unequal variances, J. Stat. Comput. Simul. 85 (2015), pp. 2727–2735. doi: 10.1080/00949655.2014.935735.
  • 31. Ziegler S., Merker S., Streit B., Boner M., and Jacob D.E., Towards understanding isotope variability in elephant ivory to establish isotopic profiling and source-area determination, Biol. Conserv. 197 (2016), pp. 154–163.
