
Box 1 Examples of specialized techniques that may result in a high rate of false-positive findings due to unrecognized problems of pseudoreplication

(A) Many researchers, being concerned about fitting an “inappropriate” Gaussian model, hold the belief that binomial data always require modeling a binomial error structure, and that count data mandate modeling a Poisson-like process. Yet what they consider “more appropriate for the data at hand” may often fail to acknowledge the non-independence of events in count data (Forstmeier et al., 2017; Harrison, 2014, 2015; Ives, 2015). For instance, in a study of butterflies choosing between two species of host plants for egg laying, an individual butterfly may first sit down on species A and deposit a clutch of 50 eggs, followed by a second landing on species B where another 50 eggs are laid. If we characterize this individual’s host preference for species A by the total number of eggs deposited (p(A) = 0.5, N = 100), we obtain a highly anticonservative estimate of uncertainty (95% CI for p(A): 0.398–0.602), whereas if we base our preference estimate on the number of landings (p(A) = 0.5, N = 2), we obtain a much more appropriate confidence interval (95% CI for p(A): 0.013–0.987). Even some methodological “how-to” guides (e.g., Fordyce et al., 2011; Harrison et al., 2018; Ramsey & Schafer, 2013) neglect to explain clearly that it is essential to model the non-independence of events via random effects or overdispersion parameters (Harrison, 2014, 2015; Ives, 2015; Zuur et al., 2009). Unfortunately, non-Gaussian models with multiple random effects often fail to converge (e.g., Brooks et al., 2017), which frequently leads researchers to settle for a model that ignores non-independence and yields estimates with inappropriately high confidence and statistical significance (Arnqvist, 2020; Barr et al., 2013; Forstmeier et al., 2017).
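The two confidence intervals quoted above can be reproduced with any exact (Clopper–Pearson) binomial method; as a minimal sketch in Python (assuming SciPy ≥ 1.7, where scipy.stats.binomtest provides a proportion_ci method):

```python
from scipy.stats import binomtest

# Pseudoreplicated analysis: each of the 100 eggs treated as an
# independent trial
per_egg = binomtest(k=50, n=100).proportion_ci(
    confidence_level=0.95, method="exact")
print(per_egg)      # ~ (0.398, 0.602): deceptively narrow

# Analysis at the level of the independent events: the 2 landings
per_landing = binomtest(k=1, n=2).proportion_ci(
    confidence_level=0.95, method="exact")
print(per_landing)  # ~ (0.013, 0.987): appropriately wide
```

Note that the point estimate (p(A) = 0.5) is identical in both analyses; only the implied precision differs, which is exactly where pseudoreplication does its damage.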
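To see how severely a per-egg analysis inflates false positives, one can simulate data under the null hypothesis of no preference, with eggs deposited in whole clutches. The following is our own illustrative simulation; the numbers of landings and the clutch size are arbitrary assumptions, not values from any cited study:

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(42)
n_sims, n_landings, clutch_size = 1000, 20, 50  # e.g., 10 females x 2 landings

false_pos = 0
for _ in range(n_sims):
    # Under the null, each landing goes to plant A with probability 0.5,
    # and the entire clutch is deposited on the chosen plant.
    landings_on_A = rng.binomial(n_landings, 0.5)
    eggs_on_A = int(landings_on_A) * clutch_size
    total_eggs = n_landings * clutch_size
    # Pseudoreplicated test: every egg counted as an independent trial
    if binomtest(eggs_on_A, total_eggs).pvalue < 0.05:
        false_pos += 1

print(false_pos / n_sims)  # far above the nominal 0.05 (roughly 0.8 here)
```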

(B) When observational data do not comply with any distributional assumption, randomization techniques like bootstrapping seem to offer an ideal solution for working out the rate at which a certain estimate arises by chance alone (Good, 2005). However, such resampling can also be risky in terms of producing false-positive findings if the data are structured (temporal autocorrelation, random effects; e.g., Ihle et al., 2019) and if this structure is not accounted for in the resampling regime (blockwise bootstrap; e.g., Önöz & Bayazit, 2012). Specifically, there is the risk that non-independence introduces a strong pattern into the observed data, but comparably strong patterns do not emerge in the simulated data because the confounding non-independencies were broken up (Ihle et al., 2019). We argue that pseudoreplication is a well-known problem that has been solved reasonably well within the framework of mixed models, where the consideration or neglect of essential random effects can be readily judged from tables that present the model output. In contrast, the issue of pseudoreplication is more easily overlooked in studies that implement randomization tests, where the credibility of findings hinges on details of the resampling procedure that are not understood by the majority of readers. One possible way of validating a randomization procedure is to repeat an experiment several times and to combine all the obtained effect estimates with their SEs in a formal meta-analysis. If the meta-analysis indicates substantial heterogeneity in effect sizes (I² > 0), then the SEs obtained from the randomizations were apparently too small (anticonservative) and hence do not allow general conclusions that would also hold up in independent repetitions of the experiment. Unfortunately, such validations on real data are rarely carried out when a new randomization approach is introduced, and this shortcoming may mean that numerous empirical studies publish significant findings (due to a high type I error rate) before the methodological flaw is discovered.
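As an illustration of why the resampling unit matters, the following Python sketch (our own toy example, not taken from any of the cited papers) bootstraps clustered data once naively, observation by observation, and once blockwise by cluster. The naive scheme breaks up the non-independence and returns a standard error that is far too small:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy clustered data: 10 clusters (e.g., individuals), 20 observations each;
# a large random cluster effect makes observations within a cluster dependent
n_clusters, n_per = 10, 20
cluster_effects = rng.normal(0, 2, size=n_clusters)
y = cluster_effects[:, None] + rng.normal(0, 1, size=(n_clusters, n_per))

def naive_boot_se(y, n_boot=2000):
    # resample individual observations, ignoring the cluster structure
    flat = y.ravel()
    return np.std([rng.choice(flat, flat.size).mean() for _ in range(n_boot)])

def blockwise_boot_se(y, n_boot=2000):
    # resample whole clusters, keeping within-cluster dependence intact
    rows = y.shape[0]
    return np.std([y[rng.integers(0, rows, rows)].mean()
                   for _ in range(n_boot)])

print(naive_boot_se(y))      # roughly 0.15: anticonservative
print(blockwise_boot_se(y))  # roughly 0.6: close to the true SE of the mean
```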
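The proposed validation via meta-analysis can likewise be sketched in a few lines. Assuming effect estimates and their randomization-based SEs from several repetitions of an experiment, Cochran’s Q and I² = max(0, (Q − df)/Q) quantify whether the estimates scatter more than their SEs allow. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical effect estimates and randomization-based SEs
# from five independent repetitions of the same experiment
est = np.array([0.42, 0.10, 0.55, -0.08, 0.31])
se = np.array([0.08, 0.09, 0.07, 0.08, 0.10])

w = 1 / se**2                          # inverse-variance weights
pooled = np.sum(w * est) / np.sum(w)   # fixed-effect pooled estimate
Q = np.sum(w * (est - pooled)**2)      # Cochran's heterogeneity statistic
df = len(est) - 1
I2 = max(0.0, (Q - df) / Q)            # share of variance due to heterogeneity

print(f"Q = {Q:.1f}, I^2 = {I2:.2f}")
# I^2 substantially above 0 suggests the randomization SEs were too small
```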