Abstract
Previous researchers have proposed the a priori procedure, whereby the researcher specifies, prior to data collection, how closely she wishes the sample means to approach corresponding population means, and the degree of confidence of meeting the specification. However, an important limitation of previous research is that researchers sometimes are interested in differences between means, rather than in the means themselves. To address this limitation, we propose additional equations that expand the a priori procedure to handle differences between means, both in matched and in independent samples. Finally, implications are discussed.
Keywords: a priori procedure, means, matched samples, independent samples
The work presented here is based on the a priori procedure (APP) introduced by Trafimow (2017) to apply to a single mean and expanded by Trafimow and MacDonald (2017) to apply to any number of means. The basic idea, as will be explained in the following section in more detail, is that the researcher specifies how close she wishes her sample means to be to their corresponding population means, and the probability she wishes to have of being within the specified closeness. APP equations provide the opportunity for the researcher to compute the sample size needed to meet closeness and probability specifications.
However, given the recency of APP publications, it is unavoidable that there are limitations. One such limitation is that researchers often are interested in differences between means rather than in the means themselves, and the articles by Trafimow (2017) and Trafimow and MacDonald (2017) do not treat differences in means. The present goal is to address this limitation. Like the cited works, the equations to be presented will assume random and independent sampling from normal distributions. However, the equations to be presented will focus on differences between means for both matched and independent samples, thereby resolving an important APP limitation.
Introduction to the APP
To understand the equations to be presented, it is important to have a thorough understanding of the APP and its accompanying philosophy. For the sake of clarity and ease of exposition, imagine the simplest possible case where a researcher has only a single group and is concerned with using the sample mean to estimate the population mean. There are two questions for the researcher to consider before collecting the data; these questions pertain to precision and confidence:
How close (precision) does the researcher desire the sample mean to be to the population mean?
With what probability (degree of confidence) does the researcher desire that the sample mean will be within the stipulated distance of the population mean?
Given researcher specifications for precision and confidence for these two questions, an APP equation gives the sample size needed to meet them.
For example, suppose the researcher wishes to obtain a sample mean within 1/10th of a standard deviation of the population mean, and wishes to be 95% confident that the sample mean to be obtained will meet the precision specification. Letting $n$ denote the sample size, $f$ denote the fraction of a standard deviation the researcher defines as “close,” and $z_c$ denote the z score corresponding to the degree of confidence the researcher wishes to have, Equation 1 can be used to determine the necessary sample size (Trafimow, 2017):
$$n = \left(\frac{z_c}{f}\right)^2 \tag{1}$$
It is easy to instantiate the example specifications into Equation 1. Because our hypothetical researcher had specified a desire to have a closeness of 1/10th of a standard deviation, and because the z score that corresponds to a confidence level of 95% is 1.96, these values imply the following: $n = (1.96/.1)^2 = 384.16$ (see Note 1). Rounding up to the nearest whole number implies that the researcher needs to collect data from 385 participants to meet specifications for closeness and confidence.
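The arithmetic of this example can be checked with a few lines of code. The following is a minimal sketch, assuming Equation 1 has the form n = (z / f)^2 with a two-tailed z score; the function name `app_sample_size` is hypothetical and is not taken from the authors' Appendix C.

```python
import math
from statistics import NormalDist


def app_sample_size(f, confidence=0.95):
    """Smallest whole n satisfying n >= (z_c / f)**2 (Equation 1, rounded up)."""
    # Two-tailed z score: half of the leftover probability sits in each tail.
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_c / f) ** 2)


# Precision of 1/10th of a standard deviation at 95% confidence.
print(app_sample_size(0.1))
```

Under these assumptions the call prints 385, matching the hand computation.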
As stated earlier, Trafimow and MacDonald (2017) expanded Equation 1 to include as many means as the researcher wishes to have; but a limitation is that no equations exist yet for addressing differences in means. This limitation will be addressed subsequently. First, however, the following two subsections will address common misperceptions that researchers sometimes have about the APP.
Is the APP Merely Another Way to Perform Power Analysis?
Our answer is in the negative. To see why, remember that the goal of power analysis is to determine the sample size needed to meet a p-value threshold (e.g., p < .05), whereas the goal of the APP is to determine the sample size needed to reach specifications for precision and confidence. Thus, power analysis is influenced by the anticipated effect size, whereas the APP is not. Moreover, the APP is influenced by the desired level of precision, whereas power analysis is not. Thus, if the expected effect size is large, power analysis will indicate that a small sample size is sufficient to give the researcher a good chance to meet a significance threshold; but the APP will indicate that the researcher will nevertheless have very imprecise estimates of population means. Alternatively, suppose the expected effect size is extremely small. A power analysis would indicate that even an impressive sample size is insufficient to meet a significance threshold with an impressive probability; but the APP nevertheless would indicate that the researcher likely will have very precise estimates of population means. Trafimow and Myüz (2018) provide a detailed description, with accompanying demonstrations, of the important ways that the APP and power analysis differ.
It is possible that a researcher might wish to achieve the goals of both the APP and power analysis. That is, the researcher might wish to have (a) strong precision and confidence and (b) a respectable probability of obtaining a statistically significant finding. A conservative route to fulfilling both goals is to perform the APP and power analysis to see which one results in the largest required sample size. The conservative researcher would then collect a sample of data that meets or exceeds the largest required sample size.
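The conservative route can be illustrated with a short sketch. It assumes Equation 1 for the APP side and, for the power side, the familiar normal-approximation formula for a two-sided one-sample z test, n = ((z_alpha + z_power) / d)^2; that power formula, the effect size `d`, and all function names are assumptions introduced here for illustration and do not come from the article.

```python
import math
from statistics import NormalDist

_Z = NormalDist()


def app_n(f, confidence=0.95):
    """APP sample size from Equation 1, rounded up."""
    z_c = _Z.inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_c / f) ** 2)


def power_n(d, alpha=0.05, power=0.80):
    """Approximate n for a two-sided one-sample z test (assumed formula)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / d) ** 2)


def conservative_n(f, d, confidence=0.95, alpha=0.05, power=0.80):
    """Take whichever requirement is larger, as the conservative route suggests."""
    return max(app_n(f, confidence), power_n(d, alpha, power))


# Large expected effect: power analysis asks for few participants,
# but the precision requirement dominates.
print(power_n(0.8), app_n(0.1), conservative_n(0.1, 0.8))
# Tiny expected effect with a loose precision requirement: the reverse.
print(power_n(0.05), app_n(0.4), conservative_n(0.4, 0.05))
```

The two calls show the asymmetry described above: the power-based requirement moves with the anticipated effect size, whereas the APP requirement moves only with the precision and confidence specifications.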
Is the APP Incompatible With Typical Inferential Approaches?
As in the previous subsection, the answer is in the negative. There is nothing in the APP to prevent researchers from using p values, confidence intervals, Bayesian procedures, and so on, once the data have been collected. However, it could be argued that the APP removes the necessity for such procedures, depending on the researcher’s goal. If the goal is to obtain sample statistics that precisely estimate corresponding population parameters, then the APP suffices; no postdata inferential approaches are necessary. In contrast, if the researcher wishes to test hypotheses, obtain confidence intervals, or obtain a Bayesian estimate of the probability of the hypothesis, postdata inferential procedures are necessary. A caveat is that there has been much debate in the literature, and at the recent Symposium on Statistical Inference by the American Statistical Association (October 2017), about the extent to which different postdata inferential procedures succeed at accomplishing their goals. For present purposes, it is not necessary to engage this debate.
Difference in Means for Matched Samples
One way for researchers to reduce the requirement for large sample sizes is to perform experiments with matched samples, where each participant undergoes both the experimental and control treatments. For example, in a social psychology experiment on prejudicial attitudes, the researcher might ask participants to indicate attitudes toward both Black and White targets. The contrast of interest would be the mean of the Black target versus the mean of the White target. Or in an education context, the researcher might try out typical versus novel procedures on the same participants, with the goal being to contrast the means of the two conditions to determine if the novel educational procedure is superior to the typical one. How can the researcher determine the sample size needed to meet specifications for precision and confidence when the contrast of interest is between means from matched samples?
The Population Standard Deviation Is Known
If the population standard deviation is known, Equation 1 suffices. Thus, taking the social psychology experiment on prejudicial attitudes again, suppose the researcher wishes to be 95% confident that the difference in sample means between the Black and White conditions is within 1/10th of a standard deviation of the difference in population means between these two conditions. In that case, a computation akin to the one made earlier suffices: $n = (1.96/.1)^2 = 384.16$, which rounds up to 385. The requirement of 385 participants may seem unreasonable, but remember that this is under what Trafimow (2018) termed “extreme precision.” A reduced precision requirement dramatically reduces the necessary sample size. For example, if the researcher is willing to settle for what Trafimow (2018) termed “poor precision” of 4/10ths, the necessary sample size falls to $n = (1.96/.4)^2 = 24.01$, or 25 participants after rounding up. Figure 1 illustrates how strongly the precision requirement influences the necessary sample size under 95% confidence or 90% confidence.
Figure 1.
The necessary sample size to meet specifications is expressed along the vertical axis as a function of the precision specification along the horizontal axis, at 95% confidence (solid curve) or 90% confidence (dashed curve); under the assumption of a known population standard deviation for matched samples.
The Population Standard Deviation Is Not Known
When the population standard deviation is not known, it is necessary to use the t distribution, as Equation 2 indicates. Appendix A (available online) provides a derivation of Equation A1, which is copied below as Equation 2, and Appendix C (available online) provides R code:
$$t_{\alpha/2,\,n-1} \le f\sqrt{n} \tag{2}$$
where $t_{\alpha/2,\,n-1}$ is the critical t score for confidence level $1-\alpha$ with $n-1$ degrees of freedom, analogous to the use of the z score from Equation 1. Unfortunately, unlike Equation 1, Equation 2 must be solved numerically. For example, suppose that a researcher wishes to specify $f = .2$ at 95% confidence, with unknown standard deviations for matched samples. The researcher might try $n = 99$, so the right side of Equation 2 is 1.9899, which satisfies Equation 2. In contrast, the researcher might try $n = 98$, so the right side of Equation 2 is 1.9799, which does not satisfy Equation 2 because the t value for $n = 98$ is 1.9847. Thus, because $n = 99$ satisfies Equation 2, whereas $n = 98$ does not, the minimum sample size necessary to meet specifications for precision and confidence is $n = 99$. Equation 2 is best handled with a computer that is programmed to try different values until convergence on the smallest sample size that fulfils requirements (see Appendix C for R code).
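The trial-and-error search can be sketched in code. This is not the authors' R code from Appendix C; it is a minimal Python sketch that assumes Equation 2 has the form t(alpha/2, n - 1) <= f * sqrt(n). To stay within the standard library, it avoids a t quantile function: the condition is equivalent to requiring that the t cumulative distribution function with n - 1 degrees of freedom, evaluated at f * sqrt(n), be at least 1 - alpha/2. The names `t_cdf` and `app_n_matched` are hypothetical.

```python
import math


def t_cdf(x, df, steps=400):
    """Student t CDF at x >= 0, via Simpson's rule on the density (steps is even)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # The density is symmetric about zero, so half the mass lies below 0.
    return 0.5 + total * h / 3


def app_n_matched(f, confidence=0.95, n_max=100_000):
    """Smallest n for which f * sqrt(n) reaches the critical t with n - 1 df."""
    target = 1 - (1 - confidence) / 2
    for n in range(2, n_max + 1):
        if t_cdf(f * math.sqrt(n), n - 1) >= target:
            return n
    raise ValueError("no sample size up to n_max meets the specification")


print(app_n_matched(0.2))  # precision of 2/10ths at 95% confidence
print(app_n_matched(0.1))  # precision of 1/10th at 95% confidence
```

Searching upward from n = 2 guarantees that the first value returned is the minimum, at the cost of a few hundred evaluations for strict precision requirements.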
Figure 2 presents the sample sizes necessary to meet the requirements for a variety of precision and confidence values. The necessity to use the t distribution in Figure 2 does not make much of a difference relative to the use of the z distribution for Figure 1. In both cases, high precision requires a large sample size, with sample size requirements decreasing rapidly as precision decreases (see Note 2). For example, when $f = .1$ and confidence is 95%, the required sample sizes are 385 and 387, respectively, depending on whether the population standard deviation is known or unknown. More generally, Figures 1 and 2 are almost identical. As will become clear presently, matters change dramatically when a researcher is interested in a difference in means across independent samples.
Figure 2.
The necessary sample size to meet specifications is expressed along the vertical axis as a function of the precision specification along the horizontal axis, at 95% confidence (solid curve) or 90% confidence (dashed curve); under the assumption of unknown population standard deviations for matched samples.
To aid in understanding when to use Equation 2, consider the example of an education researcher who wishes to test the efficacy of a new teaching technique designed to increase high school students’ knowledge of world history. The researcher does not have enough resources to test different teaching techniques against each other, but does have enough resources to assess performance on a test with 100 questions before and after students are exposed to the teaching technique of interest. Because it is obvious that students will perform better after obtaining relevant world history information, the researcher is not interested in demonstrating an effect. Rather, the researcher is interested in finding out the size of the effect; that is, the difference between pretest and posttest in mean number of questions students answer correctly. Because the size of the difference between pretest and posttest is critical, the researcher wishes to have excellent precision and decides on $f = .1$. Based on Equation 2, the minimum sample size that renders the equation true is $n = 387$; thus, the researcher endeavors to collect data from at least 387 participants.
Independent Samples
Commencing with the unlikely case where the standard deviations are known and equal, and the sample sizes are equal, Equation 1 can be used to calculate the sample sizes necessary to meet specifications, but with a caveat. The caveat is that in the independent samples context, $n$ refers to each of two independent groups, so the total sample size needed is $2n$ rather than simply $n$ as in the foregoing section. Let us now move to a more realistic case where the standard deviations may be unknown and where it is not necessary to assume equal sample sizes.
When a researcher has independent samples, as opposed to matched ones, there is no guarantee that the sample sizes will be equal. It is convenient to designate that there are $n$ participants in the smaller group and $m$ participants in the larger group, where $n \le m$. Appendix B (available online) provides the derivation of Equation B2, which is copied below as Equation 3, and Appendix C provides R code:
$$t_{\alpha/2,\,q} \le f\sqrt{\frac{nm}{n+m}} \tag{3}$$
where $t_{\alpha/2,\,q}$ is the critical t score that corresponds to confidence level $1-\alpha$ and degrees of freedom $q = n + m - 2$, in which $m$ is rounded to the nearest upper integer.
If the researcher has equal sample sizes, Equation 3 reduces to Equation 4:
$$t_{\alpha/2,\,2n-2} \le f\sqrt{\frac{n}{2}} \tag{4}$$
Figure 3 presents sample sizes necessary to meet specifications for a variety of levels of precision and confidence. In strong contrast to Figures 1 and 2, an interest in the difference between independent means forces a much larger sample size requirement. For example, when $f = .1$ at 95% confidence, $n = 770$, as opposed to $n = 387$ when there are matched samples. And matters arguably are even worse than this. Consider that with independent samples, the total sample size is $2n$ = 1,540, which is much larger than the total sample size of 387 in the matched samples case.
Figure 3.
The necessary sample size to meet specifications is expressed along the vertical axis as a function of the precision specification along the horizontal axis, at 95% confidence (solid curve) or 90% confidence (dashed curve); under the assumption of equal sample sizes for independent samples.
Figure 4 allows the sample sizes in the two conditions to differ, so that $n < m$. The sample size of the smaller group is represented along the vertical axis.
Figure 4.
The necessary sample size (for the smaller group) to meet specifications is expressed along the vertical axis as a function of the precision specification along the horizontal axis, at 95% confidence (solid curve) or 90% confidence (dashed curve); under the assumption of unequal sample sizes for independent samples, where $n < m$.
To aid the researcher in understanding when Equation 3 (or Equation 4) is more useful than Equation 2, let us reconsider the world history example discussed earlier in the context of Equation 2. In the previous example, the researcher did not have enough resources to test the teaching technique of interest against an alternative teaching technique and had to settle for a pretest–posttest design. But let us now imagine that the researcher has the resources to randomly assign students to use the old versus the new teaching techniques. The issue is no longer about the difference in performance between pretest and posttest, but rather about the difference in posttest performances between the old and new conditions. In addition, let us suppose that the researcher is not merely interested in demonstrating a difference, but wishes to know the size of the difference to justify the extra cost involved in switching over to the new teaching technique. Strong precision is desired, so the researcher sets $f = .1$. Because the researcher has the power of random assignment, the sample sizes can be expected to be approximately equal, thereby justifying the use of Equation 4 over Equation 3. Solving the inequality with the least possible sample size implies that the researcher will need to collect data from at least 770 participants in each condition, for a total of 1,540 participants.
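The independent-samples search can be sketched in the same way as the matched-samples one. This minimal Python sketch assumes the inequality takes the form t(alpha/2, q) <= f * sqrt(n * m / (n + m)) with q = n + m - 2, which reduces to f * sqrt(n / 2) when the groups are equal. The `ratio` parameter (larger group size divided by smaller group size) and the function names are hypothetical, and the code is not the authors' Appendix C.

```python
import math


def t_cdf(x, df, steps=400):
    """Student t CDF at x >= 0, via Simpson's rule on the density (steps is even)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3


def app_n_independent(f, confidence=0.95, ratio=1.0, n_max=100_000):
    """Smallest size n of the smaller group; the larger group has ceil(ratio * n)."""
    target = 1 - (1 - confidence) / 2
    for n in range(2, n_max + 1):
        m = math.ceil(ratio * n)      # larger group, rounded up to an integer
        q = n + m - 2                 # degrees of freedom
        bound = f * math.sqrt(n * m / (n + m))
        # t_cdf(bound) >= target is equivalent to critical t <= bound.
        if t_cdf(bound, q) >= target:
            return n
    raise ValueError("no sample size up to n_max meets the specification")


n_per_group = app_n_independent(0.1)   # equal groups, 95% confidence
print(n_per_group, 2 * n_per_group)
```

With equal groups and a precision of 1/10th of a standard deviation, the sketch returns 770 per group, or 1,540 in total, in line with the example above.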
Discussion
Our main goal was to provide a way for researchers to perform the APP with respect to differences between means, rather than with respect to only the means themselves. The present equations accomplish this, both for matched samples and for independent samples. In addition, although Figures 1 and 2 demonstrate that necessary sample sizes to meet precision and confidence specifications do not increase much for matched samples, necessary sample sizes may increase dramatically for independent samples. This fact can be considered as an argument in favor of matched samples when feasible or when not ruled out by a concern for order effects.
Why Care About Precision?
Given the pessimistic values in Figures 3 and 4 for independent samples, one might question the need to care about precision. If the researcher need not care about precision, then she is not saddled with a requirement for large sample sizes to obtain that precision. To respond to that potential argument, it is worthwhile to consider the reason for performing the experiment in the first place. Let us consider the issue both from applied and basic research perspectives.
From an applied perspective, the difference in means to be obtained will serve an applied goal. For example, in a comparison of two educational techniques, the superiority of a novel technique over the present one may provide a reason to switch. This depends both on just how superior the novel technique is and whether that degree of superiority is worth the costs of switching (and possibly other costs too). But if the precision is low, then the difference obtained in the study cannot be trusted, and there is a lack of a solid basis for making the decision.
From a basic research perspective, the difference in means to be obtained is in the interest of supporting or disconfirming a theory; or testing alternative theories against each other. Although there are many philosophical orientations, such as falsification (Popper, 1959), verification (Hempel, 1965), abduction (Haig, 2014), and others, that contradict each other with respect to emphasis placed on theory versus data, order of consideration of theory and data, and others, they agree in one extremely important respect. That is, all respectable scientific philosophies agree that at some point theory must be checked against data. This stricture renders immediately desirable that the data be understood as precisely as possible. Thus, for example, if a researcher wishes to use the difference between means to test a theoretical prediction, it is desirable to know that difference as precisely as possible. More generally, from either an applied or basic research perspective, it is difficult to dodge the necessity to have precise estimates.
The Connection Between Precision and Replication
Yet another reason for researchers to attend carefully to precision is its relevance to replication, which has taken on a greater than normal concern as of late due to increased questioning about whether there is a replication crisis in psychology (see Trafimow, 2018, for a review). Trafimow (2018) provided a detailed argument, with equations that relate replication to precision; but the gist of the case can be provided quickly. APP equations can be made to show probabilities of reaching given levels of precision under given sample sizes. Once this has been done, it is possible to imagine a replication experiment that is the same as the original experiment with the exception that random processes will differ. Of course, in real science, there will be systematic differences too. But at least in the ideal case, where there are no systematic differences between experiments, it is possible to obtain an idealized probability of replication merely by squaring the probability obtained with respect to the original experiment, to account for the replication too. If the idealized probability of replication is low, and Trafimow (2018) showed that it almost always is low, then the researcher can be assured that the probability of replication in the real scientific universe, where there are systematic differences as well as random ones, will be even lower. Of course, all of this depends on the degree of precision on which the researcher insists; the more stringent the precision requirement, the lower the probability of replication (see Note 3). On the other hand, the greater the sample sizes in the two experiments, and hence the greater the obtained precision, the greater the probability of replication. Hence, there is a strong connection between precision and replication. Any researcher who cares about replication is forced, by mathematical necessity, to care about precision as well.
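The squaring argument can be made concrete with a small sketch. It assumes the known-standard-deviation case, where Equation 1 can be inverted so that the probability of landing within f standard deviations with sample size n is 2 * Phi(f * sqrt(n)) - 1, with Phi the standard normal cumulative distribution function. The function names are hypothetical, and the sketch is not the full treatment in Trafimow (2018).

```python
import math
from statistics import NormalDist


def prob_within(f, n):
    """Probability that the sample mean falls within f SDs of the population mean."""
    return 2 * NormalDist().cdf(f * math.sqrt(n)) - 1


def idealized_replication_prob(f, n):
    """Both the original and the replication must meet the precision requirement."""
    return prob_within(f, n) ** 2


for n in (25, 100, 385):
    p = prob_within(0.1, n)
    print(n, round(p, 3), round(idealized_replication_prob(0.1, n), 3))
```

For a fixed precision requirement, both probabilities rise with the sample size, which is the connection between precision and replication described above.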
Skewness
The new equations address one limitation—the issue of differences between means as opposed to the means themselves—but are nevertheless subject to an old and familiar limitation: that is, the assumption of normally distributed populations. Suppose one or more of the distributions of interest are skewed. How much of a problem is that for the present equations?
One way to consider the question is to employ the family of skew-normal distributions, of which the family of normal distributions is a subset. Unlike normal distributions, which have two parameters, skew-normal distributions have three parameters. In skew-normal distributions, the mean $\mu$ is replaced by the location parameter $\xi$, the standard deviation $\sigma$ is replaced by the scale parameter $\omega$, and there is an additional shape parameter $\lambda$. When the shape parameter is zero, the distribution is normal, and the location parameter equals the mean and the scale parameter equals the standard deviation. But when the shape parameter is not equal to zero, the distribution is skew-normal, and the location parameter is not equal to the mean and the scale parameter is not equal to the standard deviation. Reiterating in symbols, when $\lambda = 0$, $\xi = \mu$ and $\omega = \sigma$. But when $\lambda \ne 0$, $\xi \ne \mu$ and $\omega \ne \sigma$. Thus, the family of skew-normal distributions can be considered a generalization of the family of normal distributions.
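The relation between the two parameterizations can be illustrated numerically. The sketch below uses the standard skew-normal moment formulas, mean = location + scale * delta * sqrt(2 / pi) and SD = scale * sqrt(1 - 2 * delta^2 / pi), where delta = shape / sqrt(1 + shape^2). The argument names `xi`, `omega`, and `lam` are labels chosen here for the location, scale, and shape parameters.

```python
import math


def skew_normal_mean_sd(xi, omega, lam):
    """Mean and standard deviation of a skew-normal distribution."""
    delta = lam / math.sqrt(1 + lam ** 2)
    mean = xi + omega * delta * math.sqrt(2 / math.pi)
    sd = omega * math.sqrt(1 - 2 * delta ** 2 / math.pi)
    return mean, sd


print(skew_normal_mean_sd(0.0, 1.0, 0.0))  # shape 0: mean = location, SD = scale
print(skew_normal_mean_sd(0.0, 1.0, 5.0))  # shape 5: mean shifts, SD shrinks
```

With a shape of zero, the function returns the location and scale unchanged; with a nonzero shape, the mean moves away from the location and the standard deviation falls below the scale.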
The obvious way to address the issue of skewness, then, is to derive equations analogous to those derived above, but with respect to differences in locations as opposed to differences in means. The authors are currently working on that solution. In the meantime, part of the solution already has been published in Educational and Psychological Measurement (Trafimow, Wang, & Wang, 2019). That is, Trafimow et al. (2019) showed how to apply the APP to a single location parameter of a skew-normal distribution. The surprising finding from that research is that skewness aids, rather than detracts from, precision. Put another way, the more skewed the distribution, the smaller the sample size needed to meet any arbitrarily designated level of precision. Although Trafimow et al. (2019) provided a complex mathematical demonstration, the reason can be stated qualitatively. Skew-normal distributions are narrower and taller than are normal distributions, thereby resulting in a precision advantage for skew-normal as opposed to normal distributions. And the greater the absolute magnitude of the skew, the greater the precision advantage.
Well, then, let us return to the issue of the researcher who assumes normality when there is skewness. The good news for this researcher is that if she has satisfied specifications for precision and confidence based on APP normal equations, those specifications also will be satisfied if there is unknown skewness. In fact, the researcher will have extra precision, though the extent of that extra precision is unknown. The bad news for the researcher is that by assuming normality when there is skewness, sample size calculations will be too large, thereby resulting in extra effort on the part of the researcher to obtain the seemingly required sample. Work is in progress to quantify the extent of these effects.
Supplemental Limitations
There are two last limitations to address briefly. First, there is the issue of heterogeneity of variances and whether this influences APP calculations. As can be seen in Appendices A through C, we assumed equal variances. To know what happens when this assumption is violated, it is necessary to develop new APP equations assuming unequal variances. This work is in progress, using not only the family of normal distributions but also the family of skew-normal distributions. In the latter case, of course, it is better to use squared scales rather than variances (see foregoing subsection).
A second issue is that the researcher might have multiple groups. Previous APP work partially addresses this issue. That is, Trafimow and MacDonald (2017) showed how to perform APP calculations pertaining to means for any number of groups. The main limitation of this work, however, is that APP calculations for the means themselves, even for multiple groups, are not the same thing as calculations for differences between means involving multiple groups. For example, suppose that there are two control groups and a single experimental group. The researcher is interested in estimating the necessary sample size to meet specifications for precision and confidence when comparing the experimental group with both control groups simultaneously. As of now, there are no APP equations to serve this researcher’s interest, though the present equations can be used for sequential comparisons. Work is in progress to facilitate simultaneous comparisons involving multiple means (normal distributions) or multiple locations (skew-normal distributions).
Conclusion
The present work is part of a larger APP mosaic that is in its early stages of development. The original work pertained to single means obtained in single groups (Trafimow, 2017), whereas a following work pertained to means obtained from any number of groups (Trafimow & MacDonald, 2017). But as we pointed out earlier, researchers often are concerned with differences between means, as opposed to the means themselves. The present work shows how researchers can perform the APP with respect to differences between means, thereby providing an important piece of the mosaic.
But the present advance should not obscure the possibility that researchers can be interested simultaneously in differences between means and in the means themselves. For example, consider two methods for measuring educational attainment. Although it is of obvious interest whether one method is better than the other, and by how much, the independent goodness of the methods matters too. For example, suppose one method is better than the other, but still not very good when considered alone. The rigorous researcher might be disposed to find yet a third method. In the process of coming to this realization, although the present APP equations are crucial for determining the difference between the two methods with precision, previous APP equations concerned only with the individual means are also relevant. More generally, our argument is not that one set of APP equations should be preferred over another set, nor even that the APP should be preferred over other inferential procedures, but rather that there are many ways for researchers to think, with associated procedures, and that bettering the versatility of the statistical toolbox is worthwhile.
Supplemental Material
Supplemental material, Online_Appendix for Making the A Priori Procedure Work for Differences Between Means by David Trafimow, Cong Wang and Tonghui Wang in Educational and Psychological Measurement
Note 1. The 1.96 figure assumes that both tails of the z distribution have areas of .025, to correspond with 95% confidence.
Note 2. Remember that because Equation 1 works for matched samples with a known standard deviation and a single sample, Figure 1 also is representative of both cases.
Note 3. This should not be an excuse for demanding lower precision. Rather, it should be a reason for researchers to face up to the choice that they either (a) demand high precision and collect appropriate sample sizes or (b) admit the difficulty, with appropriate modesty, of drawing precise conclusions from their data.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: David Trafimow https://orcid.org/0000-0002-0788-8044
Supplemental Material: Supplemental material for this article is available online.
References
- Haig B. D. (2014). Investigating the psychological world: Scientific method in the behavioral sciences. London, England: MIT Press. ISBN-13: 978-0262027366
- Hempel C. G. (1965). Aspects of scientific explanation and other essays in the philosophy of science. New York, NY: Free Press. ISBN-13: 978-0029143407
- Popper K. R. (1959). The logic of scientific discovery. New York, NY: Basic Books.
- Trafimow D. (2017). Using the coefficient of confidence to make the philosophical switch from a posteriori to a priori inferential statistics. Educational and Psychological Measurement, 77, 831-854. doi: 10.1177/0013164416667977
- Trafimow D. (2018). An a priori solution to the replication crisis. Philosophical Psychology, 31, 1188-1214. doi: 10.1080/09515089.2018.1490707
- Trafimow D., MacDonald J. A. (2017). Performing inferential statistics prior to data collection. Educational and Psychological Measurement, 77, 204-219. doi: 10.1177/0013164416659745
- Trafimow D., Myüz H. A. (2018). The sampling precision of research in five major areas of psychology. Behavior Research Methods, 51(5), 2039-2058. doi: 10.3758/s13428-018-1173-x
- Trafimow D., Wang T., Wang C. (2019). From a sampling precision perspective, skewness is a friend and not an enemy! Educational and Psychological Measurement, 79, 129-150. doi: 10.1177/0013164418764801