Abstract
The discussion on the use and misuse of p-values in 2016 by the American Statistical Association was a timely assertion that statistical concepts should be properly used in science. Some researchers, especially economists, who adopt significance testing and p-values to report their results, may feel confused by the statement, leading to misinterpretations of it. In this study, we aim to re-examine the accuracy of the p-value and introduce alternative ways of testing hypotheses. We conduct a simulation study to investigate the reliability of the p-value. Apart from investigating the performance of the p-value, we also introduce some existing approaches, Minimum Bayes Factors and belief functions, to replace it. Results from the simulation study confirm that the p-value is unreliable in some cases and that the proposed approaches appear useful as substitute tools in statistical inference. Moreover, our results show that the plausibility approach is more accurate for making decisions about the null hypothesis than the traditionally used p-values when the null hypothesis is true. However, the MBFs of Edwards et al. [Bayesian statistical inference for psychological research. Psychol. Rev. 70(3) (1963), pp. 193–242], Vovk [A logic of probability, with application to the foundations of statistics. J. Royal Statistical Soc. Series B (Methodological) 55 (1993), pp. 317–351] and Sellke et al. [Calibration of p values for testing precise null hypotheses. Am. Stat. 55(1) (2001), pp. 62–71] provide more reliable results compared to all other methods when the null hypothesis is false.
KEYWORDS: Ban of P-value, Minimum Bayes Factors, belief functions
1. Introduction
Our work has been inspired by the 2016 statement on statistical significance and p-values of the American Statistical Association (ASA) [23]. The statement asserts that the p-value does not provide a good measure of evidence regarding a model or hypothesis, and that the validity or significance of a hypothesis should not be judged only by whether its p-value passes a specific threshold, for example, 0.10, 0.05, or 0.01. This indicates that, in many scientific disciplines, the use of p-values to decide tests of hypotheses may have led to a large number of wrong discoveries. Some researchers, especially economists, who adopt significance testing and p-values to report their results, may feel confused by the statement, leading to misinterpretations of it. Econometric models and statistical tests have been used intensively by economists for interpreting causal effects, model selection, and forecasting. The question is how to test and make an inference without p-values.
The discussion of this issue is not new. The critiques started with Berkson [2], Rozeboom [16], and Cohen [3]; for a review of the studies advocating a ban on p-values, see Kline [13]. The motivation for banning p-values is a concern with the logic that underlies significance testing and the p-value. One of the most prominent problems is that many researchers misunderstand the p-value as the probability that the null hypothesis is true. Indeed, a p-value does not have any meaning in this regard [14,7]. The misconceptions about the interpretation of the p-value are explained in the work of Assaf and Tsionas [1]. They provided a simple explanation of the problem in making an inference from a p-value: for example, if the p-value is less than 0.05, we have enough evidence to reject the null hypothesis and accept the claim. By this convention in the regression framework, we must reject the null hypothesis \(H_0: \beta = 0\). While this is fine, the interpretation can be misleading, as the p-value is only the probability of the observed results, regardless of the value of \(\beta\). Intuitively, we make the same interpretation of \(\beta\) over the whole range of p-values less than 0.05. This indicates that the p-value provides only indirect evidence about the null hypothesis, as the parameter is treated as fixed for all p-values less than 0.05. In addition, it is known that the p-value may overstate the evidence against the null [7,12,21,1]. Another problem of using the p-value is its high dependence on the sample size: a smaller sample size tends to yield a higher p-value and vice versa; see Rouder et al. [17]. Thus, if we do not have a large enough sample size, the interpretation might be wrong, as it is difficult to obtain an accurate testing result, especially in the case where the null hypothesis must be accepted (the null hypothesis is true). Also, it is dangerous to allow only binary decisions (i.e. whether to reject or accept the null hypothesis); it is this extreme binary view that, in our opinion, has caused various problems in decision making. Stern [21] mentioned that a non-significant p-value indicates that the data could easily be observed under the null hypothesis. However, the data could also be observed under a range of alternative hypotheses. Thus, it is overconfident to make the decision based on this binary approach, and doing so may contribute to a misunderstanding of the test results.
Obviously, the p-value is currently misinterpreted, overtrusted, and misused in research reports, which puts us in a difficult situation when testing a null hypothesis. However, our discussion should not be taken as a statement that researchers and practitioners should completely avoid p-values. Rather, we should investigate the misconceptions about p-values and find alternative methods that have better statistical interpretations and properties. Fortunately, there are alternatives to the p-value; we refer here to its Bayesian counterparts, the Bayes factor method and the plausibility method. Our suggested methods are similar to some of the suggestions in The American Statistician in 2019 [24], which further discussed the problems of p-values and proposed new guidelines for supplementing or replacing them, including second-generation p-values, confidence intervals, false-positive risk, and Bayes factor methods.
We can start by discussing the Bayes factor, which has been widely accepted as a valuable alternative to the p-value approach in recent years (see [21,15,10,1]). Page and Satake [15] noted that there are two main differences between p-values and Bayes factors. First, the calculation of the p-value involves both the observed data and 'more extreme' hypothetical outcomes, whereas the Bayes factor can be obtained from the observed data alone. Note that, in a Bayesian approach, the information from the observed data is normally combined with a prior for the parameter of interest; this is the point that sparks much of the debate regarding Bayesian methods, because the selection of the prior may have a large impact on the posterior distribution and the conclusions. However, the Bayes factor can be obtained from the observed data alone by assuming a uniform prior on the parameter of interest. Second, a p-value is computed in relation to only the null hypothesis, whereas the Bayes factor considers both the null and alternative hypotheses. Many researchers have confirmed that the Bayes factor is more suitable for the problem of comparing hypotheses, as it provides a natural framework for integrating a variety of sources of information about a quantity of interest. In other words, the statistical test based on this method relies on combining the information from the observed data with the prior information. Generally speaking, the prior information from the researcher is combined with the likelihood of the data to obtain the posterior distribution used to construct the Bayes factor. This posterior distribution explicitly addresses which values of the parameters are most plausible. The Bayes factor becomes a measure of evidence regarding the two conflicting hypotheses and can therefore investigate whether the parameter of interest is equal to a specific value or not, say \(H_0: \beta = \beta_0\) against \(H_1: \beta \neq \beta_0\), in the regression context. Thus, in practice, the Bayes factor can be computed as the ratio of the posterior probabilities of the null and the alternative hypothesis. Held and Ott [9] confirmed that the Bayes factor can be considered an alternative to the p-value for testing hypotheses and for quantifying the degree to which observed data support or conflict with a hypothesis (this approach is discussed further below); additional information can be obtained from Stern [21]. However, in this study, we focus on the evidence against a simple null hypothesis provided by the Minimum Bayes factor (MBF) approach. In other words, this approach transforms the p-value into a Bayes factor to allow a new interpretation of the observed data (see [9,10]). In this context, the MBF is oriented like the p-value, in that smaller values provide stronger evidence against the null hypothesis [8].
Another approach considered in this study is the plausibility-based belief function (plausibility method), as proposed by Kanjanatarakul, Denœux, and Sriboonchitta [11]. This method is an extension of the MBF concept: while the MBF approach transforms the p-value (or t-statistic) to obtain the MBF, the plausibility-based belief function transforms the parameter value \(\beta\) into a plausibility, which lies in a range similar to that of the p-value. The discussion of this method as an alternative to the p-value for testing hypotheses is quite limited; we thus attempt to fill this gap in the literature and suggest it as another alternative method for testing hypotheses. The method allows us to obtain the plausibility of each parameter value in the range under consideration. For example, if we want to know whether \(\beta = \beta_0\), we can find the plausibility \(pl(\beta_0)\). Thus, if we want to test whether \(\beta = 0\), we measure the plausibility \(pl(0)\), and if \(pl(0)\) is less than 0.05, we reject this null hypothesis. We can see that the p-value and the plausibility seem to provide similar information. However, Kanjanatarakul, Denœux, and Sriboonchitta [11] mentioned that these two measures have completely different interpretations. The p-value is a probability under the null hypothesis, based on the assumption of repeated sampling, and its computation takes into account values of the t-statistic larger than the specific critical values corresponding to p-values of 0.10, 0.05, and 0.01. In contrast, the assertion \(pl(\beta_0) = \alpha\) indicates that there is a parameter value \(\beta_0\) whose likelihood conditional on the data is \(\alpha\) times the maximum likelihood. The closer \(pl(\beta_0)\) is to zero, the less plausible the value \(\beta_0\). In this way we can obtain \(pl(\beta_0)\) for all possible values of \(\beta_0\), which means that the value of the plausibility depends directly on the value of \(\beta_0\). For more explanation of the Bayesian approach and belief functions, refer to Shafer [20].
In this paper, we further explore these two methods as alternatives to the p-value, arguing that, under the same hypothesis, they provide direct probability statements about the parameter of interest and give more accurate and reliable results for inferential decision making. We conduct several experiments and illustrate the practical application of the methods using a simple regression model of the kind widely employed in research. The rest of the paper is organized as follows. We provide the background on the frequentist p-value, the Bayes factor, and the plausibility method (likelihood-based belief functions) in Section 2, followed by the experiment and a real application in Sections 3 and 4, respectively. Finally, the conclusion is provided in Section 5.
2. Review of the testing methods for accepting or rejecting the null hypothesis
2.1. The p-value
Recall that the p-value is the probability of obtaining a test statistic equal to or more extreme than the observed result under the assumption that the null hypothesis is true [21]. More precisely, it is a quantitative measure of discrepancy between the data and the point null hypothesis [7]. The basic rule for decision making is that if the p-value of the sample data is less than a specific threshold, or significance level, of 0.01, 0.05, or 0.10, the result is said to be statistically significant and the null hypothesis is rejected. In this study, the simple linear regression is considered, and it can be written as follows:
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \tag{1} \]
where \(y_i\) is the dependent variable and \(x_i\) is the independent variable, and \(\varepsilon_i\) is the error term, which is assumed to be independent and identically distributed (iid) normal. Thus, the statistical test in this study is based on this normality assumption. To investigate the impact of the variable \(x\) on \(y\), one needs to examine whether \(\beta_1\) equals zero. Hence, the hypothesis in this problem can be set as \(H_0: \beta_1 = 0\) against the alternative hypothesis \(H_1: \beta_1 \neq 0\). Let \(t_{\mathrm{obs}}\) be the observed t-statistic; then, under the traditional p-value method, the p-value is calculated as
\[ p\text{-value} = \Pr\big( |T| \geq |t_{\mathrm{obs}}| \mid H_0 \big) \tag{2} \]
\[ p\text{-value} = 2\big( 1 - \Phi(|t_{\mathrm{obs}}|) \big) \tag{3} \]
where \(t_{\mathrm{obs}}\) is the observed t-statistic for testing \(H_0: \beta_1 = 0\), computed by \(t_{\mathrm{obs}} = \hat{\beta}_1 / \widehat{\mathrm{se}}(\hat{\beta}_1)\), and \(\Phi(\cdot)\) denotes the cumulative standard normal distribution function. Inference regarding \(H_0\) is then based on the p-value. For a better understanding of the misuse and misinterpretation of the p-value, let us provide a simulation example in this regard.
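As a quick numerical illustration of Eqs. (2)–(3), the two-sided p-value can be computed directly in R (the value of the t-statistic below is ours, chosen arbitrarily):

```r
# Two-sided p-value from an observed t-statistic, as in Eq. (3).
t_obs <- 2.1                          # an arbitrary observed t-statistic
p_val <- 2 * (1 - pnorm(abs(t_obs)))  # Phi is the standard normal CDF, pnorm()
p_val                                 # 0.0357: "significant" at the 0.05 level
```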
Example 1:
As an example, we simulate a situation where \(H_0: \beta_1 = 0\) is true and \(H_0: \beta_2 = 0\) is false. Thus, we consider a linear regression model with two independent variables. The intercept \(\beta_0\) is omitted, as we are only considering the effect of the independent variables on the dependent variable; thus the model becomes
\[ y_i = \beta_1 x_{1,i} + \beta_2 x_{2,i} + \varepsilon_i \tag{4} \]
where \(\beta_1 = 0\) and \(\beta_2 = 1\). \(\varepsilon_i\) is assumed to have a normal distribution with mean zero and variance \(\sigma^2\). We simulate 1,000 data sets with sample sizes N = 20, 100, and 500 using the specification of Eq. (4) and estimate all these data sets to obtain the p-values of \(\hat{\beta}_1\) and \(\hat{\beta}_2\). This means that the parameter \(\beta_1\) is assumed to have an insignificant effect on \(y\), while \(\beta_2\) has a statistically significant effect. The results are illustrated in Figure 1.
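A minimal R sketch of this simulation is shown below; \(\sigma = 1\) and independent standard-normal regressors are our assumptions, as the text leaves both implicit, and the t-based p-values from lm() stand in for the normal approximation of Eq. (3):

```r
# Simulation in the spirit of Example 1.
set.seed(1)
sim_pvals <- function(N, beta = c(0, 1), reps = 1000) {
  replicate(reps, {
    x1 <- rnorm(N); x2 <- rnorm(N)
    y  <- beta[1] * x1 + beta[2] * x2 + rnorm(N)
    # p-values for H0: beta1 = 0 and H0: beta2 = 0 in the no-intercept model (4)
    summary(lm(y ~ x1 + x2 - 1))$coefficients[, "Pr(>|t|)"]
  })
}
pv <- sim_pvals(N = 20)
mean(pv["x1", ] < 0.05)  # rejection rate when H0 is true (close to 0.05)
mean(pv["x2", ] < 0.05)  # rejection rate when H0 is false (the power)
```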
Figure 1.
Performance of p-values when \(H_0: \beta_1 = 0\) is true (upper panel) and when \(H_0: \beta_2 = 0\) is false (lower panel).
Let us consider the lower panel of Figure 1. It is obvious that, when the null hypothesis is false and must be rejected, the p-values are almost all below the 0.01 significance level, except for a few cases at the smallest sample size (N = 20). As the bulk of the p-values is well below 0.01, there is a high probability that the null hypothesis is rejected, which enables an accurate inference. There are, however, still chances that p-values fall above the 0.01 criterion. This indicates that researchers and practitioners still have a chance of making a wrong interpretation (a type II error), especially when a small sample size is used in the study.
Then we turn our attention to the parameter \(\beta_1\), for which we know that \(H_0: \beta_1 = 0\) must be accepted. Across our 1,000 significance tests, the p-values are spread over the whole unit interval, approximately following a uniform distribution. This indicates that there is a non-negligible chance that the null hypothesis is rejected. Therefore, in this example, we can conclude that decision making based on p-values will be, more or less, arbitrary, and the conclusion imprecise.
Furthermore, we also observe that the p-values in this simulation study exhibit no dependence on the sample size when \(H_0: \beta_1 = 0\) is true and must be accepted (upper panel), but a high dependence on the sample size when \(H_0: \beta_2 = 0\) is false and must be rejected (lower panel). This illustrates that the probability of rejecting \(H_0\) when it is false (the power of the test) depends on N, whereas the probability of rejecting the null hypothesis when it is true (the type I error) does not depend on N.
2.2. Bayes factor
As a powerful tool for statistical testing, the Bayes factor is widely accepted as a valuable alternative to the p-value approach. Stern [21] mentioned that the 'Bayes factor has a significant advantage over the p-value as it can address the likelihood of the observed data for each hypothesis and thereby treating the two symmetrically'. This is to say, it is more realistic, as it provides inferences and compares hypotheses for the given data we analyze. In this study, we focus on the Bayes factor and consider a point null hypothesis \(H_0: \beta = \beta_0\) against the alternative \(H_1: \beta \neq \beta_0\), with prior probabilities \(\Pr(H_0)\) and \(\Pr(H_1)\). From the Bayesian point of view, it is possible to compute the probability of a hypothesis conditionally on observed data in terms of the posterior; an appropriate statistic for comparing hypotheses is the posterior odds:
\[ \frac{\Pr(H_0 \mid y)}{\Pr(H_1 \mid y)} = \frac{\Pr(y \mid H_0)}{\Pr(y \mid H_1)} \times \frac{\Pr(H_0)}{\Pr(H_1)} \tag{5} \]
in which the ratio \(BF_{01} = \Pr(y \mid H_0)/\Pr(y \mid H_1)\) is called the Bayes factor of \(H_0\) relative to \(H_1\). This Bayes factor can be computed formally as
\[ BF_{01} = \frac{\int f(y \mid \theta_0)\,\pi_0(\theta_0)\,d\theta_0}{\int f(y \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1} \tag{6} \]
where \(f(y \mid \theta)\) is the density function (likelihood function) and \(\pi_k(\theta_k)\) is the prior density of \(\theta_k\) under hypothesis \(H_k\), \(k = 0, 1\); the numerator and denominator of Eq. (6) are the marginal likelihoods of the data under the null and the alternative hypothesis, respectively. If both hypotheses are simple, so that the priors are point masses, the Bayes factor becomes the likelihood ratio of \(H_0\) relative to \(H_1\). This ratio is termed the Bayes factor [12]. Thus, the Bayes factor provides a direct quantitative measure of whether the data have increased or decreased the odds of \(H_0\). As we know, the Bayesian approach consists of the likelihood of the data and the prior distribution of the parameter; the problem is what the appropriate prior is. In Eq. (6), the prior densities \(\pi_0\) and \(\pi_1\) are included, but when no information about the prior distributions is available, the approach of the Minimum Bayes Factor (MBF) can be used, since the resulting values lie in the same range as p-values. One way to compute the MBF was introduced by Edwards, Lindman, and Savage [5]. They noted that the Bayes factor (Eq. (6)) is based on the observed data and suggested that the MBF can be computed easily by treating the p-value as that observed data, thus
\[ \mathrm{MBF}_p = \min_{H_1} \frac{f(p \mid H_0)}{f(p \mid H_1)} \tag{7} \]
The other option for obtaining the MBF is to back-transform the p-value to the underlying test statistic \(t_{\mathrm{obs}}\) that was used to calculate it (see Eq. (3)). Therefore, the MBF conditional on the t-statistic can be computed by
\[ \mathrm{MBF}_t = \min_{H_1} \frac{f(t_{\mathrm{obs}} \mid H_0)}{f(t_{\mathrm{obs}} \mid H_1)} \tag{8} \]
under the assumption that the map between \(p\) and \(t_{\mathrm{obs}}\) is a one-to-one transformation. If this does not hold, the p-based Bayes factor (Eq. (7)) is preferred, since it is computed directly from the p-value. In this study, we focus on the evidence against a simple null hypothesis provided by the MBF, as it is easy to compare with a p-value. To compute the MBF, we minimize the p-based or test-based Bayes factor of Eq. (7) or Eq. (8) over the class of alternatives. Thus, the MBF can be computed by
\[ \mathrm{MBF} = \min_{H_1} BF_{01} = \frac{f(y \mid \theta_0)}{\sup_{\theta} f(y \mid \theta)} \tag{9} \]
\[ \mathrm{MBF} = \frac{L(\theta_0)}{L(\hat{\theta})} \tag{10} \]
where \(L(\hat{\theta})\) is the maximum of the density function, that is, the likelihood at the optimal \(\hat{\theta}\). The minimum Bayes factor is the smallest possible Bayes factor that can be obtained for a given p-value within a certain class of distributions considered under the alternative [9].
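To make Eqs. (9)–(10) concrete, the following R sketch computes the ratio \(L(\theta_0)/L(\hat{\theta})\) for \(H_0: \beta_1 = 0\) in a two-regressor model. Profiling the nuisance parameters with a restricted lm fit is our choice here, one common way to handle them, not necessarily the construction used in the paper:

```r
# MBF as the likelihood ratio in Eq. (10) for H0: beta1 = 0, with the other
# coefficient and the error variance re-estimated (profiled) under H0.
set.seed(1)
N  <- 50
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 0 * x1 + 1 * x2 + rnorm(N)      # H0: beta1 = 0 holds in truth
full <- lm(y ~ x1 + x2 - 1)           # likelihood maximized without restriction
null <- lm(y ~ x2 - 1)                # likelihood with beta1 fixed at 0
exp(as.numeric(logLik(null) - logLik(full)))  # L(theta0) / L(theta-hat) <= 1
```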
Several methods for computing the MBF have been proposed since the pioneering work of Edwards et al. [5]. In this study, we consider four of these methods, with an emphasis on two-sided p-based and test-based Bayes factors. The formulas of these methods are provided in Table 1.
Table 1. The calculation of Minimum Bayes factors.
However, the interpretation of the MBF is still different from that of the p-value approach. The transformation of a p-value into an MBF is called calibration (but it is not just a change of scale, like converting degrees Fahrenheit to degrees Celsius); by considering the MBF, we are in a different conceptual framework. The categorization of the Bayes factor is provided in Table 2 [9].
Table 2. Interpretation of MBF.
| Minimum Bayes factor | Interpretation |
|---|---|
| 1–1/3 | Weak evidence for \(H_1\) |
| 1/3–1/10 | Moderate evidence for \(H_1\) |
| 1/10–1/30 | Substantial evidence for \(H_1\) |
| 1/30–1/100 | Strong evidence for \(H_1\) |
| 1/100–1/300 | Very strong evidence for \(H_1\) |
| < 1/300 | Decisive evidence for \(H_1\) |
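Two of these calibrations have simple closed forms that are well documented in Held and Ott [9,10]: the test-based bound \(\exp(-t^2/2)\) of Edwards et al. [5] and the p-based bound \(-e\,p\log p\) (for \(p < 1/e\)) of Vovk [22] and Sellke et al. [19]. A short R sketch:

```r
# Two closed-form MBF calibrations (Held and Ott [9,10]).
mbf_edwards <- function(t) exp(-t^2 / 2)       # test-based, Edwards et al. [5]
mbf_vs <- function(p) ifelse(p < 1 / exp(1),   # p-based, Vovk [22]/Sellke et al. [19]
                             -exp(1) * p * log(p), 1)

mbf_edwards(1.96)  # ~0.15: moderate evidence at the 5% significance boundary
mbf_vs(0.05)       # ~0.41: only weak evidence, although p = 0.05 is "significant"
```

At the conventional 5% boundary, both calibrations land only in the weak-to-moderate range of Table 2, which is the sense in which the p-value is said to overstate the evidence against the null.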
2.3. Plausibility-based belief function
Now, let us consider hypothesis testing under this method. As we mentioned before, we can measure the plausibility using a belief function. Denœux [4] justified constructing the belief function on the parameter \(\theta\) from the likelihood function. Thus, we can use the normal likelihood to quantify the plausibility of \(\beta\), whose value lies between zero and one, the same range as that of the p-value. The plausibility is given by the contour function
\[ pl(\theta) = \frac{L(\theta \mid y)}{L(\hat{\theta} \mid y)} \tag{11} \]
for any single hypothesis \(\theta\), where \(pl(\theta)\) is the relative likelihood and \(\hat{\theta}\) is the parameter value that maximizes the likelihood function. Clearly, the plausibility is rescaled to the interval [0, 1]. Under the normality assumption, \(\hat{\theta} = (\hat{\beta}, \hat{\sigma}^2)\) is the value that maximizes the likelihood function
\[ L(\beta, \sigma^2 \mid y) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - x_i^{\top}\beta)^2}{2\sigma^2} \right) \tag{12} \]
When we take the derivative of the log-likelihood with respect to \(\beta\) and set it to zero, we obtain
\[ \frac{\partial \log L(\beta, \sigma^2 \mid y)}{\partial \beta} = \frac{1}{\sigma^2} \sum_{i=1}^{N} x_i \big( y_i - x_i^{\top}\beta \big) = 0 \tag{13} \]
Consequently, together with the analogous first-order condition for \(\sigma^2\), we can estimate the parameters as
\[ \hat{\beta} = \left( \sum_{i=1}^{N} x_i x_i^{\top} \right)^{-1} \sum_{i=1}^{N} x_i y_i \tag{14} \]
\[ \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - x_i^{\top}\hat{\beta} \big)^2 \tag{15} \]
This method can be viewed as an extension of the minimum Bayes factor approach, as the plausibility is computed as a relative likelihood ratio. However, it transforms the parameter value \(\beta\) itself instead of the p-value or t-statistic, so the plausibility can be computed directly for any value of \(\beta\). Thus, it can be viewed as an alternative method to the p-value.
Example 2:
Let us consider the same setting as Example 1 and illustrate the calculation of \(pl(\beta)\). In this example, we simulate one data set with sample size N = 50. To generate the data, we set the seed of R's random number generator to 1, i.e. set.seed(1), and the estimated results are provided in Table 3 and Figure 2.
Table 3. Regression coefficients (based on one simulated dataset, N = 50).
| Parameter | True value | Estimate | Std. error | |t-statistic| | p-value | \(pl(\beta = 0)\) |
|---|---|---|---|---|---|---|
| \(\beta_1\) | 0 | −0.0203 | 0.1238 | 0.1640 | 0.8700 | 0.9860 |
| \(\beta_2\) | 1 | 1.0045 | 0.1324 | 7.5860 | 0.0000 | 0.0000 |
Figure 2.
Marginal contour functions for the parameters \(\beta_1\) and \(\beta_2\) (based on one simulated dataset, N = 50). The vertical red line marks the maximum likelihood estimate.
Table 3 provides the parameters estimated by the maximum likelihood estimator (MLE), together with the standard errors of the parameters (Column 4), absolute t-statistics (Column 5), p-values (Column 6), and plausibilities \(pl(\beta = 0)\) (Column 7). We can observe that the p-values and the plausibilities provide similar results: both the p-value and the plausibility of \(\beta_2\) are zero, indicating that \(\beta_2\) is significantly different from zero. Likewise, both methods give the same interpretation that the parameter \(\beta_1\) is insignificant, as both its p-value and its plausibility are higher than 0.01, 0.05, and 0.10. However, it is interesting that the p-value and the plausibility of \(\beta_1\) differ in degree: \(pl(\beta_1 = 0)\) is 0.9860, while the p-value is 0.8700. It can be said that the p-value states the amount of evidence for accepting \(H_0: \beta_1 = 0\) at roughly 0.88 times the level stated by the plausibility-based belief function. If \(H_0\) is true, we can say that the p-value underestimates the true probability. The comparison of the plausibility-based belief function and the p-value is further discussed in Section 3.
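A minimal R sketch of the contour function in Eq. (11) for one regression coefficient is given below; profiling the remaining coefficient and \(\sigma^2\) out of the likelihood is our choice here, and the paper's marginal contour functions may be defined somewhat differently, so the numbers need not reproduce Table 3 exactly:

```r
# Relative profile likelihood pl(beta_j = b) for a no-intercept linear model.
pl_beta <- function(y, X, j, b) {
  n <- length(y)
  # Restricted fit: fix the j-th coefficient at b and refit the others
  rss0 <- sum(lsfit(X[, -j, drop = FALSE], y - X[, j] * b,
                    intercept = FALSE)$residuals^2)
  rss1 <- sum(lsfit(X, y, intercept = FALSE)$residuals^2)   # unrestricted fit
  (rss0 / rss1)^(-n / 2)   # normal likelihood with sigma^2 profiled out
}

set.seed(1)
N  <- 50
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 0 * x1 + 1 * x2 + rnorm(N)
pl_beta(y, cbind(x1, x2), j = 1, b = 0)  # high: beta1 = 0 is plausible
pl_beta(y, cbind(x1, x2), j = 2, b = 0)  # near zero: beta2 = 0 is implausible
```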
3. Simulation experiment
Several alternative methods for making an inference have been introduced in this study. If we need a statistical test, which one is preferable, and how do we compare the different approaches? To answer these questions, in this section an experimental study is conducted using simulated data. For comparison, we consider cases where, after the tests, we can find out the truth; that is, we simulate data for which we already know the correct answer to the statistical test.
To further illustrate, we consider an experiment to make direct comparisons among the p-value, Bayes factor, and plausibility approaches in the linear regression context. We start with the following data-generating process:
\[ y_i = \beta_1 x_{1,i} + \beta_2 x_{2,i} + \varepsilon_i \tag{16} \]
where \(\beta_1 = 0\) and \(\beta_2 = 3\), so that there is only a significant effect of \(x_2\) on \(y\). \(x_{1,i}\), \(x_{2,i}\), and \(\varepsilon_i\) are generated from a normal distribution with mean zero and variance one. Six different sample sizes are simulated, consisting of N = 10, 20, 50, 100, 200, and 500, with 1,000 data sets simulated for each sample size. Simulations were generated using fixed random seeds to simplify replication. To compare the performance of each method, this study uses the percentage of incorrect inferences as the measure. For the p-value, we use the conventional statistical inference, in which a p-value equal to or lower than the thresholds 0.10, 0.05, and 0.01 leads to rejecting the null hypothesis. Likewise, the plausibility-based belief function is interpreted in the same way as the p-value. In the case of the minimum Bayes factor approach, on the other hand, the interpretation differs from the first two methods, as we make the decision upon the MBF following the labeled intervals of Held and Ott [9] presented in Table 2. Our interest is to see how often these methods reveal a non-significant outcome when the null is false and a significant outcome when the null is true. The results of the method comparison are provided in the following figures and tables.
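One decision step of this experiment might look as follows in R, reusing the hypothetical helpers pl_beta() and mbf_edwards() from the earlier sketches; counting only "decisive" MBF evidence (< 1/300) as a rejection is a simplification of the categorization described above:

```r
decide <- function(y, X, j) {
  fit <- summary(lm(y ~ X - 1))$coefficients
  c(p_reject   = fit[j, "Pr(>|t|)"] < 0.10,
    pl_reject  = pl_beta(y, X, j, b = 0) < 0.10,
    mbf_reject = mbf_edwards(fit[j, "t value"]) < 1 / 300)
}

set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 0 * x1 + 3 * x2 + rnorm(100)
decide(y, cbind(x1, x2), j = 1)  # H0 true: all entries should be FALSE
decide(y, cbind(x1, x2), j = 2)  # H0 false: all entries should be TRUE
```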
The p-value and the plausibility results are reported in Tables 4 and 5, respectively. These two approaches share a similar interpretation, in which values less than 0.10, 0.05, or 0.01 are said to be significant. Meanwhile, the MBF results, reported in Tables 6–9, are interpreted from the perspective of Table 2.
Table 4. The percentage of the incorrect inferences under p-value approach (results based on 1000 simulated data sets).
| p-value approach | | | | | | |
|---|---|---|---|---|---|---|
| | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500 |
| \(H_0: \beta_2 = 0\) is false | | | | | | |
| p-value < 0.01 | 99.7 | 100 | 100 | 100 | 100 | 100 |
| p-value < 0.05 | 99.9 | 100 | 100 | 100 | 100 | 100 |
| p-value < 0.10 | 100 | 100 | 100 | 100 | 100 | 100 |
| Incorrect inferences (%) (p-value > 0.10) | 0 | 0 | 0 | 0 | 0 | 0 |
| \(H_0: \beta_1 = 0\) is true | | | | | | |
| p-value < 0.01 | 2.1 | 1.9 | 2.2 | 1.1 | 1.3 | 1.2 |
| p-value < 0.05 | 8.7 | 7.7 | 5.6 | 5.2 | 4.5 | 5.7 |
| p-value < 0.10 | 13.5 | 13 | 10.2 | 9.5 | 10.3 | 10.3 |
| Incorrect inferences (%) (p-value < 0.10) | 13.5 | 13 | 10.2 | 9.5 | 10.3 | 10.3 |
Table 5. The percentage of the incorrect inferences under the Plausibility approach (results based on 1000 simulated data sets).
| Plausibility approach | | | | | | |
|---|---|---|---|---|---|---|
| | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500 |
| \(H_0: \beta_2 = 0\) is false | | | | | | |
| plausibility < 0.01 | 100 | 100 | 100 | 100 | 100 | 100 |
| plausibility < 0.05 | 100 | 100 | 100 | 100 | 100 | 100 |
| plausibility < 0.10 | 100 | 100 | 100 | 100 | 100 | 100 |
| Incorrect inferences (%) (plausibility > 0.10) | 0 | 0 | 0 | 0 | 0 | 0 |
| \(H_0: \beta_1 = 0\) is true | | | | | | |
| plausibility < 0.01 | 5.1 | 1.7 | 1.1 | 0.5 | 0.4 | 0.3 |
| plausibility < 0.05 | 11.5 | 5.3 | 3.4 | 1.9 | 1.8 | 2.3 |
| plausibility < 0.10 | 16.2 | 8.9 | 5.1 | 3.4 | 3 | 3.8 |
| Incorrect inferences (%) (plausibility < 0.10) | 16.2 | 8.9 | 5.1 | 3.4 | 3 | 3.8 |
Table 6. The percentage of the incorrect inferences under Minimum Bayes Factors of Goodman [6] (results based on 1000 simulated data sets).
| MBF | Interpretation | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500 |
|---|---|---|---|---|---|---|---|
| \(H_0: \beta_2 = 0\) is false | | | | | | | |
| 1–1/3 | Weak evidence for \(H_1\) | 0 | 0 | 0 | 0 | 0 | 0 |
| 1/3–1/10 | Moderate evidence for \(H_1\) | 0.1 | 0 | 0 | 0 | 0 | 0 |
| 1/10–1/30 | Substantial evidence for \(H_1\) | 0.2 | 0 | 0 | 0 | 0 | 0 |
| 1/30–1/100 | Strong evidence for \(H_1\) | 0.5 | 0 | 0 | 0 | 0 | 0 |
| 1/100–1/300 | Very strong evidence for \(H_1\) | 0.9 | 0 | 0 | 0 | 0 | 0 |
| < 1/300 | Decisive evidence for \(H_1\) | 98.3 | 100 | 100 | 100 | 100 | 100 |
| | Incorrect inferences (%) | 1.7 | 0 | 0 | 0 | 0 | 0 |
| \(H_0: \beta_1 = 0\) is true | | | | | | | |
| 1–1/3 | Weak evidence for \(H_1\) | 81.7 | 82.7 | 85.8 | 86.7 | 85.8 | 84.8 |
| 1/3–1/10 | Moderate evidence for \(H_1\) | 13.3 | 12 | 11.3 | 10.4 | 11.3 | 11.4 |
| 1/10–1/30 | Substantial evidence for \(H_1\) | 2.9 | 3.5 | 1.6 | 1.9 | 1.6 | 2.9 |
| 1/30–1/100 | Strong evidence for \(H_1\) | 1 | 1.4 | 1 | 0.6 | 1 | 0.6 |
| 1/100–1/300 | Very strong evidence for \(H_1\) | 0.3 | 0.2 | 0.2 | 0.3 | 0.2 | 0.1 |
| < 1/300 | Decisive evidence for \(H_1\) | 0.8 | 0.2 | 0.1 | 0.1 | 0.1 | 0.2 |
| | Incorrect inferences (%) | 18.3 | 17.3 | 14.2 | 13.3 | 14.2 | 15.2 |
Table 7. The percentage of the incorrect inferences under Minimum Bayes Factors of Edwards et al. [5] (results based on 1000 simulated data sets).
| MBF | Interpretation | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500 |
|---|---|---|---|---|---|---|---|
| \(H_0: \beta_2 = 0\) is false | | | | | | | |
| 1–1/3 | Weak evidence for \(H_1\) | 0 | 0 | 0 | 0 | 0 | 0 |
| 1/3–1/10 | Moderate evidence for \(H_1\) | 0.6 | 0 | 0 | 0 | 0 | 0 |
| 1/10–1/30 | Substantial evidence for \(H_1\) | 0.5 | 0 | 0 | 0 | 0 | 0 |
| 1/30–1/100 | Strong evidence for \(H_1\) | 0.8 | 0 | 0 | 0 | 0 | 0 |
| 1/100–1/300 | Very strong evidence for \(H_1\) | 1.0 | 0 | 0 | 0 | 0 | 0 |
| < 1/300 | Decisive evidence for \(H_1\) | 97.1 | 100 | 100 | 100 | 100 | 100 |
| | Incorrect inferences (%) | 2.9 | 0 | 0 | 0 | 0 | 0 |
| \(H_0: \beta_1 = 0\) is true | | | | | | | |
| 1–1/3 | Weak evidence for \(H_1\) | 95.2 | 95 | 95.6 | 97.1 | 97.2 | 96.7 |
| 1/3–1/10 | Moderate evidence for \(H_1\) | 2.9 | 4.0 | 3 | 2.1 | 1.7 | 2.7 |
| 1/10–1/30 | Substantial evidence for \(H_1\) | 1 | 0.7 | 1.2 | 0.4 | 0.8 | 0.4 |
| 1/30–1/100 | Strong evidence for \(H_1\) | 0.4 | 0.1 | 0.2 | 0.4 | 0.3 | 0.1 |
| 1/100–1/300 | Very strong evidence for \(H_1\) | 0.1 | 0.1 | 0 | 0 | 0 | 0.1 |
| < 1/300 | Decisive evidence for \(H_1\) | 0.4 | 0.1 | 0 | 0 | 0 | 0 |
| | Incorrect inferences (%) | 4.8 | 5 | 4.4 | 2.9 | 2.8 | 3.3 |
Table 8. The percentage of the incorrect inferences under Minimum Bayes Factors of Vovk [22] and Sellke et al. [19] (results based on 1000 simulated data sets).
| MBF | Interpretation | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500 |
|---|---|---|---|---|---|---|---|
| \(H_0: \beta_2 = 0\) is false | | | | | | | |
| 1–1/3 | Weak evidence for \(H_1\) | 0.1 | 0 | 0 | 0 | 0 | 0 |
| 1/3–1/10 | Moderate evidence for \(H_1\) | 0.3 | 0 | 0 | 0 | 0 | 0 |
| 1/10–1/30 | Substantial evidence for \(H_1\) | 0.5 | 0 | 0 | 0 | 0 | 0 |
| 1/30–1/100 | Strong evidence for \(H_1\) | 0.9 | 0 | 0 | 0 | 0 | 0 |
| 1/100–1/300 | Very strong evidence for \(H_1\) | 1.0 | 0 | 0 | 0 | 0 | 0 |
| < 1/300 | Decisive evidence for \(H_1\) | 97.2 | 100 | 100 | 100 | 100 | 100 |
| | Incorrect inferences (%) | 2.8 | 0 | 0 | 0 | 0 | 0 |
| \(H_0: \beta_1 = 0\) is true | | | | | | | |
| 1–1/3 | Weak evidence for \(H_1\) | 94.5 | 94 | 95.1 | 96.6 | 96.7 | 95.4 |
| 1/3–1/10 | Moderate evidence for \(H_1\) | 3.5 | 4.4 | 3.1 | 2.6 | 2.0 | 3.8 |
| 1/10–1/30 | Substantial evidence for \(H_1\) | 1.1 | 1.3 | 1.6 | 0.4 | 1.0 | 0.6 |
| 1/30–1/100 | Strong evidence for \(H_1\) | 0.3 | 0.1 | 0.2 | 0.4 | 0.3 | 0.1 |
| 1/100–1/300 | Very strong evidence for \(H_1\) | 0.1 | 0.1 | 0 | 0 | 0 | 0.1 |
| < 1/300 | Decisive evidence for \(H_1\) | 0.5 | 0.1 | 0 | 0 | 0 | 0 |
| | Incorrect inferences (%) | 5.5 | 6 | 4.4 | 3.4 | 3.3 | 4.6 |
Table 9. The percentage of the incorrect inferences under Minimum Bayes Factors of Sellke et al. [19] (results based on 1000 simulated data sets).
| MBF | Interpretation | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500 |
|---|---|---|---|---|---|---|---|
| \(H_0: \beta_2 = 0\) is false | | | | | | | |
| 1–1/3 | Weak evidence for \(H_1\) | 0 | 0 | 0 | 0 | 0 | 0 |
| 1/3–1/10 | Moderate evidence for \(H_1\) | 0.1 | 0 | 0 | 0 | 0 | 0 |
| 1/10–1/30 | Substantial evidence for \(H_1\) | 0 | 0 | 0 | 0 | 0 | 0 |
| 1/30–1/100 | Strong evidence for \(H_1\) | 0.7 | 0 | 0 | 0 | 0 | 0 |
| 1/100–1/300 | Very strong evidence for \(H_1\) | 0.5 | 0 | 0 | 0 | 0 | 0 |
| < 1/300 | Decisive evidence for \(H_1\) | 98.7 | 100 | 100 | 100 | 100 | 100 |
| | Incorrect inferences (%) | 1.3 | 0 | 0 | 0 | 0 | 0 |
| \(H_0: \beta_1 = 0\) is true | | | | | | | |
| 1–1/3 | Weak evidence for \(H_1\) | 82.6 | 83.4 | 86.5 | 87.4 | 86.4 | 85.9 |
| 1/3–1/10 | Moderate evidence for \(H_1\) | 11.6 | 10.5 | 8.6 | 9.2 | 10.2 | 9.5 |
| 1/10–1/30 | Substantial evidence for \(H_1\) | 3.4 | 4.1 | 2.4 | 1.9 | 2.0 | 2.9 |
| 1/30–1/100 | Strong evidence for \(H_1\) | 0.8 | 1.4 | 1.6 | 1.0 | 0.6 | 1.3 |
| 1/100–1/300 | Very strong evidence for \(H_1\) | 0.8 | 0.3 | 0.7 | 0.1 | 0.6 | 0.2 |
| < 1/300 | Decisive evidence for \(H_1\) | 0.8 | 0.3 | 0.2 | 0.4 | 0.2 | 0.2 |
| | Incorrect inferences (%) | 17.4 | 16.6 | 13.5 | 12.6 | 13.6 | 14.1 |
We provide the percentage of incorrect inferences to compare the three approaches. However, such a comparison is not straightforward, as the interpretations of significant results differ across methods. Hence, in this experiment, decisive evidence for \(H_1\) is considered an acceptable decision favoring the alternative hypothesis, while weak, moderate, substantial, strong, and very strong evidence are considered acceptable decisions favoring the null hypothesis \(H_0\). Likewise, the p-value and plausibility approaches use the cut-offs 0.10, 0.05, and 0.01 to make decisions about the null hypothesis: if the p-value or plausibility is less than 0.10, the null hypothesis is rejected; otherwise, it is accepted.
We begin our experiment with the case in which the alternative hypothesis \(H_1: \beta_2 \neq 0\) must be accepted, as we set \(\beta_2 = 3\) in Eq. (16). As we can observe in Tables 4–9, the plausibility calibration produces the lowest rate of incorrect inferences compared with the p-value and the four MBFs when N = 10; at the 0.10 criterion, the percentage of incorrect inferences of both the plausibility and p-value methods is 0%. However, if we consider the more restrictive decision points, the 0.05 and 0.01 criteria, we observe that the incorrect inferences of the plausibility method remain at 0%, thus favoring \(H_1\), whereas the p-value fails to reach the 0.05 and 0.01 criteria in 0.1% and 0.3% of the data sets, respectively. This indicates that there is a small chance that the p-value is misleading. Furthermore, using the 0.10 criterion, it is evident that the percentages of incorrect inferences of plausibility and p-value are both 0%, indicating the high reliability of the plausibility test when \(H_0\) must be rejected. In the case of the MBF, among the four different methods, we find that the MBF of Sellke et al. [19] performs best in this experiment, as its percentage of incorrect inferences is only 1.3%. Yet, this rate is still higher than those of the p-value and plausibility methods. However, in the samples with N > 10, all approaches provide the same interpretation, as the percentages of incorrect inferences are all 0%.
Then, suppose we consider the case in which the null hypothesis must be accepted, that is, the test of \(H_0: \beta_1 = 0\) with true \(\beta_1 = 0\). As we can see in Tables 4–9, heterogeneous results are obtained. To give a clearer picture, we summarize the percentages of incorrect inferences in Figure 3, where different lines indicate different methods. The results show that there is variability in the evidence of testing. First, it can be seen in the right panel of Figure 3 that the percentage of incorrect inferences of all methods tends to decrease as the sample size increases to N = 100, and remains roughly constant thereafter. Second, the Minimum Bayes Factor of Edwards et al. [5] produces the lowest rate of incorrect inferences.
Figure 3.
Summary percentage of incorrect inference results.
For a closer look at the behavior of the Minimum Bayes Factor of Edwards et al. [5] in Table 7, the percentage of data sets for which this method finds moderate, substantial, strong, very strong, or decisive evidence is always less than or equal to 5%. Meanwhile, for the p-value approach (Table 4), the percentage of incorrect inferences ranges between 9.5% and 13.5%. This indicates that the p-value states the amount of evidence against \(H_0\) at approximately 2.3–3.7 times (computed as the percentage of incorrect inferences of the p-value divided by that of the MBF) the level stated by the MBF of Edwards et al. [5]. In other words, the p-value exaggerates the statistical significance around 2–4 times as much as the MBF. Therefore, we can confidently argue that the conclusion derived from p-values is less accurate as a measure of the strength of evidence against \(H_0\).
Although the plausibility approach does not perform so well as to produce the lowest rate of incorrect inferences in this case, its rate of incorrectness is still lower than that of the p-value approach for all sample sizes except N = 10. Table 4 reveals that the percentage of incorrect inferences from the p-value favoring \(H_1\) decreases as the sample size grows. Overall, this indicates that the plausibility approach is more accurate for making decisions about the null hypothesis than the traditionally used p-value thresholds. Therefore, from an empirical or applied point of view, we could consider this alternative a useful tool for researchers to avoid false discovery claims.
In addition, we also plot boxplots displaying the full range of variation of the p-values, plausibilities, and MBFs, obtained from the same simulation results as Tables 4–9, in Figures 4–6, respectively. In all panels, the y-axis plots the probability values obtained from the different methods and sample sizes.
Figure 4.
The full range of variation (from min to max) of p-values.
Figure 5.
The full range of variation (from min to max) of plausibility (PL).
Figure 6.
The full range of variation (from min to max) of MBFs.
Consider first the case in which the null hypothesis must be rejected (the true \(\beta_2\) is 3); in other words, there is strong evidence favoring the alternative hypothesis. As shown in the left panels of Figures 4–6, there is small variation in the probability values for all methods. When the sample size is greater than 10, all methods show evidence supporting the alternative hypothesis. In the case of a small sample size, say N = 10, however, there are a number of times that the testing methods lead to misinterpretation. Among the 1,000 simulated data sets, the p-value favors \(H_0\) in one data set when using the 0.05 criterion, while the plausibility gives no evidence supporting the null hypothesis. For the four MBFs, similar results are shown, and the variation of the MBFs is also similar to those of the p-value and plausibility approaches, except for N = 10. This indicates that the power of any test depends on the sample size: if the sample size is large enough, the test will be more reliable, especially when the null hypothesis must be rejected. However, a larger sample size does not provide the same guarantee in the case where the null hypothesis must be accepted.
By using decisive evidence for \(H_1\) (MBF < 1/300) as the criterion, the number of times that the MBFs produce a value greater than this criterion is relatively high compared to the p-value and plausibility approaches (Figure 6). These results indicate that there could be a misinterpretation of the hypothesis test when the number of observations is low. However, if we use weak evidence for \(H_1\) (MBF between 1/3 and 1) as the criterion, there is no evidence that the MBF methods, except the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19], fall in the range of this criterion. This indicates that the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19] have only a small chance of providing a wrong interpretation. This corresponds to the results provided in Tables 7 and 8.
Furthermore, the variation of the p-values, plausibilities, and MBFs in the right panels of Figures 4–6 provides another view of the test of \(H_0: \beta_1 = 0\). Recall that this null hypothesis must be accepted, as it is correct: \(\beta_1 = 0\). We can see that the variation is relatively high compared to the case where the null hypothesis is incorrect (as reflected by the greater heights of the boxes). Under this test, we notice that the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19], as illustrated in panels (b) and (c) of Figure 6, show small variability, as the heights of their box plots are short. We also observe that the medians of these two MBFs exceed 0.9, indicating weak evidence for \(H_1\). However, there are some outliers located below the 1/300 threshold, indicating that there is a small chance that the MBF favors decisive evidence for \(H_1\). Then, let us consider the MBFs of Goodman [6] and Sellke et al. [19]. For all sample sizes, their median MBFs are around 0.8 and 0.9, respectively, so these two methods also tend to favor weak evidence for the alternative hypothesis. However, their low values are more prevalent, meaning a higher chance of favoring decisive evidence for \(H_1\). Therefore, we can conclude that the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19] perform well and provide more accurate testing in the case where the null hypothesis must be accepted. This corresponds to the results provided in Tables 7 and 8.
In a nutshell, these simulation results provide evidence of the high performance of the plausibility approach when the null hypothesis is correct and must be accepted. Meanwhile, the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19] provide more reliable results compared to all other methods when the null hypothesis is incorrect. Yet, there is no evidence of 100% correct inferences in this case, which indicates that decision making based on these approaches will be, more or less, arbitrary.
4. Illustrated example
Finally, we make a comparison among the MBFs, the plausibility, and the p-value using a real application on the impact of economic variables on the energy price in Spain. We use a dataset from the R package 'MSwM' [18] covering the price of energy in Spain (\(y\)) and six other economic variables, namely the oil price (\(x_1\)), the gas price (\(x_2\)), the coal price (\(x_3\)), the Dollar–Euro exchange rate (\(x_4\)), the Ibex 35 index divided by one thousand (\(x_5\)), and the daily demand of energy (\(x_6\)). The data were collected from the Spanish Market Operator of Energy (OMEL), the Bank of Spain, and the U.S. Energy Information Administration, covering January 1, 2002 to October 31, 2008. We consider the following linear regression model:
\[ y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \beta_3 x_{3,t} + \beta_4 x_{4,t} + \beta_5 x_{5,t} + \beta_6 x_{6,t} + \varepsilon_t \tag{17} \]
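A sketch of how this regression can be run in R is given below; the column names follow the energy data set documented in the MSwM package, and rescaling Ibex 35 by 1/1000 is our reading of the text:

```r
library(MSwM)
data(energy)                              # Spanish energy price data, 2002-2008
energy$Ibex35k <- energy$Ibex35 / 1000    # Ibex 35 index divided by one thousand
fit <- lm(Price ~ Oil + Gas + Coal + EurDol + Ibex35k + Demand, data = energy)
round(summary(fit)$coefficients, 4)       # estimates, std. errors, t-values, p-values
```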
The application results of the three statistical tests are provided in Table 10, where we list each covariate's estimated coefficient together with the p-value, the plausibility, and the four types of Minimum Bayes factor.
Table 10. Results from the linear regression model.
| Variable | Coefficient | p-value | Plausibility | Goodman | Edwards et al. | Vovk and Sellke et al. | Sellke et al. |
|---|---|---|---|---|---|---|---|
| Intercept (\(\beta_0\)) | −9.1253 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Oil price (\(\beta_1\)) | 0.0284 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Gas price (\(\beta_2\)) | 0.0430 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Coal price (\(\beta_3\)) | −0.0021 | 0.2800 | 0.0000 | 0.5575 | 0.9936 | 0.9686 | 0.6423 |
| Exchange rate (\(\beta_4\)) | 6.0403 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Ibex 35 (\(\beta_5\)) | −0.1590 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Demand (\(\beta_6\)) | 0.0089 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
Considering the tests of \(H_0: \beta_j = 0\) in Table 10, we find strong evidence that six out of the seven coefficients favor the alternative hypothesis, and all three approaches provide the same interpretation of these coefficients: the p-values are less than 0.01, corresponding to decisive evidence for the alternative hypothesis under all four MBF methods. However, there is a contradictory result in the interpretation of the coal-price coefficient \(\beta_3\), for which a different interpretation is made by each method. The p-value confirms an insignificant effect of the coal price on the energy price of Spain, but the plausibility gives a significant result, while all four MBF methods are categorized as weak evidence for the alternative hypothesis. These results seem to correspond to our simulation experiment in Section 3, which shows that the rate of incorrect inferences when the null hypothesis is incorrect is relatively high compared to the case when the null is correct. The results of this application suggest that researchers need to be careful in interpreting a statistical result and that various approaches should be used to cross-check one another.
5. Conclusion
In this paper, we highlight some of the misconceptions about the p-value, illustrate its performance using simulated experiments, and introduce two alternative approaches to the p-value, namely the plausibility and the Minimum Bayes factor (MBF), to find the evidence against a simple null hypothesis in the linear regression context. The MBF is an alternative to the p-value offered by the Bayesian approach, which relies solely on the observed sample to provide direct probability statements about the parameters of interest. The plausibility approach, in turn, can be viewed as an extension of the Minimum Bayes factor approach, as the plausibility is computed as a relative likelihood ratio. However, it transforms the value of the parameter itself instead of the p-value or t-statistic, so the plausibility is computed directly for any parameter value.
The values of the MBF and the plausibility lie in the same range as the p-value, and this fact facilitates the comparison. While the plausibility is given an interpretation similar to the p-value, with 0.10, 0.05, and 0.01 as the cut-offs or decision criteria, the MBF is interpreted following the labeled intervals of Goodman [6]. As a result, an MBF between 1 and 1/3 is considered weak evidence for \(H_1\), while 1/3–1/10 corresponds to moderate evidence, 1/10–1/30 to substantial evidence, 1/30–1/100 to strong evidence, 1/100–1/300 to very strong evidence, and < 1/300 to decisive evidence. To compare the three approaches, we conduct a simulation study examining the incorrect inferences of each approach.
Our results show that the plausibility approach is more accurate for making decisions about the null hypothesis than the traditionally used p-value when the null hypothesis is true and must be accepted. However, the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19] provide more reliable results compared to all other methods when the null hypothesis is false or must be rejected. Based on our results, there is no evidence of 100% correct inferences in this case, which indicates that decision making based on these approaches will be, more or less, arbitrary when the null hypothesis is incorrect. As we mention in the introduction, it is too dangerous to allow only binary decisions. Hence, decision making in favor of either hypothesis needs to consider the whole categorization of the MBF in order to avoid this overly strong inference. In addition, we could consider these alternatives useful tools for researchers to avoid false discovery claims based on the p-value.
Nevertheless, our discussion should not be taken as a statement that researchers and practitioners should avoid p-values entirely. Rather, we should investigate the misconceptions about the p-value and find alternative methods that have better statistical interpretations and properties. Finally, we note that research involves much more than the statistical interpretation stage, and researchers should interpret their results carefully. Instead of banning or rejecting the p-value all at once, we suggest considering all of these statistical tests to achieve reliable results. Furthermore, non-statistical evidence, such as theory and empirical context, should be considered in support of decision making. This will help us obtain more reliable results.
Acknowledgments
The authors would like to thank the four anonymous reviewers, the editor, and Prof. Hung T. Nguyen for their helpful comments and suggestions. The financial support of this work was provided by the Center of Excellence in Econometrics, Chiang Mai University.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1.Assaf A.G., and Tsionas M., Bayes factors vs. P-values. Tourism Manag. 67 (2018), pp. 17–31. doi: 10.1016/j.tourman.2017.11.011 [DOI] [Google Scholar]
- 2.Berkson J., Tests of significance considered as evidence. J. Am. Stat. Assoc. 37 (1942), pp. 325–335. doi: 10.1080/01621459.1942.10501760 [DOI] [Google Scholar]
- 3.Cohen J., The earth is round (p < .05). Am. Psychol. 49 (1994), pp. 997–1003. doi: 10.1037/0003-066X.49.12.997 [DOI] [Google Scholar]
- 4.Denœux T., Likelihood-based belief function: justification and some extensions to low-quality data. Int. J. Approx. Reason. (in press) (2014). [Google Scholar]
- 5.Edwards W., Lindman H., and Savage L.J., Bayesian statistical inference for psychological research. Psychol. Rev. 70(3) (1963), pp. 193–242. doi: 10.1037/h0044139 [DOI] [Google Scholar]
- 6.Goodman S.N., Toward evidence-based medical statistics. 1: The P value fallacy. Ann. Intern. Med. 130(12) (1999), pp. 995–1004. doi: 10.7326/0003-4819-130-12-199906150-00008 [DOI] [PubMed] [Google Scholar]
- 7.Goodman S., A dirty dozen: twelve p-value misconceptions, in Seminars in Hematology, WB Saunders, 2008, July. Vol. 45, No. 3, pp. 135–140. [DOI] [PubMed] [Google Scholar]
- 8.Held L., A nomogram for P values. BMC Med. Res. Methodol. 10(1) (2010), pp. 21. doi: 10.1186/1471-2288-10-21 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Held L., and Ott M., How the maximal evidence of p-values against point null hypotheses depends on sample size. Am. Stat. 70(4) (2016), pp. 335–341. doi: 10.1080/00031305.2016.1209128 [DOI] [Google Scholar]
- 10.Held L., and Ott M., On p-values and Bayes factors. Annu. Rev. Stat. Appl. 5 (2018), pp. 393–419. doi: 10.1146/annurev-statistics-031017-100307 [DOI] [Google Scholar]
- 11.Kanjanatarakul O., Denoeux T., and Sriboonchitta S., Prediction of future observations using belief functions: a likelihood-based approach. Int. J. Approx. Reason. 72 (2016), pp. 71–94. doi: 10.1016/j.ijar.2015.12.004 [DOI] [Google Scholar]
- 12.Kass R.E., and Raftery A.E., Bayes factors. J. Am. Stat. Assoc. 90(430) (1995), pp. 773–795. doi: 10.1080/01621459.1995.10476572 [DOI] [Google Scholar]
- 13.Kline R.B., Beyond Significance Testing: Statistics Reform in the Behavioral Sciences, 2nd ed. American Psychological Association, Washington, 2013. [Google Scholar]
- 14.Marden J.I., Asymptotic distribution of P values in Composite null models: Comment. J. Am. Stat. Assoc. 95(452) (2000), pp. 1164–1166. [Google Scholar]
- 15.Page R., and Satake E., Beyond P values and hypothesis testing: using the Minimum Bayes factor to Teach statistical inference in Undergraduate Introductory statistics Courses. J. Education Learning 6(4) (2017), pp. 254. doi: 10.5539/jel.v6n4p254 [DOI] [Google Scholar]
- 16.Rozeboom W.W., The fallacy of the null-hypothesis significance test. Psychol. Bull. 57(5) (1960), pp. 416–428. doi: 10.1037/h0042040 [DOI] [PubMed] [Google Scholar]
- 17.Rouder J.N., Speckman P.L., Sun D., Morey R.D., and Iverson G., Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 16(2) (2009), pp. 225–237. doi: 10.3758/PBR.16.2.225 [DOI] [PubMed] [Google Scholar]
- 18.Sanchez-Espigares J.A., and Lopez-Moreno A., MSwM: Fitting Markov-Switching Models, R package version 1.2, 2014. [Google Scholar]
- 19.Sellke T., Bayarri M.J., and Berger J.O., Calibration of p values for testing precise null hypotheses. Am. Stat. 55(1) (2001), pp. 62–71. doi: 10.1198/000313001300339950 [DOI] [Google Scholar]
- 20.Shafer G., Perspectives on the theory and practice of belief functions. Int. J. Approx. Reason. 4(5-6) (1990), pp. 323–362. doi: 10.1016/0888-613X(90)90012-Q [DOI] [Google Scholar]
- 21.Stern H.S., A test by any other name: P values, Bayes factors, and statistical inference. Multivariate. Behav. Res. 51(1) (2016), pp. 23–29. doi: 10.1080/00273171.2015.1099032 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Vovk V.G., A logic of probability, with application to the foundations of statistics. J R Stat Soc. Series B (Methodological) 55 (1993), pp. 317–351. doi: 10.1111/j.2517-6161.1993.tb01904.x [DOI] [Google Scholar]
- 23.Wasserstein R.L., and Lazar N.A., The ASA’s statement on p-values: context, process, and purpose. Am. Stat. 70(2) (2016), pp. 129–133. doi: 10.1080/00031305.2016.1154108 [DOI] [Google Scholar]
- 24.Wasserstein R.L., Schirm A.L., and Lazar N.A., Moving to a World Beyond "p < 0.05". Am. Stat. 73(sup1) (2019), pp. 1–19. doi: 10.1080/00031305.2019.1583913 [DOI] [Google Scholar]